diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 000000000..e69de29bb diff --git a/01-rstudio-intro.html b/01-rstudio-intro.html new file mode 100644 index 000000000..3729bdb12 --- /dev/null +++ b/01-rstudio-intro.html @@ -0,0 +1,1469 @@ + +R for Reproducible Scientific Analysis: Introduction to R and RStudio +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Introduction to R and RStudio

+

Last updated on 2023-10-26

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How to find your way around RStudio?
  • +
  • How to interact with R?
  • +
  • How to manage your environment?
  • +
  • How to install packages?
  • +
+
+
+
+
+
+

Objectives

+
  • Describe the purpose and use of each pane in the RStudio IDE
  • +
  • Locate buttons and options in the RStudio IDE
  • +
  • Define a variable
  • +
  • Assign data to a variable
  • +
  • Manage a workspace in an interactive R session
  • +
  • Use mathematical and comparison operators
  • +
  • Call functions
  • +
  • Manage packages
  • +
+
+
+
+
+

Motivation +

+

Science is a multi-step process: once you’ve designed an experiment +and collected data, the real fun begins! This lesson will teach you how +to start this process using R and RStudio. We will begin with raw data, +perform exploratory analyses, and learn how to plot results graphically. +This example starts with a dataset from gapminder.org containing population +information for many countries through time. Can you read the data into +R? Can you plot the population for Senegal? Can you calculate the +average income for countries on the continent of Asia? By the end of +these lessons you will be able to do things like plot the populations +for all of these countries in under a minute!

+

Before Starting The Workshop +

+

Please ensure you have the latest version of R and RStudio installed +on your machine. This is important, as some packages used in the +workshop may not install correctly (or at all) if R is not up to +date.

+

Introduction to RStudio +

+

Welcome to the R portion of the Software Carpentry workshop.

+

Throughout this lesson, we’re going to teach you some of the +fundamentals of the R language as well as some best practices for +organizing code for scientific projects that will make your life +easier.

+

We’ll be using RStudio: a free, open-source R Integrated Development +Environment (IDE). It provides a built-in editor, works on all platforms +(including on servers) and provides many advantages such as integration +with version control and project management.

+

Basic layout

+

When you first open RStudio, you will be greeted by three panels:

+
  • The interactive R console/Terminal (entire left)
  • +
  • Environment/History/Connections (tabbed in upper right)
  • +
  • Files/Plots/Packages/Help/Viewer (tabbed in lower right)
  • +
RStudio layout

Once you open files, such as R scripts, an editor panel will also +open in the top left.

+
RStudio layout with .R file open
+
+ +
+
+

R scripts +

+
+

Any commands that you write in the R console can be saved to a file +and re-run later. Files containing R code to be run in this way are +called R scripts. R scripts have .R at the end of their +names to let you know what they are.

+
+
+
+

Workflow within RStudio +

+

There are two main ways one can work within RStudio:

+
  1. Test and play within the interactive R console, then copy code into a +.R file to run later.
     • This works well when doing small tests and initially starting +off.
     • It quickly becomes laborious.
  2. Start writing in a .R file and use RStudio’s shortcut keys for the +Run command to push the current line, selected lines or modified lines +to the interactive R console.
     • This is a great way to start; all your code is saved for later.
     • You will be able to run the file you create from within RStudio or +using R’s source() function.
+
+ +
+
+

Tip: Running segments of your code +

+
+

RStudio offers you great flexibility in running code from within the +editor window. There are buttons, menu choices, and keyboard shortcuts. +To run the current line, you can

+
  1. click on the Run button above the editor panel, or
  2. select “Run Lines” from the “Code” menu, or
  3. hit Ctrl+Return in Windows or Linux or +⌘+Return on OS X. (This shortcut can also be seen +by hovering the mouse over the button.)

To run a block of code, select +it and then Run. If you have modified a line of code within +a block of code you have just run, there is no need to reselect the +section and Run; you can use the next button along, +Re-run the previous region. This will run the previous code +block including the modifications you have made.
+
+
+

Introduction to R +

+

Much of your time in R will be spent in the R interactive console. +This is where you will run all of your code, and can be a useful +environment to try out ideas before adding them to an R script file. +This console in RStudio is the same as the one you would get if you +typed in R in your command-line environment.

+

The first thing you will see in the R interactive session is a bunch +of information, followed by a “>” and a blinking cursor. In many ways +this is similar to the shell environment you learned about during the +shell lessons: it operates on the same idea of a “Read, evaluate, print +loop”: you type in commands, R tries to execute them, and then returns a +result.

+

Using R as a calculator +

+

The simplest thing you could do with R is to do arithmetic:

+
+

R +

+
+1 + 100
+
+
+

OUTPUT +

+
[1] 101
+
+

And R will print out the answer, with a preceding “[1]”. [1] is the +index of the first element of the line being printed in the console. For +more information on indexing vectors, see Episode +6: Subsetting Data.

+

If you type in an incomplete command, R will wait for you to complete +it. If you are familiar with Unix Shell’s bash, you may recognize +this
+behavior from bash.

+
+

R +

+
> 1 +
+
+
+

OUTPUT +

+
+
+
+

Any time you hit return and the R session shows a “+” instead of a +“>”, it means it’s waiting for you to complete the command. If you +want to cancel a command you can hit Esc and RStudio will +give you back the “>” prompt.

+
+
+ +
+
+

Tip: Canceling commands +

+
+

If you’re using R from the command line instead of from within +RStudio, you need to use Ctrl+C instead of +Esc to cancel the command. This applies to Mac users as +well!

+

Canceling a command isn’t only useful for killing incomplete +commands: you can also use it to tell R to stop running code (for +example if it’s taking much longer than you expect), or to get rid of +the code you’re currently writing.

+
+
+
+

When using R as a calculator, the order of operations is the same as +you would have learned back in school.

+

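From highest to lowest precedence (multiplication and division share the same level, as do addition and subtraction; operators on the same level are evaluated left to right):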

+
  • Parentheses: (, ) +
  • +
  • Exponents: ^ or ** +
  • +
  • Multiply: * +
  • +
  • Divide: / +
  • +
  • Add: + +
  • +
  • Subtract: - +
  • +
+

R +

+
+3 + 5 * 2
+
+
+

OUTPUT +

+
[1] 13
+
+

Use parentheses to group operations in order to force the order of +evaluation if it differs from the default, or to make clear what you +intend.

+
+

R +

+
+(3 + 5) * 2
+
+
+

OUTPUT +

+
[1] 16
+
+

This can get unwieldy when not needed, but clarifies your intentions. +Remember that others may later read your code.

+
+

R +

+
+(3 + (5 * (2 ^ 2))) # hard to read
+3 + 5 * 2 ^ 2       # clear, if you remember the rules
+3 + 5 * (2 ^ 2)     # if you forget some rules, this might help
+
+

The text after each line of code is called a “comment”. Anything that +follows after the hash (or octothorpe) symbol # is ignored +by R when it executes code.

+

Very small or very large numbers are printed in scientific notation:

+
+

R +

+
+2/10000
+
+
+

OUTPUT +

+
[1] 2e-04
+
+

The e is shorthand for “multiplied by 10^XX”. So +2e-4 is shorthand for 2 * 10^(-4).

+

You can write numbers in scientific notation too:

+
+

R +

+
+5e3  # Note the lack of minus here
+
+
+

OUTPUT +

+
[1] 5000
+
+

Mathematical functions +

+

R has many built-in mathematical functions. To call a function, we +type its name, followed by opening and closing parentheses. Functions +take arguments as inputs: anything we type inside the parentheses of a +function is considered an argument. Depending on the function, the +number of arguments can vary from none to multiple. For example:

+
+

R +

+
+getwd() #returns an absolute filepath
+
+

doesn’t require an argument, whereas for the next set of mathematical +functions we will need to supply the function a value in order to +compute the result.

+
+

R +

+
+sin(1)  # trigonometry functions
+
+
+

OUTPUT +

+
[1] 0.841471
+
+
+

R +

+
+log(1)  # natural logarithm
+
+
+

OUTPUT +

+
[1] 0
+
+
+

R +

+
+log10(10) # base-10 logarithm
+
+
+

OUTPUT +

+
[1] 1
+
+
+

R +

+
+exp(0.5) # e^(1/2)
+
+
+

OUTPUT +

+
[1] 1.648721
+
+
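The functions above each take a single argument. As a small sketch of a call with more than one argument, the example below uses round() and log(), both part of base R; the variable names rounded and log_base2 are made up for this illustration.

```r
# One argument: the number to round (digits defaults to 0)
round(3.14159)

# Two arguments: the second one is supplied by name
rounded <- round(3.14159, digits = 2)
rounded

# log() accepts an optional base argument; without it, the natural log is used
log_base2 <- log(8, base = 2)
log_base2
```

Naming an argument, as in digits = 2, makes the call easier to read than relying on the position of each value.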

Don’t worry about trying to remember every function in R. You can +look them up on Google, or if you can remember the start of the +function’s name, use the tab completion in RStudio.

+

This is one advantage that RStudio has over R on its own: it has +auto-completion abilities that allow you to more easily look up +functions, their arguments, and the values that they take.

+

Typing a ? before the name of a command will open the +help page for that command. When using RStudio, this will open the +‘Help’ pane; if using R in the terminal, the help page will open as plain text in the terminal or in your +browser, depending on your platform and settings. The help page will include a detailed description of the +command and how it works. Scrolling to the bottom of the help page will +usually show a collection of code examples which illustrate command +usage. We’ll go through an example later.

+

Comparing things +

+

We can also do comparisons in R:

+
+

R +

+
+1 == 1  # equality (note two equals signs, read as "is equal to")
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+

R +

+
+1 != 2  # inequality (read as "is not equal to")
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+

R +

+
+1 < 2  # less than
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+

R +

+
+1 <= 1  # less than or equal to
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+

R +

+
+1 > 0  # greater than
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+

R +

+
+1 >= -9 # greater than or equal to
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+
+ +
+
+

Tip: Comparing Numbers +

+
+

A word of warning about comparing numbers: you should never use +== to compare two numbers unless they are integers (a data +type which can specifically represent only whole numbers).

+

Computers may only represent decimal numbers with a certain degree of +precision, so two numbers which look the same when printed out by R, may +actually have different underlying representations and therefore be +different by a small margin of error (called Machine numeric +tolerance).

+

Instead you should use the all.equal function.

+

Further reading: http://floating-point-gui.de/
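As a minimal sketch of the warning above, the following compares two decimal numbers that look the same when printed. The variable names are chosen for this illustration only.

```r
a <- 0.1 + 0.2
b <- 0.3

# Exact comparison fails because the two values differ by a tiny rounding error
exactly_equal <- a == b

# all.equal() compares within a small tolerance; it returns TRUE or a
# character description of the difference, so we wrap it in isTRUE()
nearly_equal <- isTRUE(all.equal(a, b))

print(exactly_equal)  # FALSE
print(nearly_equal)   # TRUE
```

Wrapping the call in isTRUE() is a common pattern when the result is used as a condition, since all.equal() does not return FALSE when the values differ.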

+
+
+
+

Variables and assignment +

+

We can store values in variables using the assignment operator +<-, like this:

+
+

R +

+
+x <- 1/40
+
+

Notice that assignment does not print a value. Instead, we stored it +for later in something called a variable. +x now contains the value +0.025:

+
+

R +

+
+x
+
+
+

OUTPUT +

+
[1] 0.025
+
+

More precisely, the stored value is a decimal approximation +of this fraction called a floating point +number.

+

Look for the Environment tab in the top right panel of +RStudio, and you will see that x and its value have +appeared. Our variable x can be used in place of a number +in any calculation that expects a number:

+
+

R +

+
+log(x)
+
+
+

OUTPUT +

+
[1] -3.688879
+
+

Notice also that variables can be reassigned:

+
+

R +

+
+x <- 100
+
+

x used to contain the value 0.025 and now it has the +value 100.

+

Assignment values can contain the variable being assigned to:

+
+

R +

+
+x <- x + 1 #notice how RStudio updates its description of x on the top right tab
+y <- x * 2
+
+

The right hand side of the assignment can be any valid R expression. +The right hand side is fully evaluated before the assignment +occurs.

+

Variable names can contain letters, numbers, underscores and periods +but no spaces. They must start with a letter or a period followed by a +letter (they cannot start with a number nor an underscore). Variables +beginning with a period are hidden variables. Different people use +different conventions for long variable names. These include:

+
  • periods.between.words
  • +
  • underscores_between_words
  • +
  • camelCaseToSeparateWords
  • +

What you use is up to you, but be consistent.

+

It is also possible to use the = operator for +assignment:

+
+

R +

+
+x = 1/40
+
+

But this is much less common among R users. The most important thing +is to be consistent with the operator you use. There +are occasionally places where it is less confusing to use +<- than =, and it is the most common symbol +used in the community. So the recommendation is to use +<-.

+
+
+ +
+
+

Challenge 1 +

+
+

Which of the following are valid R variable names?

+
+

R +

+
min_height
+max.height
+_age
+.mass
+MaxLength
+min-length
+2widths
+celsius2kelvin
+
+
+
+
+
+
+ +
+
+

The following can be used as R variables:

+
+

R +

+
+min_height
+max.height
+MaxLength
+celsius2kelvin
+
+

The following creates a hidden variable:

+
+

R +

+
+.mass
+
+

The following cannot be used to create a variable:

+
+

R +

+
_age
+min-length
+2widths
+
+
+
+
+
+

Vectorization +

+

One final thing to be aware of is that R is vectorized, +meaning that variables and functions can have vectors as values. In +contrast to physics and mathematics, a vector in R describes an ordered set of +values that all have the same data type. For example:

+
+

R +

+
+1:5
+
+
+

OUTPUT +

+
[1] 1 2 3 4 5
+
+
+

R +

+
+2^(1:5)
+
+
+

OUTPUT +

+
[1]  2  4  8 16 32
+
+
+

R +

+
+x <- 1:5
+2^x
+
+
+

OUTPUT +

+
[1]  2  4  8 16 32
+
+

This is incredibly powerful; we will discuss this further in an +upcoming lesson.

+

Managing your environment +

+

There are a few useful commands you can use to interact with the R +session.

+

ls will list all of the variables and functions stored +in the global environment (your working R session):

+
+

R +

+
+ls()
+
+
+

OUTPUT +

+
[1] "x" "y"
+
+
+
+ +
+
+

Tip: hidden objects +

+
+

Like in the shell, ls will hide any variables or +functions starting with a “.” by default. To list all objects, type +ls(all.names=TRUE) instead.

+
+
+
+

Note here that we didn’t give any arguments to ls, but +we still needed to give the parentheses to tell R to call the +function.

+

If we type ls by itself, R prints a bunch of code +instead of a listing of objects.

+
+

R +

+
+ls
+
+
+

OUTPUT +

+
function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE, 
+    pattern, sorted = TRUE) 
+{
+    if (!missing(name)) {
+        pos <- tryCatch(name, error = function(e) e)
+        if (inherits(pos, "error")) {
+            name <- substitute(name)
+            if (!is.character(name)) 
+                name <- deparse(name)
+            warning(gettextf("%s converted to character string", 
+                sQuote(name)), domain = NA)
+            pos <- name
+        }
+    }
+    all.names <- .Internal(ls(envir, all.names, sorted))
+    if (!missing(pattern)) {
+        if ((ll <- length(grep("[", pattern, fixed = TRUE))) && 
+            ll != length(grep("]", pattern, fixed = TRUE))) {
+            if (pattern == "[") {
+                pattern <- "\\["
+                warning("replaced regular expression pattern '[' by  '\\\\['")
+            }
+            else if (length(grep("[^\\\\]\\[<-", pattern))) {
+                pattern <- sub("\\[<-", "\\\\\\[<-", pattern)
+                warning("replaced '[<-' by '\\\\[<-' in regular expression pattern")
+            }
+        }
+        grep(pattern, all.names, value = TRUE)
+    }
+    else all.names
+}
+<bytecode: 0x557b0600c360>
+<environment: namespace:base>
+
+

What’s going on here?

+

Like everything in R, ls is the name of an object, and +entering the name of an object by itself prints the contents of the +object. The object x that we created earlier contains 1, 2, +3, 4, 5:

+
+

R +

+
+x
+
+
+

OUTPUT +

+
[1] 1 2 3 4 5
+
+

The object ls contains the R code that makes the +ls function work! We’ll talk more about how functions work +and start writing our own later.

+

You can use rm to delete objects you no longer need:

+
+

R +

+
+rm(x)
+
+

If you have lots of things in your environment and want to delete all +of them, you can pass the results of ls to the +rm function:

+
+

R +

+
+rm(list = ls())
+
+

In this case we’ve combined the two. Like the order of operations, +anything inside the innermost parentheses is evaluated first, and so +on.

+

Here we’ve also specified that the results of ls +should be used for the list argument in rm. +When assigning values to arguments by name, you must use the += operator!

+

If instead we use <-, there will be unintended side +effects, or you may get an error message:

+
+

R +

+
+rm(list <- ls())
+
+
+

ERROR +

+
Error in rm(list <- ls()): ... must contain names or character strings
+
+
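To see the kind of side effect meant here, consider a function that does evaluate its argument, such as mean(). This is only a sketch; demo_values, avg_named and avg_assigned are hypothetical names chosen for the illustration.

```r
# `=` inside a call names the argument; this call does not create a variable called x
avg_named <- mean(x = c(1, 2, 3))

# `<-` inside a call performs an assignment first, then passes the value on.
# demo_values is a hypothetical name chosen for this illustration.
avg_assigned <- mean(demo_values <- c(1, 2, 3))

avg_named               # 2
avg_assigned            # 2
exists("demo_values")   # TRUE: the call left a new variable behind
```

Both calls return the same mean, but the second one quietly adds demo_values to your environment, which is rarely what you want.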
+
+ +
+
+

Tip: Warnings vs. Errors +

+
+

Pay attention when R does something unexpected! Errors, like above, +are thrown when R cannot proceed with a calculation. Warnings on the +other hand usually mean that the function has run, but it probably +hasn’t worked as expected.

+

In both cases, the message that R prints out usually gives you clues +about how to fix the problem.

+
+
+
+

R Packages +

+

It is possible to add functions to R by writing a package, or by +obtaining a package written by someone else. As of this writing, there +are over 10,000 packages available on CRAN (the Comprehensive R Archive +Network). R and RStudio have functionality for managing packages:

+
  • You can see what packages are installed by typing +installed.packages() +
  • +
  • You can install packages by typing +install.packages("packagename"), where +packagename is the package name, in quotes.
  • +
  • You can update installed packages by typing +update.packages() +
  • +
  • You can remove a package with +remove.packages("packagename") +
  • +
  • You can make a package available for use with +library(packagename) +
  • +

Packages can also be viewed, loaded, and detached in the Packages tab +of the lower right panel in RStudio. Clicking on this tab will display +all of the installed packages with a checkbox next to them. If the box +next to a package name is checked, the package is loaded and if it is +empty, the package is not loaded. Click an empty box to load that +package and click a checked box to detach that package.

+

Packages can be installed and updated from the Packages tab with the +Install and Update buttons at the top of the tab.
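The commands listed above can be combined. As one possible sketch, the code below checks which of a set of packages are not yet installed; the names wanted, installed and missing_pkgs are made up for this example, and the install step is left commented out so nothing is downloaded by accident.

```r
# Packages we would like to have available
wanted <- c("ggplot2", "plyr", "gapminder")

# installed.packages() returns a matrix whose row names are the package names
installed <- rownames(installed.packages())

# Keep only the packages that are not installed yet
missing_pkgs <- setdiff(wanted, installed)
missing_pkgs

# Uncomment to install whatever is missing:
# install.packages(missing_pkgs)
```

Checking first avoids reinstalling packages that are already on your machine.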

+
+
+ +
+
+

Challenge 2 +

+
+

What will be the value of each variable after each statement in the +following program?

+
+

R +

+
+mass <- 47.5
+age <- 122
+mass <- mass * 2.3
+age <- age - 20
+
+
+
+
+
+
+ +
+
+
+

R +

+
+mass <- 47.5
+
+

This will give a value of 47.5 for the variable mass

+
+

R +

+
+age <- 122
+
+

This will give a value of 122 for the variable age

+
+

R +

+
+mass <- mass * 2.3
+
+

This will multiply the existing value of 47.5 by 2.3 to give a new +value of 109.25 to the variable mass.

+
+

R +

+
+age <- age - 20
+
+

This will subtract 20 from the existing value of 122 to give a new +value of 102 to the variable age.

+
+
+
+
+
+
+ +
+
+

Challenge 3 +

+
+

Run the code from the previous challenge, and write a command to +compare mass to age. Is mass larger than age?

+
+
+
+
+
+ +
+
+

One way of answering this question in R is to use the +> operator to set up the following comparison:

+
+

R +

+
+mass > age
+
+
+

OUTPUT +

+
[1] TRUE
+
+

This should yield a boolean value of TRUE since 109.25 is greater +than 102.

+
+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

Clean up your working environment by deleting the mass and age +variables.

+
+
+
+
+
+ +
+
+

We can use the rm command to accomplish this task

+
+

R +

+
+rm(age, mass)
+
+
+
+
+
+
+
+ +
+
+

Challenge 5 +

+
+

Install the following packages: ggplot2, +plyr, gapminder

+
+
+
+
+
+ +
+
+

We can use the install.packages() command to install the +required packages.

+
+

R +

+
+install.packages("ggplot2")
+install.packages("plyr")
+install.packages("gapminder")
+
+

An alternate solution, to install multiple packages with a single +install.packages() command is:

+
+

R +

+
+install.packages(c("ggplot2", "plyr", "gapminder"))
+
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
  • Use RStudio to write and run R programs.
  • +
  • R has the usual arithmetic operators and mathematical +functions.
  • +
  • Use <- to assign values to variables.
  • +
  • Use ls() to list the variables in a program.
  • +
  • Use rm() to delete objects in a program.
  • +
  • Use install.packages() to install packages +(libraries).
  • +
+
+
+
+
+ + +
+
+
+ +
+
+ + diff --git a/02-project-intro.html b/02-project-intro.html new file mode 100644 index 000000000..3878b4fdb --- /dev/null +++ b/02-project-intro.html @@ -0,0 +1,821 @@ + +R for Reproducible Scientific Analysis: Project Management With RStudio +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Project Management With RStudio

+

Last updated on 2023-10-26

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I manage my projects in R?
  • +
+
+
+
+
+
+

Objectives

+
  • Create self-contained projects in RStudio
  • +
+
+
+
+
+

Introduction +

+

The scientific process is naturally incremental, and many projects +start life as random notes, some code, then a manuscript, and eventually +everything is a bit mixed together.

+ +

Most people tend to organize their projects like this:

+
Screenshot of file manager demonstrating bad project organisation

There are many reasons why we should ALWAYS avoid this:

+
  1. It is really hard to tell which version of your data is the original +and which is the modified;
  2. It gets really messy because it mixes files with various extensions +together;
  3. It probably takes you a lot of time to actually find things, and +relate the correct figures to the exact code that has been used to +generate them.

A good project layout will ultimately make your life easier:

+
  • It will help ensure the integrity of your data;
  • +
  • It makes it simpler to share your code with someone else (a +lab-mate, collaborator, or supervisor);
  • +
  • It allows you to easily upload your code with your manuscript +submission;
  • +
  • It makes it easier to pick the project back up after a break.
  • +

A possible solution +

+

Fortunately, there are tools and packages which can help you manage +your work effectively.

+

One of the most powerful and useful aspects of RStudio is its project +management functionality. We’ll be using this today to create a +self-contained, reproducible project.

+
+
+ +
+
+

Challenge 1: Creating a self-contained +project +

+
+

We’re going to create a new project in RStudio:

+
  1. Click the “File” menu button, then “New Project”.
  2. +
  3. Click “New Directory”.
  4. +
  5. Click “New Project”.
  6. +
  7. Type in the name of the directory to store your project, +e.g. “my_project”.
  8. +
  9. If available, select the checkbox for “Create a git +repository.”
  10. +
  11. Click the “Create Project” button.
  12. +
+
+
+

The simplest way to open an RStudio project once it has been created +is to click through your file system to get to the directory where it +was saved and double click on the .Rproj file. This will +open RStudio and start your R session in the same directory as the +.Rproj file. All your data, plots and scripts will now be +relative to the project directory. RStudio projects have the added +benefit of allowing you to open multiple projects at the same time each +open to its own project directory. This allows you to keep multiple +projects open without them interfering with each other.

+
+
+ +
+
+

Challenge 2: Opening an RStudio project +through the file system +

+
+
  1. Exit RStudio.
  2. +
  3. Navigate to the directory where you created a project in Challenge +1.
  4. +
  5. Double click on the .Rproj file in that directory.
  6. +
+
+
+

Best practices for project organization +

+

Although there is no “best” way to lay out a project, there are some +general principles to adhere to that will make project management +easier:

+
+

Treat data as read only

+

This is probably the most important goal of setting up a project. +Data are typically time consuming and/or expensive to collect. Working +with them interactively (e.g., in Excel) where they can be modified +means you are never sure of where the data came from, or how they have been +modified since collection. It is therefore a good idea to treat your +data as “read-only”.

+
+
+

Data Cleaning

+

In many cases your data will be “dirty”: it will need significant +preprocessing to get into a format R (or any other programming language) +will find useful. This task is sometimes called “data munging”. Storing +the cleaning scripts in a separate folder, and creating a second “read-only” +data folder to hold the “cleaned” data sets, can prevent confusion +between the two sets.

+
+
+

Treat generated output as disposable

+

Anything generated by your scripts should be treated as disposable: +it should all be able to be regenerated from your scripts.

+

There are lots of different ways to manage this output. Having an +output folder with different sub-directories for each separate analysis +makes it easier later, since many analyses are exploratory and don’t end +up being used in the final project, and some of the analyses get shared +between projects.

+
+
+ +
+
+

Tip: Good Enough Practices for Scientific +Computing +

+
+

Good +Enough Practices for Scientific Computing gives the following +recommendations for project organization:

+
  1. Put each project in its own directory, which is named after the +project.
  2. +
  3. Put text documents associated with the project in the +doc directory.
  4. +
  5. Put raw data and metadata in the data directory, and +files generated during cleanup and analysis in a results +directory.
  6. +
  7. Put source for the project’s scripts and programs in the +src directory, and programs brought in from elsewhere or +compiled locally in the bin directory.
  8. +
  9. Name all files to reflect their content or function.
  10. +
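The layout recommended above can be created from R itself. This is a minimal sketch rather than a required step: it builds the folders under a temporary directory (an assumption made so the example does not touch your real files), and the project name my_project is only an example.

```r
# Directory names taken from the recommendations above
project_dirs <- c("data", "doc", "results", "src", "bin")

# Assumption for this sketch: the project lives under a temporary directory.
# In practice you would use the path to your own project folder instead.
project_root <- file.path(tempdir(), "my_project")

# Create each sub-directory; recursive = TRUE also creates project_root itself
for (d in project_dirs) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

# List what was created
list.files(project_root)
```

Creating the folders with a script means the same layout can be reproduced for every new project.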
+
+
+
+
+

Separate function definition and application

+

One of the more effective ways to work with R is to start by writing +the code you want to run directly in a .R script, and then running the +selected lines (either using the keyboard shortcuts in RStudio or +clicking the “Run” button) in the interactive R console.

+

When your project is in its early stages, the initial .R script file +usually contains many lines of directly executed code. As it matures, +reusable chunks get pulled into their own functions. It’s a good idea to +keep these in two separate folders: one to store useful +functions that you’ll reuse across analyses and projects, and one to +store the analysis scripts.
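As a rough sketch of this separation, the example below writes a one-line function definition to its own file and then loads it with source() before using it. The file name functions.R, the temporary location, and the function celsius_to_kelvin() are all assumptions made for the illustration.

```r
# Assumption: a file that holds only function definitions.
# A temporary directory is used here so the sketch leaves your project untouched.
functions_file <- file.path(tempdir(), "functions.R")
writeLines(
  "celsius_to_kelvin <- function(temp_c) temp_c + 273.15",
  functions_file
)

# An analysis script would start by loading the definitions...
source(functions_file)

# ...and then apply them
kelvin <- celsius_to_kelvin(25)
kelvin
```

In a real project the definitions file would be written by hand and kept in its own folder, with analysis scripts calling source() on it.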

+
+
+

Save the data in the data directory

+

Now that we have a good directory structure, we will save the +data file in the data/ directory.

+
+
+ +
+
+

Challenge 3 +

+
+

Download the gapminder data from here.

+
  1. Download the file (right mouse click on the link above -> “Save +link as” / “Save file as”, or click on the link and after the page +loads, press Ctrl+S or choose File -> “Save +page as”)
  2. +
  3. Make sure it’s saved under the name +gapminder_data.csv +
  4. +
  5. Save the file in the data/ folder within your +project.
  6. +

We will load and inspect these data later.

+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

It is useful to get some general idea about the dataset, directly +from the command line, before loading it into R. Understanding the +dataset better will come in handy when making decisions on how to load +it in R. Use the command-line shell to answer the following +questions:

+
  1. What is the size of the file?
  2. +
  3. How many rows of data does it contain?
  4. +
  5. What kinds of values are stored in this file?
  6. +
+
+
+
+
+ +
+
+

By running these commands in the shell:

+
+

SH +

+
ls -lh data/gapminder_data.csv
+
+
+

OUTPUT +

+
-rw-r--r-- 1 runner docker 80K Oct 26 09:54 data/gapminder_data.csv
+
+

The file size is 80K.

+
+

SH +

+
wc -l data/gapminder_data.csv
+
+
+

OUTPUT +

+
1705 data/gapminder_data.csv
+
+

There are 1705 lines. The data looks like:

+
+

SH +

+
head data/gapminder_data.csv
+
+
+

OUTPUT +

+
country,year,pop,continent,lifeExp,gdpPercap
+Afghanistan,1952,8425333,Asia,28.801,779.4453145
+Afghanistan,1957,9240934,Asia,30.332,820.8530296
+Afghanistan,1962,10267083,Asia,31.997,853.10071
+Afghanistan,1967,11537966,Asia,34.02,836.1971382
+Afghanistan,1972,13079460,Asia,36.088,739.9811058
+Afghanistan,1977,14880372,Asia,38.438,786.11336
+Afghanistan,1982,12881816,Asia,39.854,978.0114388
+Afghanistan,1987,13867957,Asia,40.822,852.3959448
+Afghanistan,1992,16317921,Asia,41.674,649.3413952
+
+
+
+
+
+
+
+ +
+
+

Tip: command line in RStudio +

+
+

The Terminal tab in the console pane provides a convenient place +within RStudio to interact directly with the command line.

+
+
+
+
+
+

Working directory

+

Knowing R’s current working directory is important because when you +need to access other files (for example, to import a data file), R will +look for them relative to the current working directory.

+

Each time you create a new RStudio Project, it will create a new +directory for that project. When you open an existing +.Rproj file, it will open that project and set R’s working +directory to the folder that file is in.

+
+
+ +
+
+

Challenge 5 +

+
+

You can check the current working directory with the +getwd() command, or by using the menus in RStudio.

+
  1. In the console, type getwd() (“wd” is short for +“working directory”) and hit Enter.
  2. +
  3. In the Files pane, double click on the data folder to +open it (or navigate to any other folder you wish). To get the Files +pane back to the current working directory, click “More” and then select +“Go To Working Directory”.
  4. +

You can change the working directory with setwd(), or by +using RStudio menus.

+
  1. In the console, type setwd("data") and hit Enter. Type +getwd() and hit Enter to see the new working +directory.
  2. +
  3. In the menus at the top of the RStudio window, click the “Session” +menu button, and then select “Set Working Directory” and then “Choose +Directory”. Next, in the file browser window that opens, navigate back to +the project directory, and click “Open”. Note that a setwd +command will automatically appear in the console.
  4. +
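The console steps above can also be written so that the original working directory is restored afterwards. This is a small sketch that switches to a temporary directory (chosen here only so the example runs anywhere); start_wd and old_wd are names made up for the illustration.

```r
# Remember where we started
start_wd <- getwd()

# setwd() invisibly returns the previous working directory, so we can keep it
old_wd <- setwd(tempdir())

# We are now somewhere else
getwd()

# Go back to where we were
setwd(old_wd)
getwd()
```

Restoring the directory keeps later relative paths, such as data/gapminder_data.csv, pointing at the project folder.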
+
+
+
+
+ +
+
+

Tip: File does not exist errors +

+
+

When you’re attempting to reference a file in your R code and you’re +getting errors saying the file doesn’t exist, it’s a good idea to check +your working directory. You need to either provide an absolute path to +the file, or you need to make sure the file is saved in the working +directory (or a subfolder of the working directory) and provide a +relative path.
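One way to check this before reading a file is sketched below. It assumes the file from Challenge 3 is expected at data/gapminder_data.csv relative to the working directory; data_path and found are names chosen for this example.

```r
# Build the relative path in a way that works on every operating system
data_path <- file.path("data", "gapminder_data.csv")

# file.exists() looks for the path relative to the current working directory
found <- file.exists(data_path)

if (found) {
  message("Found: ", normalizePath(data_path))
} else {
  message("Cannot find ", data_path, " relative to ", getwd())
}
```

If the file is not found, the message shows which directory R searched from, which is usually enough to spot a wrong working directory.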

+
+
+
+
+
+

Version Control

+

It is important to use version control with projects. Go here +for a good lesson which describes using Git with RStudio.

+
+
+ +
+
+

Keypoints +

+
+
  • Use RStudio to create and manage projects with consistent +layout.
  • +
  • Treat raw data as read-only.
  • +
  • Treat generated output as disposable.
  • +
  • Separate function definition and application.
  • +
+
+
+
+
+
+ + +
+
+
+ +
+
+ + diff --git a/03-seeking-help.html b/03-seeking-help.html new file mode 100644 index 000000000..3e5fb236e --- /dev/null +++ b/03-seeking-help.html @@ -0,0 +1,860 @@ + +R for Reproducible Scientific Analysis: Seeking Help +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Seeking Help

+

Last updated on 2023-10-26

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I get help in R?
  • +
+
+
+
+
+
+

Objectives

+
  • To be able to read R help files for functions and special +operators.
  • +
  • To be able to use CRAN task views to identify packages to solve a +problem.
  • +
  • To be able to seek help from your peers.
  • +
+
+
+
+
+

Reading Help Files +

+

R and every package provide help files for functions. The general syntax to look up help for a function, “function_name”, that is in a package loaded into your namespace (your interactive R session) is:

+
+

R +

+
+?function_name
+help(function_name)
+
+

For example, take a look at the help file for write.table(); we will be using a similar function in an upcoming episode.

+
+

R +

+
+?write.table()
+
+

This will load up a help page in RStudio (or as plain text in R +itself).

+

Each help page is broken down into sections:

+
  • Description: An extended description of what the function does.
  • +
  • Usage: The arguments of the function and their default values (which +can be changed).
  • +
  • Arguments: An explanation of the data each argument is +expecting.
  • +
  • Details: Any important details to be aware of.
  • +
  • Value: The data the function returns.
  • +
  • See Also: Any related functions you might find useful.
  • +
  • Examples: Some examples for how to use the function.
  • +

Different functions might have different sections, but these are the +main ones you should be aware of.

+

Notice how related functions might call for the same help file:

+
+

R +

+
+?write.table()
+?write.csv()
+
+

This is because these functions have very similar applicability and +often share the same arguments as inputs to the function, so package +authors often choose to document them together in a single help +file.

+
+
+ +
+
+

Tip: Running Examples +

+
+

From within the function help page, you can highlight code in the +Examples and hit Ctrl+Return to run it in RStudio +console. This gives you a quick way to get a feel for how a function +works.

+
+
+
+
+
+ +
+
+

Tip: Reading Help Files +

+
+

One of the most daunting aspects of R is the large number of +functions available. It would be prohibitive, if not impossible to +remember the correct usage for every function you use. Luckily, using +the help files means you don’t have to remember that!

+
+
+
+

Special Operators +

+

To seek help on special operators, use quotes or backticks:

+
+

R +

+
+?"<-"
+?`<-`
+
+

Getting Help with Packages +

+

Many packages come with “vignettes”: tutorials and extended example +documentation. Without any arguments, vignette() will list +all vignettes for all installed packages; +vignette(package="package-name") will list all available +vignettes for package-name, and +vignette("vignette-name") will open the specified +vignette.

+

If a package doesn’t have any vignettes, you can usually find help by +typing help("package-name").
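As a small sketch of the vignette lookup described above, the code below lists the vignettes of one package. The `grid` package is only an assumed example (it normally ships with R); any installed package name can be used instead.

```r
# List the vignettes shipped with one package (grid is an assumed example)
grid_vignettes <- vignette(package = "grid")

# The result stores one row per vignette; the "Item" column holds the names
# that can be passed to vignette("vignette-name")
grid_vignettes$results[, "Item"]
```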

+

RStudio also has a set of excellent cheatsheets for +many packages.

+

When You Remember Part of the Function Name +

+

If you’re not sure what package a function is in or how it’s +specifically spelled, you can do a fuzzy search:

+
+

R +

+
+??function_name
+
+

A fuzzy search is when you search for an approximate string match. +For example, you may remember that the function to set your working +directory includes “set” in its name. You can do a fuzzy search to help +you identify the function:

+
+

R +

+
+??set
+
+
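A related helper, not used elsewhere in this lesson, is `apropos()`, which searches the names of available objects rather than the help pages. The sketch below uses the pattern `"^setwd"` purely as an illustration.

```r
# ??set is shorthand for help.search("set"), which searches the help pages.
# apropos() instead searches the names of objects that are currently available.
set_functions <- apropos("^setwd")
set_functions
```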

When You Have No Idea Where to Begin +

+

If you don’t know what function or package you need to use, CRAN Task Views is a specially maintained list of packages grouped into fields. This can be a good starting point.

+

When Your Code Doesn’t Work: Seeking Help from Your Peers +

+

If you’re having trouble using a function, 9 times out of 10 your question has already been answered on Stack Overflow. You can search using the [r] tag. Please make sure to see their page on how to ask a good question.

+

If you can’t find the answer, there are a few useful functions to +help you ask your peers:

+
+

R +

+
+?dput
+
+

dput() will dump the data you’re working with into a format that can be copied and pasted by others into their own R session.
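To make this concrete, here is a minimal sketch using a small made-up data frame (`cats_example` is a hypothetical stand-in for your own data). It prints the text form and then shows that evaluating that text rebuilds the object.

```r
# A small, made-up data frame standing in for your own data
cats_example <- data.frame(coat = c("calico", "black"),
                           weight = c(2.1, 5.0))

# Print a text representation that can be pasted into a question
dput(cats_example)

# Anyone can rebuild the object by evaluating that text
dput_text <- paste(capture.output(dput(cats_example)), collapse = "\n")
rebuilt <- eval(parse(text = dput_text))
all.equal(rebuilt, cats_example)
```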

+
+

R +

+
+sessionInfo()
+
+
+

OUTPUT +

+
R version 4.3.1 (2023-06-16)
+Platform: x86_64-pc-linux-gnu (64-bit)
+Running under: Ubuntu 22.04.3 LTS
+
+Matrix products: default
+BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
+LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
+
+locale:
+ [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
+ [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
+ [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
+[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
+
+time zone: UTC
+tzcode source: system (glibc)
+
+attached base packages:
+[1] stats     graphics  grDevices utils     datasets  methods   base     
+
+loaded via a namespace (and not attached):
+[1] compiler_4.3.1    tools_4.3.1       rstudioapi_0.15.0 yaml_2.3.7       
+[5] knitr_1.43        xfun_0.40         renv_1.0.3        evaluate_0.21    
+
+

sessionInfo() will print out your current version of R, as well as any packages you have loaded. This can be useful for others to help reproduce and debug your issue.

+
+
+ +
+
+

Challenge 1 +

+
+

Look at the help page for the c function. What kind of +vector do you expect will be created if you evaluate the following:

+
+

R +

+
+c(1, 2, 3)
+c('d', 'e', 'f')
+c(1, 2, 'f')
+
+
+
+
+
+
+ +
+
+

The c() function creates a vector, in which all elements +are of the same type. In the first case, the elements are numeric, in +the second, they are characters, and in the third they are also +characters: the numeric values are “coerced” to be characters.

+
+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

Look at the help for the paste function. You will need +to use it later. What’s the difference between the sep and +collapse arguments?

+
+
+
+
+
+ +
+
+

To look at the help for the paste() function, use:

+
+

R +

+
+help("paste")
+?paste
+
+

The difference between sep and collapse is +a little tricky. The paste function accepts any number of +arguments, each of which can be a vector of any length. The +sep argument specifies the string used between concatenated +terms — by default, a space. The result is a vector as long as the +longest argument supplied to paste. In contrast, +collapse specifies that after concatenation the elements +are collapsed together using the given separator, the result +being a single string.

+

It is important to call the arguments explicitly by typing out the argument name, e.g. sep = ",", so the function understands to use the “,” as a separator and not a term to concatenate. For example:

+
+

R +

+
+paste(c("a","b"), "c")
+
+
+

OUTPUT +

+
[1] "a c" "b c"
+
+
+

R +

+
+paste(c("a","b"), "c", ",")
+
+
+

OUTPUT +

+
[1] "a c ," "b c ,"
+
+
+

R +

+
+paste(c("a","b"), "c", sep = ",")
+
+
+

OUTPUT +

+
[1] "a,c" "b,c"
+
+
+

R +

+
+paste(c("a","b"), "c", collapse = "|")
+
+
+

OUTPUT +

+
[1] "a c|b c"
+
+
+

R +

+
+paste(c("a","b"), "c", sep = ",", collapse = "|")
+
+
+

OUTPUT +

+
[1] "a,c|b,c"
+
+

(For more information, scroll to the bottom of the +?paste help page and look at the examples, or try +example('paste').)

+
+
+
+
+
+
+ +
+
+

Challenge 3 +

+
+

Use help to find a function (and its associated parameters) that you +could use to load data from a tabular file in which columns are +delimited with “\t” (tab) and the decimal point is a “.” (period). This +check for decimal separator is important, especially if you are working +with international colleagues, because different countries have +different conventions for the decimal point (i.e. comma vs period). +Hint: use ??"read table" to look up functions related to +reading in tabular data.

+
+
+
+
+
+ +
+
+

The standard R function for reading tab-delimited files with a period +decimal separator is read.delim(). You can also do this with +read.table(file, sep="\t") (the period is the +default decimal separator for read.table()), +although you may have to change the comment.char argument +as well if your data file contains hash (#) characters.
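The two approaches can be compared on a small temporary file, as sketched below. The file contents are made up for illustration. Note that `read.table()` also needs `header = TRUE` here, because unlike `read.delim()` it does not assume a header row by default.

```r
# Write a tiny tab-delimited file to a temporary location (made-up data)
tab_file <- tempfile(fileext = ".tsv")
writeLines(c("coat\tweight", "calico\t2.1", "black\t5.0"), tab_file)

# read.delim() assumes tabs, a header row and "." as the decimal separator
from_delim <- read.delim(tab_file)

# The same settings spelled out for read.table()
from_table <- read.table(tab_file, sep = "\t", header = TRUE, dec = ".")

all.equal(from_delim, from_table)
```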

+
+
+
+
+

Other Resources +

+
+
+ +
+
+

Keypoints +

+
+
  • Use help() to get online help in R.
  • +
+
+
+
+
+ + +
+
+
+ +
+
+ + diff --git a/04-data-structures-part1.html b/04-data-structures-part1.html new file mode 100644 index 000000000..680e3e144 --- /dev/null +++ b/04-data-structures-part1.html @@ -0,0 +1,2396 @@ + +R for Reproducible Scientific Analysis: Data Structures +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Data Structures

+

Last updated on 2023-10-26

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I read data in R?
  • +
  • What are the basic data types in R?
  • +
  • How do I represent categorical information in R?
  • +
+
+
+
+
+
+

Objectives

+
  • To be able to identify the 5 main data types.
  • +
  • To begin exploring data frames, and understand how they are related +to vectors and lists.
  • +
  • To be able to ask questions from R about the type, class, and +structure of an object.
  • +
  • To understand the information of the attributes “names”, “class”, +and “dim”.
  • +
+
+
+
+
+

One of R’s most powerful features is its ability to deal with tabular +data - such as you may already have in a spreadsheet or a CSV file. +Let’s start by making a toy dataset in your data/ +directory, called feline-data.csv:

+
+

R +

+
+cats <- data.frame(coat = c("calico", "black", "tabby"),
+                    weight = c(2.1, 5.0, 3.2),
+                    likes_string = c(1, 0, 1))
+
+

We can now save cats as a CSV file. It is good practice +to call the argument names explicitly so the function knows what default +values you are changing. Here we are setting +row.names = FALSE. Recall you can use +?write.csv to pull up the help file to check out the +argument names and their default values.

+
+

R +

+
+write.csv(x = cats, file = "data/feline-data.csv", row.names = FALSE)
+
+

The contents of the new file, feline-data.csv:

+
+

R +

+
coat,weight,likes_string
+calico,2.1,1
+black,5.0,0
+tabby,3.2,1
+
+
+
+ +
+
+

Tip: Editing Text files in R +

+
+

Alternatively, you can create data/feline-data.csv using +a text editor (Nano), or within RStudio with the File -> New +File -> Text File menu item.

+
+
+
+

We can load this into R via the following:

+
+

R +

+
+cats <- read.csv(file = "data/feline-data.csv")
+cats
+
+
+

OUTPUT +

+
    coat weight likes_string
+1 calico    2.1            1
+2  black    5.0            0
+3  tabby    3.2            1
+
+

The read.table function is used for reading in tabular data stored in a text file where the columns of data are separated by punctuation characters, as in CSV files (csv = comma-separated values). Tabs and commas are the most common punctuation characters used to separate or delimit data points in such files. For convenience, R provides 2 other versions of read.table. These are: read.csv for files where the data are separated with commas and read.delim for files where the data are separated with tabs. Of these three functions, read.csv is the most commonly used. If needed, it is possible to override the default delimiting punctuation marks for both read.csv and read.delim.
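Overriding the delimiter can look like the sketch below, which reads a semicolon-separated file with `read.csv`. The temporary file and its contents are assumptions made up for this example.

```r
# Write a small semicolon-separated file to a temporary location (made-up data)
semi_file <- tempfile(fileext = ".csv")
writeLines(c("coat;weight", "calico;2.1", "black;5.0"), semi_file)

# Override the default comma delimiter with the sep argument
cats_semi <- read.csv(file = semi_file, sep = ";")
str(cats_semi)
```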

+
+
+ +
+
+

Check your data for factors +

+
+

In recent times, the default way R handles textual data has changed. Previously, text data was automatically interpreted by R as a format called “factors”. But there is an easier format called “character”. We will hear about factors later, and what to use them for. For now, remember that in most cases they are not needed and only complicate your life, which is why newer R versions read in text as “character”. Check now if your version of R has automatically created factors and convert them to “character” format:

+
  1. Check the data types of your input by typing +str(cats) +
  2. +
  3. In the output, look at the type abbreviations after the colons: If you see only “num”, “int” and “chr”, you can continue with the lesson and skip this box. If you find “Factor”, continue to step 3.
  4. +
  5. Prevent R from automatically creating “factor” data. That can be +done by the following code: +options(stringsAsFactors = FALSE). Then, re-read the cats +table for the change to take effect.
  6. +
  7. You must set this option every time you restart R. To not forget +this, include it in your analysis script before you read in any data, +for example in one of the first lines.
  8. +
  9. From R version 4.0.0 onwards, text data is no longer converted to factors by default. So you can install this or a newer version to avoid this problem. If you are working on an institute or company computer, ask your administrator to do it.
  10. +
+
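If a column has already been read in as a factor, it can also be converted after the fact. The sketch below builds a small made-up data frame with a factor column (forced via `stringsAsFactors = TRUE`, so it behaves the same on any R version) and converts it with `as.character()`.

```r
# Made-up data frame whose text column is deliberately created as a factor
cats_factor <- data.frame(coat = c("calico", "black", "tabby"),
                          stringsAsFactors = TRUE)
str(cats_factor)

# Convert the factor column back to plain character data
cats_factor$coat <- as.character(cats_factor$coat)
str(cats_factor)
```

The same `stringsAsFactors` argument can be passed to `read.csv()` to control the behaviour for a single call instead of changing the global option.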
+
+

We can begin exploring our dataset right away, pulling out columns by +specifying them using the $ operator:

+
+

R +

+
+cats$weight
+
+
+

OUTPUT +

+
[1] 2.1 5.0 3.2
+
+
+

R +

+
+cats$coat
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+

We can do other operations on the columns:

+
+

R +

+
+## Say we discovered that the scale weighs two kg light:
+cats$weight + 2
+
+
+

OUTPUT +

+
[1] 4.1 7.0 5.2
+
+
+

R +

+
+paste("My cat is", cats$coat)
+
+
+

OUTPUT +

+
[1] "My cat is calico" "My cat is black"  "My cat is tabby" 
+
+

But what about

+
+

R +

+
+cats$weight + cats$coat
+
+
+

ERROR +

+
Error in cats$weight + cats$coat: non-numeric argument to binary operator
+
+

Understanding what happened here is key to successfully analyzing +data in R.

+
+

Data Types

+

If you guessed that the last command will return an error because +2.1 plus "black" is nonsense, you’re right - +and you already have some intuition for an important concept in +programming called data types. We can ask what type of data +something is:

+
+

R +

+
+typeof(cats$weight)
+
+
+

OUTPUT +

+
[1] "double"
+
+

There are 5 main types: double, integer, +complex, logical and character. +For historic reasons, double is also called +numeric.

+
+

R +

+
+typeof(3.14)
+
+
+

OUTPUT +

+
[1] "double"
+
+
+

R +

+
+typeof(1L) # The L suffix forces the number to be an integer, since by default R stores numbers as doubles
+
+
+

OUTPUT +

+
[1] "integer"
+
+
+

R +

+
+typeof(1+1i)
+
+
+

OUTPUT +

+
[1] "complex"
+
+
+

R +

+
+typeof(TRUE)
+
+
+

OUTPUT +

+
[1] "logical"
+
+
+

R +

+
+typeof('banana')
+
+
+

OUTPUT +

+
[1] "character"
+
+

No matter how complicated our analyses become, all data in R is +interpreted as one of these basic data types. This strictness has some +really important consequences.
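The remark that double is also called numeric can be checked directly. The short sketch below compares `typeof()` and `class()` for a double and an integer; the variable names are made up for illustration.

```r
x_double <- 3.14
x_integer <- 1L

typeof(x_double)    # "double"
class(x_double)     # "numeric"

typeof(x_integer)   # "integer"
class(x_integer)    # "integer"

# is.numeric() is TRUE for both doubles and integers
is.numeric(x_double)
is.numeric(x_integer)
```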

+

A user has added details of another cat. This information is in the +file data/feline-data_v2.csv.

+
+

R +

+
+file.show("data/feline-data_v2.csv")
+
+
+

R +

+
coat,weight,likes_string
+calico,2.1,1
+black,5.0,0
+tabby,3.2,1
+tabby,2.3 or 2.4,1
+
+

Load the new cats data like before, and check what type of data we +find in the weight column:

+
+

R +

+
+cats <- read.csv(file="data/feline-data_v2.csv")
+typeof(cats$weight)
+
+
+

OUTPUT +

+
[1] "character"
+
+

Oh no, our weights aren’t the double type anymore! If we try to do +the same math we did on them before, we run into trouble:

+
+

R +

+
+cats$weight + 2
+
+
+

ERROR +

+
Error in cats$weight + 2: non-numeric argument to binary operator
+
+

What happened? The cats data we are working with is something called a data frame. Data frames are one of the most common and versatile types of data structures we will work with in R. A given column in a data frame cannot be composed of different data types. In this case, R cannot read everything in the data frame column weight as a double, so the data type of the entire column changes to one that is suitable for everything in the column.
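One way to locate the offending entries is to attempt the conversion and see which values fail. This is a minimal sketch on a stand-alone vector (`weights_text` is a made-up copy of the column contents), not part of the original lesson code.

```r
# The weight column as R read it: every value is text
weights_text <- c("2.1", "5.0", "3.2", "2.3 or 2.4")

# Values that cannot be interpreted as numbers become NA
# (R also emits a warning, silenced here to keep the output short)
weights_number <- suppressWarnings(as.numeric(weights_text))
weights_number

# Positions of the entries that are not valid numbers
which(is.na(weights_number))
```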

+

When R reads a csv file, it reads it in as a data frame. Thus, when we loaded the cats csv file, it was stored as a data frame. We can recognize data frames by the first row that is written by the str() function:

+
+

R +

+
+str(cats)
+
+
+

OUTPUT +

+
'data.frame':	4 obs. of  3 variables:
+ $ coat        : chr  "calico" "black" "tabby" "tabby"
+ $ weight      : chr  "2.1" "5" "3.2" "2.3 or 2.4"
+ $ likes_string: int  1 0 1 1
+
+

Data frames are composed of rows and columns, where each +column has the same number of rows. Different columns in a data frame +can be made up of different data types (this is what makes them so +versatile), but everything in a given column needs to be the same type +(e.g., vector, factor, or list).

+

Let’s explore more about different data structures and how they +behave. For now, let’s remove that extra line from our cats data and +reload it, while we investigate this behavior further:

+

feline-data.csv:

+
coat,weight,likes_string
+calico,2.1,1
+black,5.0,0
+tabby,3.2,1
+

And back in RStudio:

+
+

R +

+
+cats <- read.csv(file="data/feline-data.csv")
+
+
+
+

Vectors and Type Coercion

+

To better understand this behavior, let’s meet another of the data +structures: the vector.

+
+

R +

+
+my_vector <- vector(length = 3)
+my_vector
+
+
+

OUTPUT +

+
[1] FALSE FALSE FALSE
+
+

A vector in R is essentially an ordered list of things, with the special condition that everything in the vector must be the same basic data type. If you don’t choose the data type, it’ll default to logical; or, you can declare an empty vector of whatever type you like.

+
+

R +

+
+another_vector <- vector(mode='character', length=3)
+another_vector
+
+
+

OUTPUT +

+
[1] "" "" ""
+
+

You can check if something is a vector:

+
+

R +

+
+str(another_vector)
+
+
+

OUTPUT +

+
 chr [1:3] "" "" ""
+
+

The somewhat cryptic output from this command indicates the basic +data type found in this vector - in this case chr, +character; an indication of the number of things in the vector - +actually, the indexes of the vector, in this case [1:3]; +and a few examples of what’s actually in the vector - in this case empty +character strings. If we similarly do

+
+

R +

+
+str(cats$weight)
+
+
+

OUTPUT +

+
 num [1:3] 2.1 5 3.2
+
+

we see that cats$weight is a vector, too - the +columns of data we load into R data.frames are all vectors, and +that’s the root of why R forces everything in a column to be the same +basic data type.
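The same point can be checked for every column at once. The sketch below rebuilds a small copy of the cats data (named `cats_demo` here so it does not overwrite the lesson's `cats` object) and asks each column for its basic type.

```r
# Stand-alone copy of the example data
cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
                        weight = c(2.1, 5.0, 3.2),
                        stringsAsFactors = FALSE)

# A single column is an ordinary vector
is.vector(cats_demo$weight)

# Each column has exactly one basic data type
column_types <- sapply(cats_demo, typeof)
column_types
```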

+
+
+ +
+
+

Discussion 1 +

+
+

Why is R so opinionated about what we put in our columns of data? How +does this help us?

+
+
+ +
+
+

By keeping everything in a column the same, we allow ourselves to +make simple assumptions about our data; if you can interpret one entry +in the column as a number, then you can interpret all of them +as numbers, so we don’t have to check every time. This consistency is +what people mean when they talk about clean data; in the long +run, strict consistency goes a long way to making our lives easier in +R.

+
+
+
+
+
+
+
+
+

Coercion by combining vectors

+

You can also make vectors with explicit contents with the combine +function:

+
+

R +

+
+combine_vector <- c(2,6,3)
+combine_vector
+
+
+

OUTPUT +

+
[1] 2 6 3
+
+

Given what we’ve learned so far, what do you think the following will +produce?

+
+

R +

+
+quiz_vector <- c(2,6,'3')
+
+

This is something called type coercion, and it is the source +of many surprises and the reason why we need to be aware of the basic +data types and how R will interpret them. When R encounters a mix of +types (here double and character) to be combined into a single vector, +it will force them all to be the same type. Consider:

+
+

R +

+
+coercion_vector <- c('a', TRUE)
+coercion_vector
+
+
+

OUTPUT +

+
[1] "a"    "TRUE"
+
+
+

R +

+
+another_coercion_vector <- c(0, TRUE)
+another_coercion_vector
+
+
+

OUTPUT +

+
[1] 0 1
+
+
+
+

The type hierarchy

+

The coercion rules go: logical -> +integer -> double (“numeric”) +-> complex -> character, where -> can +be read as are transformed into. For example, combining +logical and character transforms the result to +character:

+
+

R +

+
+c('a', TRUE)
+
+
+

OUTPUT +

+
[1] "a"    "TRUE"
+
+

A quick way to recognize character vectors is by the +quotes that enclose them when they are printed.
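Each step of the hierarchy can be observed by combining two neighbouring types and asking for the type of the result, as in this short sketch:

```r
typeof(c(TRUE, 1L))     # logical + integer   -> "integer"
typeof(c(1L, 2.5))      # integer + double    -> "double"
typeof(c(2.5, 1i))      # double  + complex   -> "complex"
typeof(c(1i, "a"))      # complex + character -> "character"
```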

+

You can try to force coercion against this flow using the +as. functions:

+
+

R +

+
+character_vector_example <- c('0','2','4')
+character_vector_example
+
+
+

OUTPUT +

+
[1] "0" "2" "4"
+
+
+

R +

+
+character_coerced_to_double <- as.double(character_vector_example)
+character_coerced_to_double
+
+
+

OUTPUT +

+
[1] 0 2 4
+
+
+

R +

+
+double_coerced_to_logical <- as.logical(character_coerced_to_double)
+double_coerced_to_logical
+
+
+

OUTPUT +

+
[1] FALSE  TRUE  TRUE
+
+

As you can see, some surprising things can happen when R forces one +basic data type into another! Nitty-gritty of type coercion aside, the +point is: if your data doesn’t look like what you thought it was going +to look like, type coercion may well be to blame; make sure everything +is the same type in your vectors and your columns of data.frames, or you +will get nasty surprises!
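A few more conversions that tend to surprise people are sketched below; the input values are made up for illustration.

```r
# Only certain spellings of text are recognised as logical values;
# "yes" and "0" are not, so they become NA
text_to_logical <- as.logical(c("TRUE", "true", "T", "yes", "0"))
text_to_logical

# Converting a double to an integer drops the fractional part
double_to_integer <- as.integer(3.9)
double_to_integer

# Logical values become 1 and 0
logical_to_integer <- as.integer(c(TRUE, FALSE))
logical_to_integer
```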

+

But coercion can also be very useful! For example, in our cats data likes_string is numeric, but we know that the 1s and 0s actually represent TRUE and FALSE (a common way of representing them). We should use the logical data type here, which has two states: TRUE or FALSE, which is exactly what our data represents. We can ‘coerce’ this column to be logical by using the as.logical function:

+
+

R +

+
+cats$likes_string
+
+
+

OUTPUT +

+
[1] 1 0 1
+
+
+

R +

+
+cats$likes_string <- as.logical(cats$likes_string)
+cats$likes_string
+
+
+

OUTPUT +

+
[1]  TRUE FALSE  TRUE
+
+
+
+ +
+
+

Challenge 1 +

+
+

An important part of every data analysis is cleaning the input data. If you know that the input data is all of the same format (e.g. numbers), your analysis is much easier! Clean the cat data set from the chapter about type coercion.

+
+

Copy the code template

+

Create a new script in RStudio and copy and paste the following code. +Then move on to the tasks below, which help you to fill in the gaps +(______).

+
# Read data
+cats <- read.csv("data/feline-data_v2.csv")
+
+# 1. Print the data
+_____
+
+# 2. Show an overview of the table with all data types
+_____(cats)
+
+# 3. The "weight" column has the incorrect data type __________.
+#    The correct data type is: ____________.
+
+# 4. Correct the 4th weight data point with the mean of the two given values
+cats$weight[4] <- 2.35
+#    print the data again to see the effect
+cats
+
+# 5. Convert the weight to the right data type
+cats$weight <- ______________(cats$weight)
+
+#    Calculate the mean to test yourself
+mean(cats$weight)
+
+# If you see the correct mean value (and not NA), you did the exercise
+# correctly!
+
+
+

Instructions for the tasks

+
+ +

Execute the first statement (read.csv(...)). Then print +the data to the console

+
+
+
+
+
+
+
+ +
+
+

Show the content of any variable by typing its name.

+
+

Solution to Challenge 1.1

+

Two correct solutions:

+
cats
+print(cats)
+
+
+
+
+
+
+
+ +
+
+

2. Overview of the data types +

+
+

The data type of your data is as important as the data itself. Use a +function we saw earlier to print out the data types of all columns of +the cats table.

+
+
+
+
+
+ +
+
+

In the chapter “Data types” we saw two functions that can show data +types. One printed just a single word, the data type name. The other +printed a short form of the data type, and the first few values. We need +the second here.

+
+
+
+
+
+
+ +
+
+

Challenge 1 (continued) +

+
+
+

Solution to Challenge 1.2

+
str(cats)
+
+
+

3. Which data type do we need?

+

The shown data type is not the right one for this data (weight of a +cat). Which data type do we need?

+
  • Why did the read.csv() function not choose the correct +data type?
  • +
  • Fill in the gap in the comment with the correct data type for cat +weight!
  • +
+
+
+
+
+
+ +
+
+

Scroll up to the section about the type +hierarchy to review the available data types

+
+
+
+
+
+
+ +
+
+
  • Weight is expressed on a continuous scale (real numbers). The R data +type for this is “double” (also known as “numeric”).
  • +
  • The fourth row has the value “2.3 or 2.4”. That is not one number but two, joined by an English word. Therefore, the “character” data type is chosen. The whole column is now text, because all values in the same column have to be the same data type.
  • +
+
+
+
+
+
+ +
+
+

4. Correct the problematic value +

+
+

The code to assign a new weight value to the problematic fourth row +is given. Think first and then execute it: What will be the data type +after assigning a number like in this example? You can check the data +type after executing to see if you were right.

+
+
+
+
+
+ +
+
+

Revisit the hierarchy of data types when two different data types are +combined.

+
+
+
+
+
+
+ +
+
+

Challenge 1 (continued) +

+
+
+

Solution to challenge 1.4

+

The data type of the column “weight” is “character”. The assigned +data type is “double”. Combining two data types yields the data type +that is higher in the following hierarchy:

+
logical < integer < double < complex < character
+

Therefore, the column is still of type character! We need to manually convert it to “double”.

+
+
+

5. Convert the column “weight” to the correct data type

+

Cat weights are numbers. But the column does not have this data type yet. Coerce the column to floating point numbers.

+
+
+
+
+
+
+ +
+
+

The functions to convert data types start with as.. You +can look for the function further up in the manuscript or use the +RStudio auto-complete function: Type “as.” and then press +the TAB key.

+
+
+
+
+
+
+ +
+
+

Challenge 1 (continued) +

+
+
+

Solution to Challenge 1.5

+

There are two functions that are synonymous for historic reasons:

+
cats$weight <- as.double(cats$weight)
+cats$weight <- as.numeric(cats$weight)
+
+
+
+
+
+
+
+

Some basic vector functions

+

The combine function, c(), will also append things to an +existing vector:

+
+

R +

+
+ab_vector <- c('a', 'b')
+ab_vector
+
+
+

OUTPUT +

+
[1] "a" "b"
+
+
+

R +

+
+combine_example <- c(ab_vector, 'SWC')
+combine_example
+
+
+

OUTPUT +

+
[1] "a"   "b"   "SWC"
+
+

You can also make series of numbers:

+
+

R +

+
+mySeries <- 1:10
+mySeries
+
+
+

OUTPUT +

+
 [1]  1  2  3  4  5  6  7  8  9 10
+
+
+

R +

+
+seq(10)
+
+
+

OUTPUT +

+
 [1]  1  2  3  4  5  6  7  8  9 10
+
+
+

R +

+
+seq(1,10, by=0.1)
+
+
+

OUTPUT +

+
 [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4
+[16]  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9
+[31]  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4
+[46]  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
+[61]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4
+[76]  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9
+[91] 10.0
+
+

We can ask a few questions about vectors:

+
+

R +

+
+sequence_example <- 20:25
+head(sequence_example, n=2)
+
+
+

OUTPUT +

+
[1] 20 21
+
+
+

R +

+
+tail(sequence_example, n=4)
+
+
+

OUTPUT +

+
[1] 22 23 24 25
+
+
+

R +

+
+length(sequence_example)
+
+
+

OUTPUT +

+
[1] 6
+
+
+

R +

+
+typeof(sequence_example)
+
+
+

OUTPUT +

+
[1] "integer"
+
+

We can get individual elements of a vector by using the bracket +notation:

+
+

R +

+
+first_element <- sequence_example[1]
+first_element
+
+
+

OUTPUT +

+
[1] 20
+
+

To change a single element, use the bracket on the other side of the +arrow:

+
+

R +

+
+sequence_example[1] <- 30
+sequence_example
+
+
+

OUTPUT +

+
[1] 30 21 22 23 24 25
+
+
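Type coercion also applies to this kind of assignment. In the sketch below (variable names are made up), assigning the double `30` into an integer vector converts the whole vector to double, while assigning the integer `30L` keeps it an integer vector.

```r
sequence_demo <- 20:25
typeof(sequence_demo)          # "integer"

# 30 is a double, so the whole vector is coerced to double
sequence_demo[1] <- 30
typeof(sequence_demo)          # "double"

# Using the L suffix keeps the vector an integer vector
sequence_integer <- 20:25
sequence_integer[1] <- 30L
typeof(sequence_integer)       # "integer"
```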
+
+ +
+
+

Challenge 2 +

+
+

Start by making a vector with the numbers 1 through 26. Then, +multiply the vector by 2.

+
+
+
+
+
+ +
+
+
+

R +

+
+x <- 1:26
+x <- x * 2
+
+
+
+
+
+
+
+

Lists

+

Another data structure you’ll want in your bag of tricks is the +list. A list is simpler in some ways than the other types, +because you can put anything you want in it. Remember everything in +the vector must be of the same basic data type, but a list can have +different data types:

+
+

R +

+
+list_example <- list(1, "a", TRUE, 1+4i)
+list_example
+
+
+

OUTPUT +

+
[[1]]
+[1] 1
+
+[[2]]
+[1] "a"
+
+[[3]]
+[1] TRUE
+
+[[4]]
+[1] 1+4i
+
+

When printing the object structure with str(), we see +the data types of all elements:

+
+

R +

+
+str(list_example)
+
+
+

OUTPUT +

+
List of 4
+ $ : num 1
+ $ : chr "a"
+ $ : logi TRUE
+ $ : cplx 1+4i
+
+

What is the use of lists? They can organize data of different +types. For example, you can organize different tables that +belong together, similar to spreadsheets in Excel. But there are many +other uses, too.

+

We will see another example that will maybe surprise you in the next +chapter.

+

To retrieve one of the elements of a list, use the double +bracket:

+
+

R +

+
+list_example[[2]]
+
+
+

OUTPUT +

+
[1] "a"
+
+
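The difference between single and double brackets on a list can be seen in this short sketch (`list_demo` is a made-up example): single brackets return a smaller list, double brackets return the element itself.

```r
list_demo <- list(1, "a", TRUE)

# Single bracket: a list of length one that contains the element
single_bracket <- list_demo[2]
class(single_bracket)          # "list"

# Double bracket: the element itself
double_bracket <- list_demo[[2]]
class(double_bracket)          # "character"
```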

The elements of lists can also have names, which are given by prepending them to the values, separated by an equals sign:

+
+

R +

+
+another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE )
+another_list
+
+
+

OUTPUT +

+
$title
+[1] "Numbers"
+
+$numbers
+ [1]  1  2  3  4  5  6  7  8  9 10
+
+$data
+[1] TRUE
+
+

This results in a named list. Names give our object a new capability: we can now access single elements in an additional way!

+
+

R +

+
+another_list$title
+
+
+

OUTPUT +

+
[1] "Numbers"
+
+
+

Names +

+

With names, we can give meaning to elements. It is the first time that we have not only the data, but also information that explains it. It is metadata that can be stuck to the object like a label. In R, this is called an attribute. Some attributes enable us to do more with our object, for example, like here, accessing an element by a self-defined name.

+
+

Accessing vectors and lists by name

+

We have already seen how to generate a named list. The way to +generate a named vector is very similar. You have seen this function +before:

+
+

R +

+
+pizza_price <- c( pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50 )
+
+

The way to retrieve elements is different, though:

+
+

R +

+
+pizza_price["pizzasubito"]
+
+
+

OUTPUT +

+
pizzasubito 
+       5.64 
+
+

The approach used for the list does not work:

+
+

R +

+
+pizza_price$pizzafresh
+
+
+

ERROR +

+
Error in pizza_price$pizzafresh: $ operator is invalid for atomic vectors
+
+

It will pay off if you remember this error message, because you will meet it in your own analyses. It means that you have just tried accessing an element as if it were in a list, but it is actually in a vector.
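Bracket-based access by name works for both structures. The sketch below uses a made-up vector `pizza_demo` and its list counterpart: double brackets return the bare value in both cases, whereas `$` is available only for the list.

```r
pizza_demo <- c(pizzasubito = 5.64, pizzafresh = 6.60)

# Single brackets keep the name, double brackets return only the value
pizza_demo["pizzafresh"]
price_only <- pizza_demo[["pizzafresh"]]
price_only

# After converting to a list, $ works as well
pizza_list <- as.list(pizza_demo)
pizza_list$pizzafresh
```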

+
+
+

Accessing and changing names

+

If you are only interested in the names, use the names() +function:

+
+

R +

+
+names(pizza_price)
+
+
+

OUTPUT +

+
[1] "pizzasubito" "pizzafresh"  "callapizza" 
+
+

We have seen how to access and change single elements of a vector. +The same is possible for names:

+
+

R +

+
+names(pizza_price)[3]
+
+
+

OUTPUT +

+
[1] "callapizza"
+
+
+

R +

+
+names(pizza_price)[3] <- "call-a-pizza"
+pizza_price
+
+
+

OUTPUT +

+
 pizzasubito   pizzafresh call-a-pizza 
+        5.64         6.60         4.50 
+
+
+
+ +
+
+

Challenge 3 +

+
+
  • What is the data type of the names of pizza_price? You +can find out using the str() or typeof() +functions.
  • +
+
+
+
+
+ +
+
+

You get the names of an object by wrapping the object name inside +names(...). Similarly, you get the data type of the names +by again wrapping the whole code in typeof(...):

+
typeof(names(pizza_price))
+

alternatively, use a new variable if this is easier for you to +read:

+
n <- names(pizza)
+typeof(n)
+
+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

Instead of just changing some of the names a vector/list already has, +you can also set all names of an object by writing code like (replace +ALL CAPS text):

+
names( OBJECT ) <-  CHARACTER_VECTOR
+

Create a vector that gives the number for each letter in the +alphabet!

+
  1. Generate a vector called letter_no with the sequence of +numbers from 1 to 26!
  2. +
  3. R has a built-in object called LETTERS. It is a character vector of length 26, from A to Z. Set the names of the number sequence to these 26 letters.
  4. +
  5. Test yourself by calling letter_no["B"], which should +give you the number 2!
  6. +
+
+
+
+
+ +
+
+
letter_no <- 1:26   # or seq(1,26)
+names(letter_no) <- LETTERS
+letter_no["B"]
+
+
+
+
+
+

Data frames +

+

We met data frames at the very beginning of this lesson; they represent a table of data. We didn’t go into much detail with our example cat data frame:

+
+

R +

+
+cats
+
+
+

OUTPUT +

+
    coat weight likes_string
+1 calico    2.1         TRUE
+2  black    5.0        FALSE
+3  tabby    3.2         TRUE
+
+

We can now understand something a bit surprising in our data.frame; +what happens if we run:

+
+

R +

+
+typeof(cats)
+
+
+

OUTPUT +

+
[1] "list"
+
+

We see that data.frames look like lists ‘under the hood’. Think back to what we learned about what lists can be used for:

+
+

Lists organize data of different types

+
+

Columns of a data frame are vectors of different types, that are +organized by belonging to the same table.

+

A data.frame is really a list of vectors. It is a special list in +which all the vectors must have the same length.
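The claim can be checked directly. The sketch below rebuilds the cat table under the hypothetical name `cats_demo` (so the lesson's `cats` object is not modified) and inspects it as a list; `stringsAsFactors = FALSE` is set explicitly so the result does not depend on the R version.

```r
# Hypothetical copy of the cats table used only for this illustration
cats_demo <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_string = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# A data frame passes the list test
is.list(cats_demo)

# lengths() reports the length of every list element (here: every column)
col_lengths <- lengths(cats_demo)

# Each column is a vector with its own type
col_types <- sapply(cats_demo, typeof)
```

All entries of `col_lengths` are equal, which is exactly the constraint that separates a data frame from an arbitrary list.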

+

How is this “special”-ness written into the object, so that R does +not treat it like any other list, but as a table?

+
+

R +

+
+class(cats)
+
+
+

OUTPUT +

+
[1] "data.frame"
+
+

A class, just like names, is an attribute attached +to the object. It tells us what this object means for humans.

+

You might wonder: why do we need another what-type-of-object-is-this function when we already have typeof()? That function tells us how the object is constructed in the computer. The class is the meaning of the object for humans. Consequently, what typeof() returns is fixed in R (mainly the five data types), whereas the output of class() is diverse and extendable by R packages.
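A short sketch of how the class attribute sits on top of the underlying list, again using a hypothetical `cats_demo` copy of the table. `attributes()` lists everything attached to the object, and `unclass()` strips the class so that R falls back to treating it as a plain list.

```r
# Hypothetical copy of the cats table used only for this illustration
cats_demo <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_string = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# How the object is stored versus what it means
storage <- typeof(cats_demo)  # "list"
meaning <- class(cats_demo)   # "data.frame"

# Names of all attributes attached to the data frame
attr_names <- names(attributes(cats_demo))

# Removing the class leaves an ordinary list with the same columns
plain <- unclass(cats_demo)
plain_class <- class(plain)   # "list"
```

Printing `plain` shows the columns one after another instead of a table, which illustrates that the tabular behaviour comes from the class, not from the stored data.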

+

In our cats example, we have a character, a double and a logical variable. As we have seen already, each column of a data.frame is a vector.

+
+

R +

+
+cats$coat
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+
+

R +

+
+cats[,1]
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+
+

R +

+
+typeof(cats[,1])
+
+
+

OUTPUT +

+
[1] "character"
+
+
+

R +

+
+str(cats[,1])
+
+
+

OUTPUT +

+
 chr [1:3] "calico" "black" "tabby"
+
+

Each row is an observation of different variables, itself a +data.frame, and thus can be composed of elements of different types.

+
+

R +

+
+cats[1,]
+
+
+

OUTPUT +

+
    coat weight likes_string
+1 calico    2.1         TRUE
+
+
+

R +

+
+typeof(cats[1,])
+
+
+

OUTPUT +

+
[1] "list"
+
+
+

R +

+
+str(cats[1,])
+
+
+

OUTPUT +

+
'data.frame':	1 obs. of  3 variables:
+ $ coat        : chr "calico"
+ $ weight      : num 2.1
+ $ likes_string: logi TRUE
+
+
+
+ +
+
+

Challenge 5 +

+
+

There are several subtly different ways to call variables, +observations and elements from data.frames:

+
  • cats[1]
  • +
  • cats[[1]]
  • +
  • cats$coat
  • +
  • cats["coat"]
  • +
  • cats[1, 1]
  • +
  • cats[, 1]
  • +
  • cats[1, ]
  • +

Try out these examples and explain what is returned by each one.

+

Hint: Use the function typeof() to examine what +is returned in each case.

+
+
+
+
+
+ +
+
+
+

R +

+
+cats[1]
+
+
+

OUTPUT +

+
    coat
+1 calico
+2  black
+3  tabby
+
+

We can think of a data frame as a list of vectors. The single bracket [1] returns the first slice of the list, as another list. In this case it is the first column of the data frame.

+
+

R +

+
+cats[[1]]
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+

The double bracket [[1]] returns the contents of the list item. In this case it is the contents of the first column, a vector of type character.

+
+

R +

+
+cats$coat
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+

This example uses the $ character to address items by +name. coat is the first column of the data frame, again a +vector of type character.

+
+

R +

+
+cats["coat"]
+
+
+

OUTPUT +

+
    coat
+1 calico
+2  black
+3  tabby
+
+

Here we are using a single bracket ["coat"], replacing the index number with the column name. Like example 1, the returned object is a list.

+
+

R +

+
+cats[1, 1]
+
+
+

OUTPUT +

+
[1] "calico"
+
+

This example uses a single bracket, but this time we provide row and column coordinates. The returned object is the value in row 1, column 1. The object is a vector of type character.

+
+

R +

+
+cats[, 1]
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+

Like the previous example, we use single brackets and provide row and column coordinates. The row coordinate is not specified, so R interprets this missing value as all the elements in this column and returns them as a vector.

+
+

R +

+
+cats[1, ]
+
+
+

OUTPUT +

+
    coat weight likes_string
+1 calico    2.1         TRUE
+
+

Again we use the single bracket with row and column coordinates. The column coordinate is not specified. The return value is a list containing all the values in the first row.

+
+
+
+
+
+
+ +
+
+

Tip: Renaming data frame columns +

+
+

Data frames have column names, which can be accessed with the +names() function.

+
+

R +

+
+names(cats)
+
+
+

OUTPUT +

+
[1] "coat"         "weight"       "likes_string"
+
+

If you want to rename the second column of cats, you can +assign a new name to the second element of names(cats).

+
+

R +

+
+names(cats)[2] <- "weight_kg"
+cats
+
+
+

OUTPUT +

+
    coat weight_kg likes_string
+1 calico       2.1         TRUE
+2  black       5.0        FALSE
+3  tabby       3.2         TRUE
+
+
+
+
+
+

Matrices

+

Last but not least is the matrix. We can declare a matrix full of +zeros:

+
+

R +

+
+matrix_example <- matrix(0, ncol=6, nrow=3)
+matrix_example
+
+
+

OUTPUT +

+
     [,1] [,2] [,3] [,4] [,5] [,6]
+[1,]    0    0    0    0    0    0
+[2,]    0    0    0    0    0    0
+[3,]    0    0    0    0    0    0
+
+

What makes it special is the dim() attribute:

+
+

R +

+
+dim(matrix_example)
+
+
+

OUTPUT +

+
[1] 3 6
+
+

And similar to other data structures, we can ask things about our +matrix:

+
+

R +

+
+typeof(matrix_example)
+
+
+

OUTPUT +

+
[1] "double"
+
+
+

R +

+
+class(matrix_example)
+
+
+

OUTPUT +

+
[1] "matrix" "array" 
+
+
+

R +

+
+str(matrix_example)
+
+
+

OUTPUT +

+
 num [1:3, 1:6] 0 0 0 0 0 0 0 0 0 0 ...
+
+
+

R +

+
+nrow(matrix_example)
+
+
+

OUTPUT +

+
[1] 3
+
+
+

R +

+
+ncol(matrix_example)
+
+
+

OUTPUT +

+
[1] 6
+
+
+
+ +
+
+

Challenge 6 +

+
+

What do you think will be the result of +length(matrix_example)? Try it. Were you right? Why / why +not?

+
+
+
+
+
+ +
+
+

What do you think will be the result of +length(matrix_example)?

+
+

R +

+
+matrix_example <- matrix(0, ncol=6, nrow=3)
+length(matrix_example)
+
+
+

OUTPUT +

+
[1] 18
+
+

Because a matrix is a vector with added dimension attributes, +length gives you the total number of elements in the +matrix.
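To see this relationship from the other direction, the following sketch turns an ordinary vector into a matrix just by assigning a `dim` attribute. The name `v` is a hypothetical variable introduced for illustration.

```r
# Start with a plain integer vector of 18 elements
v <- 1:18
before <- is.matrix(v)   # FALSE: no dim attribute yet

# Attaching dimensions (3 rows, 6 columns) is all it takes
dim(v) <- c(3, 6)
after <- is.matrix(v)    # TRUE

# The data are unchanged: still 18 elements, filled column by column
total <- length(v)
cell <- v[2, 3]          # row 2 of column 3 is the 8th element
```

Because the elements are laid out column by column, position `[2, 3]` corresponds to element number (3 - 1) * 3 + 2 = 8 of the original vector.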

+
+
+
+
+
+
+ +
+
+

Challenge 7 +

+
+

Make another matrix, this time containing the numbers 1:50, with 5 +columns and 10 rows. Did the matrix function fill your +matrix by column, or by row, as its default behaviour? See if you can +figure out how to change this. (hint: read the documentation for +matrix!)

+
+
+
+
+
+ +
+
+


+
+

R +

+
+x <- matrix(1:50, ncol=5, nrow=10)
+x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row
+
+
+
+
+
+
+
+ +
+
+

Challenge 8 +

+
+

Create a list of length two containing a character vector for each of +the sections in this part of the workshop:

+
  • Data types
  • +
  • Data structures
  • +

Populate each character vector with the names of the data types and +data structures we’ve seen so far.

+
+
+
+
+
+ +
+
+
+

R +

+
+dataTypes <- c('double', 'complex', 'integer', 'character', 'logical')
+dataStructures <- c('data.frame', 'vector', 'list', 'matrix')
+answer <- list(dataTypes, dataStructures)
+
+

Note: it’s nice to make a list in big writing on the board or taped +to the wall listing all of these types and structures - leave it up for +the rest of the workshop to remind people of the importance of these +basics.

+
+
+
+
+
+
+ +
+
+

Challenge 9 +

+
+

Consider the R output of the matrix below:

+
+

OUTPUT +

+
     [,1] [,2]
+[1,]    4    1
+[2,]    9    5
+[3,]   10    7
+
+

What was the correct command used to write this matrix? Examine each +command and try to figure out the correct one before typing them. Think +about what matrices the other commands will produce.

+
  1. matrix(c(4, 1, 9, 5, 10, 7), nrow = 3)
  2. +
  3. matrix(c(4, 9, 10, 1, 5, 7), ncol = 2, byrow = TRUE)
  4. +
  5. matrix(c(4, 9, 10, 1, 5, 7), nrow = 2)
  6. +
  7. matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)
  8. +
+
+
+
+
+ +
+
+


+
+

R +

+
+matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)
+
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
  • Use read.csv to read tabular data in R.
  • +
  • The basic data types in R are double, integer, complex, logical, and +character.
  • +
  • Data structures such as data frames or matrices are built on top of +lists and vectors, with some added attributes.
  • +
+
+
+
+
+
+ + +
+
+
+ +
+
+ + diff --git a/05-data-structures-part2.html b/05-data-structures-part2.html new file mode 100644 index 000000000..c85c29794 --- /dev/null +++ b/05-data-structures-part2.html @@ -0,0 +1,1209 @@ + +R for Reproducible Scientific Analysis: Exploring Data Frames +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Exploring Data Frames

+

Last updated on 2023-10-26

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I manipulate a data frame?
  • +
+
+
+
+
+
+

Objectives

+
  • Add and remove rows or columns.
  • +
  • Append two data frames.
  • +
  • Display basic properties of data frames including size and class of +the columns, names, and first few rows.
  • +
+
+
+
+
+

At this point, you’ve seen it all: in the last lesson, we toured all +the basic data types and data structures in R. Everything you do will be +a manipulation of those tools. But most of the time, the star of the +show is the data frame—the table that we created by loading information +from a csv file. In this lesson, we’ll learn a few more things about +working with data frames.

+

Adding columns and rows in data frames +

+

We already learned that the columns of a data frame are vectors, so +that our data are consistent in type throughout the columns. As such, if +we want to add a new column, we can start by making a new vector:

+
+

R +

+
+age <- c(2, 3, 5)
+cats
+
+
+

OUTPUT +

+
    coat weight likes_string
+1 calico    2.1            1
+2  black    5.0            0
+3  tabby    3.2            1
+
+

We can then add this as a column via:

+
+

R +

+
+cbind(cats, age)
+
+
+

OUTPUT +

+
    coat weight likes_string age
+1 calico    2.1            1   2
+2  black    5.0            0   3
+3  tabby    3.2            1   5
+
+

Note that if we tried to add a vector of ages with a different number +of entries than the number of rows in the data frame, it would fail:

+
+

R +

+
+age <- c(2, 3, 5, 12)
+cbind(cats, age)
+
+
+

ERROR +

+
Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 4
+
+
+

R +

+
+age <- c(2, 3)
+cbind(cats, age)
+
+
+

ERROR +

+
Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 2
+
+

Why didn’t this work? Of course, R wants to see one element in our +new column for every row in the table:

+
+

R +

+
+nrow(cats)
+
+
+

OUTPUT +

+
[1] 3
+
+
+

R +

+
+length(age)
+
+
+

OUTPUT +

+
[1] 2
+
+

So for it to work, nrow(cats) must equal length(age). Let’s overwrite the content of cats with our new data frame.

+
+

R +

+
+age <- c(2, 3, 5)
+cats <- cbind(cats, age)
+
+
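The length requirement can be checked in code before binding. This is a minimal sketch with hypothetical objects `cats_demo` and `age_demo`, not part of the lesson's own workflow; it stops with a readable message instead of the `data.frame(...)` error shown above.

```r
# Hypothetical table and new column used only for this illustration
cats_demo <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2)
)
age_demo <- c(2, 3, 5)

# Only bind when there is exactly one value per row
if (length(age_demo) == nrow(cats_demo)) {
  cats_demo <- cbind(cats_demo, age = age_demo)
} else {
  stop("age_demo must have one entry per row of cats_demo")
}

new_names <- names(cats_demo)
```

Naming the argument (`age = age_demo`) sets the column name explicitly rather than relying on the name of the variable.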

Now how about adding rows? We already know that the rows of a data +frame are lists:

+
+

R +

+
+newRow <- list("tortoiseshell", 3.3, TRUE, 9)
+cats <- rbind(cats, newRow)
+
+

Let’s confirm that our new row was added correctly.

+
+

R +

+
+cats
+
+
+

OUTPUT +

+
           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+
+
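An unnamed list relies on the values being in the same order as the columns. As an alternative sketch, the new row can be written as a one-row data frame with matching column names, which `rbind` aligns by name. The objects `cats_demo` and `new_row` are hypothetical and only mirror the structure of `cats`.

```r
# Hypothetical table with the same columns as cats at this point in the lesson
cats_demo <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_string = c(1, 0, 1),
  age = c(2, 3, 5),
  stringsAsFactors = FALSE
)

# Columns are deliberately listed in a different order here
new_row <- data.frame(
  age = 9,
  coat = "tortoiseshell",
  likes_string = 1,
  weight = 3.3,
  stringsAsFactors = FALSE
)

# rbind matches the columns of the two data frames by name
cats_demo <- rbind(cats_demo, new_row)
last_row <- cats_demo[nrow(cats_demo), ]
```

The design trade-off is verbosity against safety: the list form is shorter, while the named form still works if the column order changes.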

Removing rows +

+

We now know how to add rows and columns to our data frame in R. Now +let’s learn to remove rows.

+
+

R +

+
+cats
+
+
+

OUTPUT +

+
           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+
+

We can ask for a data frame minus the last row:

+
+

R +

+
+cats[-4, ]
+
+
+

OUTPUT +

+
    coat weight likes_string age
+1 calico    2.1            1   2
+2  black    5.0            0   3
+3  tabby    3.2            1   5
+
+

Notice the comma with nothing after it to indicate that we want to +drop the entire fourth row.

+

Note: we could also remove several rows at once by putting the row +numbers inside of a vector, for example: +cats[c(-3,-4), ]

+

Removing columns +

+

We can also remove columns in our data frame. What if we want to remove the column “age”? We can remove it in two ways: by column number or by column name.

+
+

R +

+
+cats[,-4]
+
+
+

OUTPUT +

+
           coat weight likes_string
+1        calico    2.1            1
+2         black    5.0            0
+3         tabby    3.2            1
+4 tortoiseshell    3.3            1
+
+

Notice the comma with nothing before it, indicating we want to keep +all of the rows.

+

Alternatively, we can drop the column by using its name and the %in% operator. The %in% operator goes through each element of its left argument, in this case the names of cats, and asks, “Does this element occur in the right argument?”

+
+

R +

+
+drop <- names(cats) %in% c("age")
+cats[,!drop]
+
+
+

OUTPUT +

+
           coat weight likes_string
+1        calico    2.1            1
+2         black    5.0            0
+3         tabby    3.2            1
+4 tortoiseshell    3.3            1
+
+

We will cover subsetting with logical operators like %in% in more detail in the next episode. See the section Subsetting through other logical operations.
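To make the behaviour of `%in%` concrete, here is a small sketch on a plain character vector. The vector `col_names` is a hypothetical stand-in for `names(cats)`.

```r
# Hypothetical stand-in for names(cats)
col_names <- c("coat", "weight", "likes_string", "age")

# One TRUE/FALSE answer per element of the left argument
is_age <- col_names %in% c("age")

# The right argument may hold several values to match against
is_dropped <- col_names %in% c("age", "weight")

# Negating the result keeps everything that was not matched
kept <- col_names[!is_dropped]
```

The result always has the same length as the left argument, which is what makes it usable as a logical index for columns.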

+

Appending to a data frame +

+

The key to remember when adding data to a data frame is that +columns are vectors and rows are lists. We can also glue two +data frames together with rbind:

+
+

R +

+
+cats <- rbind(cats, cats)
+cats
+
+
+

OUTPUT +

+
           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+5        calico    2.1            1   2
+6         black    5.0            0   3
+7         tabby    3.2            1   5
+8 tortoiseshell    3.3            1   9
+
+

But now the row names are unnecessarily complicated. We can remove +the rownames, and R will automatically re-name them sequentially:

+
+

R +

+
+rownames(cats) <- NULL
+cats
+
+
+

OUTPUT +

+
           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+5        calico    2.1            1   2
+6         black    5.0            0   3
+7         tabby    3.2            1   5
+8 tortoiseshell    3.3            1   9
+
+
+
+ +
+
+

Challenge 1 +

+
+

You can create a new data frame right from within R with the +following syntax:

+
+

R +

+
+df <- data.frame(id = c("a", "b", "c"),
+                 x = 1:3,
+                 y = c(TRUE, TRUE, FALSE))
+
+

Make a data frame that holds the following information for +yourself:

+
  • first name
  • +
  • last name
  • +
  • lucky number
  • +

Then use rbind to add an entry for the people sitting +beside you. Finally, use cbind to add a column with each +person’s answer to the question, “Is it time for coffee break?”

+
+
+
+
+
+ +
+
+
+

R +

+
+df <- data.frame(first = c("Grace"),
+                 last = c("Hopper"),
+                 lucky_number = c(0))
+df <- rbind(df, list("Marie", "Curie", 238) )
+df <- cbind(df, coffeetime = c(TRUE,TRUE))
+
+
+
+
+
+

Realistic example +

+

So far, you have seen the basics of manipulating data frames with our +cat data; now let’s use those skills to digest a more realistic dataset. +Let’s read in the gapminder dataset that we downloaded +previously:

+
+

R +

+
+gapminder <- read.csv("data/gapminder_data.csv")
+
+
+
+ +
+
+

Miscellaneous Tips +

+
+
  • Another type of file you might encounter is the tab-separated value file (.tsv). To specify a tab as a separator, use sep = "\t" in read.csv(), or use read.delim().

  • +
  • Files can also be downloaded directly from the Internet into a +local folder of your choice onto your computer using the +download.file function. The read.csv function +can then be executed to read the downloaded file from the download +location, for example,

  • +
+

R +

+
+download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
+gapminder <- read.csv("data/gapminder_data.csv")
+
+
  • Alternatively, you can also read in files directly into R from the +Internet by replacing the file paths with a web address in +read.csv. One should note that in doing this no local copy +of the csv file is first saved onto your computer. For example,
  • +
+

R +

+
+gapminder <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv")
+
+
  • You can read directly from excel spreadsheets without converting +them to plain text first by using the readxl +package.

  • +
  • The argument “stringsAsFactors” tells R whether to read strings as factors or as character strings. From R 4.0 onward, all strings are read in as characters by default, but in earlier versions of R, strings are read in as factors by default. For more information, see the call-out in the previous episode.

  • +
+
+
+
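The tab-separated tip above can be tried without any real data file. The sketch below writes a tiny made-up table to a temporary file (the values are placeholders, not gapminder data) and reads it back in both ways.

```r
# Write a small tab-separated file to a temporary location (toy values)
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("name\tyear\tcount",
             "A\t2000\t100",
             "B\t2005\t250"),
           tsv_path)

# Option 1: read.delim() uses a tab as its default separator
from_delim <- read.delim(tsv_path)

# Option 2: read.csv() with the separator set to a tab
from_csv <- read.csv(tsv_path, sep = "\t")

same_result <- identical(from_delim, from_csv)
```

Both calls produce the same data frame, so the choice between them is mostly a matter of which reads more clearly in your script.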

Let’s investigate gapminder a bit; the first thing we should always +do is check out what the data looks like with str:

+
+

R +

+
+str(gapminder)
+
+
+

OUTPUT +

+
'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...
+
+

An additional method for examining the structure of gapminder is to use the summary function. This function can be used on various objects in R. For data frames, summary yields a numeric, tabular, or descriptive summary of each column. Numeric or integer columns are described by descriptive statistics (quartiles and mean), and character columns by their length, class, and mode.

+
+

R +

+
+summary(gapminder)
+
+
+

OUTPUT +

+
   country               year           pop             continent        
+ Length:1704        Min.   :1952   Min.   :6.001e+04   Length:1704       
+ Class :character   1st Qu.:1966   1st Qu.:2.794e+06   Class :character  
+ Mode  :character   Median :1980   Median :7.024e+06   Mode  :character  
+                    Mean   :1980   Mean   :2.960e+07                     
+                    3rd Qu.:1993   3rd Qu.:1.959e+07                     
+                    Max.   :2007   Max.   :1.319e+09                     
+    lifeExp        gdpPercap       
+ Min.   :23.60   Min.   :   241.2  
+ 1st Qu.:48.20   1st Qu.:  1202.1  
+ Median :60.71   Median :  3531.8  
+ Mean   :59.47   Mean   :  7215.3  
+ 3rd Qu.:70.85   3rd Qu.:  9325.5  
+ Max.   :82.60   Max.   :113523.1  
+
+

Along with the str and summary functions, +we can examine individual columns of the data frame with our +typeof function:

+
+

R +

+
+typeof(gapminder$year)
+
+
+

OUTPUT +

+
[1] "integer"
+
+
+

R +

+
+typeof(gapminder$country)
+
+
+

OUTPUT +

+
[1] "character"
+
+
+

R +

+
+str(gapminder$country)
+
+
+

OUTPUT +

+
 chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+
+

We can also interrogate the data frame for information about its +dimensions; remembering that str(gapminder) said there were +1704 observations of 6 variables in gapminder, what do you think the +following will produce, and why?

+
+

R +

+
+length(gapminder)
+
+
+

OUTPUT +

+
[1] 6
+
+

A fair guess would have been to say that the length of a data frame +would be the number of rows it has (1704), but this is not the case; +remember, a data frame is a list of vectors and factors:

+
+

R +

+
+typeof(gapminder)
+
+
+

OUTPUT +

+
[1] "list"
+
+

When length gave us 6, it’s because gapminder is built +out of a list of 6 columns. To get the number of rows and columns in our +dataset, try:

+
+

R +

+
+nrow(gapminder)
+
+
+

OUTPUT +

+
[1] 1704
+
+
+

R +

+
+ncol(gapminder)
+
+
+

OUTPUT +

+
[1] 6
+
+

Or, both at once:

+
+

R +

+
+dim(gapminder)
+
+
+

OUTPUT +

+
[1] 1704    6
+
+

We’ll also likely want to know what the titles of all the columns +are, so we can ask for them later:

+
+

R +

+
+colnames(gapminder)
+
+
+

OUTPUT +

+
[1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"
+
+

At this stage, it’s important to ask ourselves if the structure R is +reporting matches our intuition or expectations; do the basic data types +reported for each column make sense? If not, we need to sort any +problems out now before they turn into bad surprises down the road, +using what we’ve learned about how R interprets data, and the importance +of strict consistency in how we record our data.

+

Once we’re happy that the data types and structures seem reasonable, +it’s time to start digging into our data proper. Check out the first few +lines:

+
+

R +

+
+head(gapminder)
+
+
+

OUTPUT +

+
      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134
+
+
+
+ +
+
+

Challenge 2 +

+
+

It’s good practice to also check the last few lines of your data and +some in the middle. How would you do this?

+

Searching for ones specifically in the middle isn’t too hard, but we +could ask for a few lines at random. How would you code this?

+
+
+
+
+
+ +
+
+

To check the last few lines it’s relatively simple as R already has a +function for this:

+
+

R +

+
+tail(gapminder)
+tail(gapminder, n = 15)
+
+

What about a few arbitrary rows just in case something is odd in the +middle?

+
+

Tip: There are several ways to achieve this.

+

The solution here presents one form of nesting function calls, i.e. the result of one function call is passed as an argument to another function. This might sound like a new concept, but you are already using it! Remember that my_dataframe[rows, cols] prints the rows and columns of your data frame that you asked for (you might have asked for a range or for named columns, for example). How would you get the last row if you don’t know how many rows your data frame has? R has a function for this. What about getting a (pseudorandom) sample? R also has a function for this.

+
+

R +

+
+gapminder[sample(nrow(gapminder), 5), ]
+
+
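A sketch of the same idea on a small made-up data frame, with two additions that are assumptions on our part rather than part of the solution: `set.seed()` makes the pseudorandom draw repeatable, and an explicit index range picks rows around the middle. The object `demo` is hypothetical.

```r
# Hypothetical 20-row table used only for this illustration
demo <- data.frame(id = 1:20, value = (1:20)^2)

# Fixing the seed means the same rows are drawn on every run
set.seed(42)
picked <- demo[sample(nrow(demo), 5), ]

# Five rows centred on the middle of the table
mid <- nrow(demo) %/% 2
middle <- demo[(mid - 2):(mid + 2), ]

# The last row, without knowing the number of rows in advance
last <- demo[nrow(demo), ]
```

Here `sample(nrow(demo), 5)` draws five distinct row numbers, and the outer `[ , ]` uses them to select rows, which is the nesting described above.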
+
+
+
+
+

To make sure our analysis is reproducible, we should put the code +into a script file so we can come back to it later.

+
+
+ +
+
+

Challenge 3 +

+
+

Go to file -> new file -> R script, and write an R script to +load in the gapminder dataset. Put it in the scripts/ +directory and add it to version control.

+

Run the script using the source function, using the file +path as its argument (or by pressing the “source” button in +RStudio).

+
+
+
+
+
+ +
+
+

The source function can be used to use a script within a +script. Assume you would like to load the same type of file over and +over again and therefore you need to specify the arguments to fit the +needs of your file. Instead of writing the necessary argument again and +again you could just write it once and save it as a script. Then, you +can use source("Your_Script_containing_the_load_function") +in a new script to use the function of that script without writing +everything again. Check out ?source to find out more.

+
+

R +

+
+download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
+gapminder <- read.csv(file = "data/gapminder_data.csv")
+
+

To run the script and load the data into the gapminder +variable:

+
+

R +

+
+source(file = "scripts/load-gapminder.R")
+
+
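The "script within a script" idea can be sketched without touching the project folders. Everything below is hypothetical: the helper `load_table()`, the temporary script, and the toy CSV exist only to show that `source()` makes the definitions in one file available to the code that follows.

```r
# A throwaway script that defines a (hypothetical) loading function
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "load_table <- function(path) {",
  "  read.csv(path, stringsAsFactors = FALSE)",
  "}"
), script_path)

# A toy CSV file to load
csv_path <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,a", "2,b"), csv_path)

# Running the script defines load_table() in the current session
source(script_path)

# The function can now be used as if it had been typed here
tbl <- load_table(csv_path)
```

In a real project the helper script would live under `scripts/` and be sourced by path, as in the solution above.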
+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

Read the output of str(gapminder) again; this time, use +what you’ve learned about lists and vectors, as well as the output of +functions like colnames and dim to explain +what everything that str prints out for gapminder means. If +there are any parts you can’t interpret, discuss with your +neighbors!

+
+
+
+
+
+ +
+
+

The object gapminder is a data frame with columns

+
  • +country and continent are character vectors.
  • +
  • +year is an integer vector.
  • +
  • +pop, lifeExp, and gdpPercap +are numeric vectors.
  • +
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
  • Use cbind() to add a new column to a data frame.
  • +
  • Use rbind() to add a new row to a data frame.
  • +
  • Remove rows from a data frame.
  • +
  • Use str(), summary(), nrow(), +ncol(), dim(), colnames(), +rownames(), head(), and typeof() +to understand the structure of a data frame.
  • +
  • Read in a csv file using read.csv().
  • +
  • Understand what length() of a data frame +represents.
  • +
+
+
+
+
+ + +
+
+
+ +
+
+ + diff --git a/06-data-subsetting.html b/06-data-subsetting.html new file mode 100644 index 000000000..1136db11c --- /dev/null +++ b/06-data-subsetting.html @@ -0,0 +1,1991 @@ + +R for Reproducible Scientific Analysis: Subsetting Data +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Subsetting Data

+

Last updated on 2023-10-26

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I work with subsets of data in R?
  • +
+
+
+
+
+
+

Objectives

+
  • To be able to subset vectors, factors, matrices, lists, and data +frames
  • +
  • To be able to extract individual and multiple elements: by index, by +name, using comparison operations
  • +
  • To be able to skip and remove elements from various data +structures.
  • +
+
+
+
+
+

R has many powerful subset operators. Mastering them will allow you +to easily perform complex operations on any kind of dataset.

+

There are six different ways we can subset any kind of object, and +three different subsetting operators for the different data +structures.

+

Let’s start with the workhorse of R: a simple numeric vector.

+
+

R +

+
+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')
+x
+
+
+

OUTPUT +

+
  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 
+
+
+
+ +
+
+

Atomic vectors +

+
+

In R, simple vectors containing character strings, numbers, or +logical values are called atomic vectors because they can’t be +further simplified.

+
+
+
+

So now that we’ve created a dummy vector to play with, how do we get +at its contents?

+

Accessing elements using their indices +

+

To extract elements of a vector we can give their corresponding +index, starting from one:

+
+

R +

+
+x[1]
+
+
+

OUTPUT +

+
  a 
+5.4 
+
+
+

R +

+
+x[4]
+
+
+

OUTPUT +

+
  d 
+4.8 
+
+

It may look different, but the square brackets operator is a +function. For vectors (and matrices), it means “get me the nth +element”.
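That the square brackets are a function can be shown by calling them in ordinary function form. In this sketch `x_demo` is a hypothetical copy of the vector, and the backticks are R's way of referring to a function whose name is not a regular identifier.

```r
# Hypothetical copy of the example vector
x_demo <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# The usual subsetting syntax
usual <- x_demo[1]

# The same operation written as an explicit function call
as_call <- `[`(x_demo, 1)

same <- identical(usual, as_call)
```

The bracket syntax is simply a more readable way of writing this call.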

+

We can ask for multiple elements at once:

+
+

R +

+
+x[c(1, 3)]
+
+
+

OUTPUT +

+
  a   c 
+5.4 7.1 
+
+

Or slices of the vector:

+
+

R +

+
+x[1:4]
+
+
+

OUTPUT +

+
  a   b   c   d 
+5.4 6.2 7.1 4.8 
+
+

The : operator creates a sequence of numbers from the left element to the right.

+
+

R +

+
+1:4
+
+
+

OUTPUT +

+
[1] 1 2 3 4
+
+
+

R +

+
+c(1, 2, 3, 4)
+
+
+

OUTPUT +

+
[1] 1 2 3 4
+
+

We can ask for the same element multiple times:

+
+

R +

+
+x[c(1,1,3)]
+
+
+

OUTPUT +

+
  a   a   c 
+5.4 5.4 7.1 
+
+

If we ask for an index beyond the length of the vector, R will return +a missing value:

+
+

R +

+
+x[6]
+
+
+

OUTPUT +

+
<NA> 
+  NA 
+
+

This is a vector of length one containing an NA, whose +name is also NA.

+

If we ask for the 0th element, we get an empty vector:

+
+

R +

+
+x[0]
+
+
+

OUTPUT +

+
named numeric(0)
+
+
+
+ +
+
+

Vector numbering in R starts at 1 +

+
+

In many programming languages (C and Python, for example), the first +element of a vector has an index of 0. In R, the first element is 1.

+
+
+
+

Skipping and removing elements +

+

If we use a negative number as the index of a vector, R will return +every element except for the one specified:

+
+

R +

+
+x[-2]
+
+
+

OUTPUT +

+
  a   c   d   e 
+5.4 7.1 4.8 7.5 
+
+

We can skip multiple elements:

+
+

R +

+
+x[c(-1, -5)]  # or x[-c(1,5)]
+
+
+

OUTPUT +

+
  b   c   d 
+6.2 7.1 4.8 
+
+
+
+ +
+
+

Tip: Order of operations +

+
+

A common trip up for novices occurs when trying to skip slices of a +vector. It’s natural to try to negate a sequence like so:

+
+

R +

+
+x[-1:3]
+
+

This gives a somewhat cryptic error:

+
+

ERROR +

+
Error in x[-1:3]: only 0's may be mixed with negative subscripts
+
+

But remember the order of operations. : is really a +function. It takes its first argument as -1, and its second as 3, so +generates the sequence of numbers: c(-1, 0, 1, 2, 3).

+

The correct solution is to wrap that function call in parentheses, so that the - operator applies to the result:

+
+

R +

+
+x[-(1:3)]
+
+
+

OUTPUT +

+
  d   e 
+4.8 7.5 
+
+
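Evaluating the two index expressions on their own, outside of any subsetting, makes the difference visible. The variable names below are hypothetical.

```r
# Unary minus binds tighter than ':', so this is the sequence from -1 to 3
without_parens <- -1:3    # -1  0  1  2  3

# Parentheses build 1:3 first and then negate every element
with_parens <- -(1:3)     # -1 -2 -3
```

Only the second vector consists purely of negative indices, which is why it is accepted for skipping elements.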
+
+
+

To remove elements from a vector, we need to assign the result back +into the variable:

+
+

R +

+
+x <- x[-4]
+x
+
+
+

OUTPUT +

+
  a   b   c   e 
+5.4 6.2 7.1 7.5 
+
+
+
+ +
+
+

Challenge 1 +

+
+

Given the following code:

+
+

R +

+
+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')
+print(x)
+
+
+

OUTPUT +

+
  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 
+
+

Come up with at least 2 different commands that will produce the +following output:

+
+

OUTPUT +

+
  b   c   d 
+6.2 7.1 4.8 
+
+

After you find 2 different commands, compare notes with your +neighbour. Did you have different strategies?

+
+
+
+
+
+ +
+
+
+

R +

+
+x[2:4]
+
+
+

OUTPUT +

+
  b   c   d 
+6.2 7.1 4.8 
+
+
+

R +

+
+x[-c(1,5)]
+
+
+

OUTPUT +

+
  b   c   d 
+6.2 7.1 4.8 
+
+
+

R +

+
+x[c(2,3,4)]
+
+
+

OUTPUT +

+
  b   c   d 
+6.2 7.1 4.8 
+
+
+
+
+
+

Subsetting by name +

+

We can extract elements by using their name, instead of extracting by +index:

+
+

R +

+
+x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we can name a vector 'on the fly'
+x[c("a", "c")]
+
+
+

OUTPUT +

+
  a   c 
+5.4 7.1 
+
+

This is usually a much more reliable way to subset objects: the +position of various elements can often change when chaining together +subsetting operations, but the names will always remain the same!
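As a small illustration of why names are more robust than positions, the sketch below drops one element and then compares positional and named access. The name `y` is introduced here only for the example:

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# Drop the first element: everything shifts one position to the left
y <- x[-1]

# Position 2 now refers to a different element than it did in x ...
x[2]   # element "b"
y[2]   # element "c"

# ... but the name "c" still picks out the same value in both vectors
x["c"]
y["c"]
```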

+

Subsetting through other logical operations +

+

We can also use any logical vector to subset:

+
+

R +

+
+x[c(FALSE, FALSE, TRUE, FALSE, TRUE)]
+
+
+

OUTPUT +

+
  c   e 
+7.1 7.5 
+
+

Since comparison operators (e.g. >, +<, ==) evaluate to logical vectors, we can +also use them to succinctly subset vectors: the following statement +gives the same result as the previous one.

+
+

R +

+
+x[x > 7]
+
+
+

OUTPUT +

+
  c   e 
+7.1 7.5 
+
+

Breaking it down, this statement first evaluates x>7, +generating a logical vector +c(FALSE, FALSE, TRUE, FALSE, TRUE), and then selects the +elements of x corresponding to the TRUE +values.
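The two steps described above can be run separately to inspect the intermediate logical vector. In this sketch, `keep` is a name chosen for illustration:

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# Step 1: the comparison alone gives one TRUE/FALSE per element
keep <- x > 7
keep

# Step 2: the logical vector selects the elements marked TRUE
x[keep]

# which() converts the logical vector into the matching positions
which(keep)
```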

+

We can use == to mimic the previous method of indexing +by name (remember you have to use == rather than += for comparisons):

+
+

R +

+
+x[names(x) == "a"]
+
+
+

OUTPUT +

+
  a 
+5.4 
+
+
+
+ +
+
+

Tip: Combining logical conditions +

+
+

We often want to combine multiple logical criteria. For example, we +might want to find all the countries that are located in Asia +or Europe and have life expectancies +within a certain range. Several operations for combining logical vectors +exist in R:

+
  • +&, the “logical AND” operator: returns +TRUE if both the left and right are TRUE.
  • +
  • +|, the “logical OR” operator: returns +TRUE, if either the left or right (or both) are +TRUE.
  • +

You may sometimes see && and || instead of & and |. These two-character operators expect a single logical value on each side: since R 4.3.0 they raise an error if either argument has length greater than one, whereas older versions silently used only the first element of each vector. In general you should not use the two-character operators in data analysis; save them for programming, i.e. deciding whether to execute a statement.

+
  • +!, the “logical NOT” operator: converts TRUE to FALSE and FALSE to TRUE. It can negate a single logical condition (e.g. !TRUE becomes FALSE), or a whole vector of conditions (e.g. !c(TRUE, FALSE) becomes c(FALSE, TRUE)).
  • +

Additionally, you can compare the elements within a single vector +using the all function (which returns TRUE if +every element of the vector is TRUE) and the +any function (which returns TRUE if one or +more elements of the vector are TRUE).
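The operators and functions in this tip can be tried on the example vector used throughout this episode. This is a minimal sketch; the thresholds are arbitrary:

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# & is TRUE only where both conditions hold
x[x > 5 & x < 7]

# | is TRUE where at least one condition holds
x[x < 5 | x > 7]

# ! flips every element of a logical vector
x[!(x > 5)]

# any() and all() collapse a logical vector to a single value
any(x > 7)
all(x > 5)
```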

+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

Given the following code:

+
+

R +

+
+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')
+print(x)
+
+
+

OUTPUT +

+
  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 
+
+

Write a subsetting command to return the values in x that are greater +than 4 and less than 7.

+
+
+
+
+
+ +
+
+
+

R +

+
+x_subset <- x[x<7 & x>4]
+print(x_subset)
+
+
+

OUTPUT +

+
  a   b   d 
+5.4 6.2 4.8 
+
+
+
+
+
+
+
+ +
+
+

Tip: Non-unique names +

+
+

You should be aware that it is possible for multiple elements in a +vector to have the same name. (For a data frame, columns can have the +same name — although R tries to avoid this — but row names must be +unique.) Consider these examples:

+
+

R +

+
+x <- 1:3
+x
+
+
+

OUTPUT +

+
[1] 1 2 3
+
+
+

R +

+
+names(x) <- c('a', 'a', 'a')
+x
+
+
+

OUTPUT +

+
a a a 
+1 2 3 
+
+
+

R +

+
+x['a']  # only returns first value
+
+
+

OUTPUT +

+
a 
+1 
+
+
+

R +

+
+x[names(x) == 'a']  # returns all three values
+
+
+

OUTPUT +

+
a a a 
+1 2 3 
+
+
+
+
+
+
+ +
+
+

Tip: Getting help for operators +

+
+

Remember you can search for help on operators by wrapping them in +quotes: help("%in%") or ?"%in%".

+
+
+
+

Skipping named elements +

+

Skipping or removing named elements is a little harder. If we try to +skip one named element by negating the string, R complains (slightly +obscurely) that it doesn’t know how to take the negative of a +string:

+
+

R +

+
+x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we start again by naming a vector 'on the fly'
+x[-"a"]
+
+
+

ERROR +

+
Error in -"a": invalid argument to unary operator
+
+

However, we can use the != (not-equals) operator to +construct a logical vector that will do what we want:

+
+

R +

+
+x[names(x) != "a"]
+
+
+

OUTPUT +

+
  b   c   d   e 
+6.2 7.1 4.8 7.5 
+
+

Skipping multiple named indices is a little bit harder still. Suppose +we want to drop the "a" and "c" elements, so +we try this:

+
+

R +

+
+x[names(x)!=c("a","c")]
+
+
+

WARNING +

+
Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length
+
+
+

OUTPUT +

+
  b   c   d   e 
+6.2 7.1 4.8 7.5 
+
+

R did something, but it gave us a warning that we ought to +pay attention to - and it apparently gave us the wrong answer +(the "c" element is still included in the vector)!

+

So what does != actually do in this case? That’s an +excellent question.

+
+

Recycling

+

Let’s take a look at the comparison component of this code:

+
+

R +

+
+names(x) != c("a", "c")
+
+
+

WARNING +

+
Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length
+
+
+

OUTPUT +

+
[1] FALSE  TRUE  TRUE  TRUE  TRUE
+
+

Why does R give TRUE as the third element of this +vector, when names(x)[3] != "c" is obviously false? When +you use !=, R tries to compare each element of the left +argument with the corresponding element of its right argument. What +happens when you compare vectors of different lengths?

+
Inequality testing

When one vector is shorter than the other, it gets +recycled:

+
Inequality testing: results of recycling

In this case R repeats c("a", "c") as many times as necessary to match names(x), i.e. we get c("a","c","a","c","a"). Since the recycled "a" doesn’t match the third element of names(x), the value of != is TRUE. Because in this case the longer vector length (5) isn’t a multiple of the shorter vector length (2), R printed a warning message. If we had been unlucky and names(x) had contained six elements, R would silently have done the wrong thing (i.e., not what we intended it to do). This recycling rule can introduce subtle, hard-to-find bugs!
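The silent case can be reproduced with a six-element vector. The vector `x6` and its extra element `f` are made up for this sketch:

```r
# Hypothetical six-element vector: 6 is a multiple of 2, so no warning is raised
x6 <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5, f = 3.9)

# c("a", "c") is silently recycled to c("a", "c", "a", "c", "a", "c")
mask <- names(x6) != c("a", "c")
mask

# Only "a" is dropped; "c" survives because it was compared with "a"
x6[mask]
```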

+

The way to get R to do what we really want (match each element of the left argument with all of the elements of the right argument) is to use the %in% operator. The %in% operator goes through each element of its left argument, in this case the names of x, and asks, “Does this element occur in the second argument?”. Here, since we want to exclude values, we also need a ! operator to change “in” to “not in”:

+
+

R +

+
+x[! names(x) %in% c("a","c") ]
+
+
+

OUTPUT +

+
  b   d   e 
+6.2 4.8 7.5 
+
+
+
+ +
+
+

Challenge 3 +

+
+

Selecting elements of a vector that match any of a list of components +is a very common data analysis task. For example, the gapminder data set +contains country and continent variables, but +no information between these two scales. Suppose we want to pull out +information from southeast Asia: how do we set up an operation to +produce a logical vector that is TRUE for all of the +countries in southeast Asia and FALSE otherwise?

+

Suppose you have these data:

+
+

R +

+
+seAsia <- c("Myanmar","Thailand","Cambodia","Vietnam","Laos")
+## read in the gapminder data that we downloaded in episode 2
+gapminder <- read.csv("data/gapminder_data.csv", header=TRUE)
+## extract the `country` column from a data frame (we'll see this later);
+## convert from a factor to a character;
+## and get just the non-repeated elements
+countries <- unique(as.character(gapminder$country))
+
+

There’s a wrong way (using only ==), which will give you +a warning; a clunky way (using the logical operators == and +|); and an elegant way (using %in%). See +whether you can come up with all three and explain how they (don’t) +work.

+
+
+
+
+
+ +
+
+
  • The wrong way to do this problem is countries==seAsia. This gives a warning ("In countries == seAsia : longer object length is not a multiple of shorter object length") and the wrong answer (a vector of all FALSE values), because none of the recycled values of seAsia happen to line up correctly with matching values in countries.
  • +
  • The clunky (but technically correct) way to do this +problem is
  • +
+

R +

+
+ (countries=="Myanmar" | countries=="Thailand" |
+ countries=="Cambodia" | countries == "Vietnam" | countries=="Laos")
+
+

(or countries==seAsia[1] | countries==seAsia[2] | ...). +This gives the correct values, but hopefully you can see how awkward it +is (what if we wanted to select countries from a much longer list?).

+
  • The best way to do this problem is +countries %in% seAsia, which is both correct and easy to +type (and read).
  • +
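The %in% approach can be tried without the data file. In this sketch, `countries` is a short made-up stand-in for the full vector of gapminder countries:

```r
seAsia <- c("Myanmar", "Thailand", "Cambodia", "Vietnam", "Laos")

# Hypothetical stand-in for the full vector of gapminder countries
countries <- c("France", "Laos", "Kenya", "Vietnam", "Peru", "Thailand", "Chile")

# %in% checks each country against the whole seAsia vector
in_seAsia <- countries %in% seAsia
in_seAsia

# The logical vector can then be used to subset
countries[in_seAsia]
```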
+
+
+
+
+

Handling special values +

+

At some point you will encounter functions in R that cannot handle +missing, infinite, or undefined data.

+

There are a number of special functions you can use to filter out +this data:

+
  • +is.na will return all positions in a vector, matrix, or +data.frame containing NA (or NaN)
  • +
  • likewise, is.nan and is.infinite will do the same for NaN and Inf.
  • +
  • +is.finite will return all positions in a vector, +matrix, or data.frame that do not contain NA, +NaN or Inf.
  • +
  • +na.omit will filter out all missing values from a +vector
  • +
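The functions in the list above can be compared side by side on a small vector. This is a minimal sketch with a made-up vector `v`:

```r
v <- c(1, NA, NaN, Inf, -Inf)

is.na(v)        # TRUE for NA and NaN
is.nan(v)       # TRUE for NaN only
is.infinite(v)  # TRUE for Inf and -Inf
is.finite(v)    # TRUE only for ordinary numbers

# These functions return logical vectors, so they can be used to subset
v[is.finite(v)]

# na.omit() drops NA and NaN but keeps infinite values;
# as.vector() strips the bookkeeping attributes it adds
cleaned <- as.vector(na.omit(v))
cleaned
```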

Factor subsetting +

+

Now that we’ve explored the different ways to subset vectors, how do +we subset the other data structures?

+

Factor subsetting works the same way as vector subsetting.

+
+

R +

+
+f <- factor(c("a", "a", "b", "c", "c", "d"))
+f[f == "a"]
+
+
+

OUTPUT +

+
[1] a a
+Levels: a b c d
+
+
+

R +

+
+f[f %in% c("b", "c")]
+
+
+

OUTPUT +

+
[1] b c c
+Levels: a b c d
+
+
+

R +

+
+f[1:3]
+
+
+

OUTPUT +

+
[1] a a b
+Levels: a b c d
+
+

Skipping elements will not remove the level even if no more of that +category exists in the factor:

+
+

R +

+
+f[-3]
+
+
+

OUTPUT +

+
[1] a a c c d
+Levels: a b c d
+
+
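If the unused level is not wanted, base R's droplevels() function removes it. The sketch below uses the same factor; `f_skip` and `f_clean` are names chosen for illustration:

```r
f <- factor(c("a", "a", "b", "c", "c", "d"))

# "b" no longer occurs after dropping element 3, but it is still a level
f_skip <- f[-3]
levels(f_skip)

# droplevels() discards levels that have no remaining observations
f_clean <- droplevels(f_skip)
levels(f_clean)
```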

Matrix subsetting +

+

Matrices are also subsetted using the [ function. In +this case it takes two arguments: the first applying to the rows, the +second to its columns:

+
+

R +

+
+set.seed(1)
+m <- matrix(rnorm(6*4), ncol=4, nrow=6)
+m[3:4, c(3,1)]
+
+
+

OUTPUT +

+
            [,1]       [,2]
+[1,]  1.12493092 -0.8356286
+[2,] -0.04493361  1.5952808
+
+

You can leave the first or second arguments blank to retrieve all the +rows or columns respectively:

+
+

R +

+
+m[, c(3,4)]
+
+
+

OUTPUT +

+
            [,1]        [,2]
+[1,] -0.62124058  0.82122120
+[2,] -2.21469989  0.59390132
+[3,]  1.12493092  0.91897737
+[4,] -0.04493361  0.78213630
+[5,] -0.01619026  0.07456498
+[6,]  0.94383621 -1.98935170
+
+

If we only access one row or column, R will automatically convert the +result to a vector:

+
+

R +

+
+m[3,]
+
+
+

OUTPUT +

+
[1] -0.8356286  0.5757814  1.1249309  0.9189774
+
+

If you want to keep the output as a matrix, you need to specify a third argument, drop = FALSE:

+
+

R +

+
+m[3, , drop=FALSE]
+
+
+

OUTPUT +

+
           [,1]      [,2]     [,3]      [,4]
+[1,] -0.8356286 0.5757814 1.124931 0.9189774
+
+

Unlike vectors, if we try to access a row or column outside of the +matrix, R will throw an error:

+
+

R +

+
+m[, c(3,6)]
+
+
+

ERROR +

+
Error in m[, c(3, 6)]: subscript out of bounds
+
+
+
+ +
+
+

Tip: Higher dimensional arrays +

+
+

When dealing with multi-dimensional arrays, each argument to [ corresponds to a dimension. For example, for a 3D array, the first three arguments correspond to the rows, columns, and depth dimension.
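A small 3D array makes this concrete. The array `a` below is made up for this sketch:

```r
# A 2 x 3 x 4 array filled (column-major) with the numbers 1 to 24
a <- array(1:24, dim = c(2, 3, 4))

# One index per dimension: row 1, column 2, third "layer"
a[1, 2, 3]

# Leaving an index blank keeps that whole dimension, as with matrices
dim(a[, , 1])
```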

+
+
+
+

Because matrices are vectors, we can also subset using only one +argument:

+
+

R +

+
+m[5]
+
+
+

OUTPUT +

+
[1] 0.3295078
+
+

This usually isn’t useful, and is often confusing to read. However, it is useful to note that matrices are laid out in column-major format by default. That is, the elements of the vector are arranged column-wise:

+
+

R +

+
+matrix(1:6, nrow=2, ncol=3)
+
+
+

OUTPUT +

+
     [,1] [,2] [,3]
+[1,]    1    3    5
+[2,]    2    4    6
+
+

If you wish to populate the matrix by row, use +byrow=TRUE:

+
+

R +

+
+matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
+
+
+

OUTPUT +

+
     [,1] [,2] [,3]
+[1,]    1    2    3
+[2,]    4    5    6
+
+

Matrices can also be subsetted using their rownames and column names +instead of their row and column indices.
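As a brief sketch of name-based matrix subsetting, the example below builds a matrix with made-up row and column names (`r1`, `r2`, `A`, `B`, `C`) via the dimnames argument:

```r
# A small matrix with hypothetical row and column names
m_named <- matrix(1:6, nrow = 2, ncol = 3,
                  dimnames = list(c("r1", "r2"), c("A", "B", "C")))

# Select a single cell by row name and column name
m_named["r2", "B"]

# Select all rows, and two columns by name
m_named[, c("A", "C")]
```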

+
+
+ +
+
+

Challenge 4 +

+
+

Given the following code:

+
+

R +

+
+m <- matrix(1:18, nrow=3, ncol=6)
+print(m)
+
+
+

OUTPUT +

+
     [,1] [,2] [,3] [,4] [,5] [,6]
+[1,]    1    4    7   10   13   16
+[2,]    2    5    8   11   14   17
+[3,]    3    6    9   12   15   18
+
+
  1. Which of the following commands will extract the values 11 and +14?
  2. +

A. m[2,4,2,5]

+

B. m[2:5]

+

C. m[4:5,2]

+

D. m[2,c(4,5)]

+
+
+
+
+
+ +
+
+

D

+
+
+
+
+

List subsetting +

+

Now we’ll introduce some new subsetting operators. There are three +functions used to subset lists. We’ve already seen these when learning +about atomic vectors and matrices: [, [[, and +$.

+

Using [ will always return a list. If you want to +subset a list, but not extract an element, then you +will likely use [.

+
+

R +

+
+xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))
+xlist[1]
+
+
+

OUTPUT +

+
$a
+[1] "Software Carpentry"
+
+

This returns a list with one element.

+

We can subset elements of a list exactly the same way as atomic vectors using [. Comparison operations, however, won’t work because they are not recursive: they try to condition on the data structures in each element of the list, not on the individual elements within those data structures.

+
+

R +

+
+xlist[1:2]
+
+
+

OUTPUT +

+
$a
+[1] "Software Carpentry"
+
+$b
+ [1]  1  2  3  4  5  6  7  8  9 10
+
+
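Although comparison operators do not work on a list directly, a logical vector can still be built by asking a question of each element as a whole. The sketch below is one possible approach using base R's vapply(); `is_num` is a name chosen for illustration:

```r
xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))

# Ask a yes/no question of each element as a whole ...
is_num <- vapply(xlist, is.numeric, logical(1))
is_num

# ... and use the resulting logical vector with [ to keep a sub-list
xlist[is_num]
```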

To extract individual elements of a list, you need to use the +double-square bracket function: [[.

+
+

R +

+
+xlist[[1]]
+
+
+

OUTPUT +

+
[1] "Software Carpentry"
+
+

Notice that now the result is a vector, not a list.

+

You can’t extract more than one element at once:

+
+

R +

+
+xlist[[1:2]]
+
+
+

ERROR +

+
Error in xlist[[1:2]]: subscript out of bounds
+
+

Nor use it to skip elements:

+
+

R +

+
+xlist[[-1]]
+
+
+

ERROR +

+
Error in xlist[[-1]]: invalid negative subscript in get1index <real>
+
+

But you can use names to both subset and extract elements:

+
+

R +

+
+xlist[["a"]]
+
+
+

OUTPUT +

+
[1] "Software Carpentry"
+
+

The $ function is a shorthand for extracting elements by name:

+
+

R +

+
+xlist$data
+
+
+

OUTPUT +

+
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
+Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
+Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
+Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
+Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
+Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
+Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
+
+
+
+ +
+
+

Challenge 5 +

+
+

Given the following list:

+
+

R +

+
+xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))
+
+

Using your knowledge of both list and vector subsetting, extract the +number 2 from xlist. Hint: the number 2 is contained within the “b” item +in the list.

+
+
+
+
+
+ +
+
+
+

R +

+
+xlist$b[2]
+
+
+

OUTPUT +

+
[1] 2
+
+
+

R +

+
+xlist[[2]][2]
+
+
+

OUTPUT +

+
[1] 2
+
+
+

R +

+
+xlist[["b"]][2]
+
+
+

OUTPUT +

+
[1] 2
+
+
+
+
+
+
+
+ +
+
+

Challenge 6 +

+
+

Given a linear model:

+
+

R +

+
+mod <- aov(pop ~ lifeExp, data=gapminder)
+
+

Extract the residual degrees of freedom (hint: +attributes() will help you)

+
+
+
+
+
+ +
+
+
+

R +

+
+attributes(mod) ## `df.residual` is one of the names of `mod`
+
+
+

R +

+
+mod$df.residual
+
+
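The same steps can be tried without the gapminder file. In this sketch, the built-in mtcars data and the formula mpg ~ wt are only stand-ins for the lesson's model:

```r
# mtcars ships with R; mpg ~ wt is only a stand-in for the lesson's model
mod_cars <- aov(mpg ~ wt, data = mtcars)

# The names of the fitted object list the components that $ can extract
names(mod_cars)

# Residual degrees of freedom: 32 observations minus 2 estimated coefficients
mod_cars$df.residual
```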
+
+
+
+

Data frames +

+

Remember that data frames are lists under the hood, so similar rules apply. However, they are also two-dimensional objects:

+

[ with one argument will act the same way as for lists, +where each list element corresponds to a column. The resulting object +will be a data frame:

+
+

R +

+
+head(gapminder[3])
+
+
+

OUTPUT +

+
       pop
+1  8425333
+2  9240934
+3 10267083
+4 11537966
+5 13079460
+6 14880372
+
+

Similarly, [[ will act to extract a single +column:

+
+

R +

+
+head(gapminder[["lifeExp"]])
+
+
+

OUTPUT +

+
[1] 28.801 30.332 31.997 34.020 36.088 38.438
+
+

And $ provides a convenient shorthand to extract columns +by name:

+
+

R +

+
+head(gapminder$year)
+
+
+

OUTPUT +

+
[1] 1952 1957 1962 1967 1972 1977
+
+

With two arguments, [ behaves the same way as for +matrices:

+
+

R +

+
+gapminder[1:3,]
+
+
+

OUTPUT +

+
      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+
+

If we subset a single row, the result will be a data frame (because +the elements are mixed types):

+
+

R +

+
+gapminder[3,]
+
+
+

OUTPUT +

+
      country year      pop continent lifeExp gdpPercap
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+
+

But for a single column the result will be a vector (this can be +changed with the third argument, drop = FALSE).
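The effect of drop = FALSE can be checked on a small data frame. The data frame `df` below is made up for this sketch and merely stands in for gapminder:

```r
# A small hypothetical data frame standing in for gapminder
df <- data.frame(country = c("A", "B", "C"),
                 year = c(1952, 1957, 1962),
                 pop = c(10, 20, 30))

# A single column is simplified to a plain vector by default
class(df[, "pop"])

# drop = FALSE keeps the one-column result as a data frame
class(df[, "pop", drop = FALSE])
```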

+
+
+ +
+
+

Challenge 7 +

+
+

Fix each of the following common data frame subsetting errors:

+
  1. Extract observations collected for the year 1957
  2. +
+

R +

+
gapminder[gapminder$year = 1957,]
+
+
  1. Extract all columns except 1 through to 4
  2. +
+

R +

+
+gapminder[,-1:4]
+
+
  1. Extract the rows where the life expectancy is longer than 80 years
  2. +
+

R +

+
+gapminder[gapminder$lifeExp > 80]
+
+
  1. Extract the first row, and the fourth and fifth columns +(continent and lifeExp).
  2. +
+

R +

+
+gapminder[1, 4, 5]
+
+
  1. Advanced: extract rows that contain information for the years 2002 +and 2007
  2. +
+

R +

+
+gapminder[gapminder$year == 2002 | 2007,]
+
+
+
+
+
+
+ +
+
+

Fix each of the following common data frame subsetting errors:

+
  1. Extract observations collected for the year 1957
  2. +
+

R +

+
+# gapminder[gapminder$year = 1957,]
+gapminder[gapminder$year == 1957,]
+
+
  1. Extract all columns except 1 through to 4
  2. +
+

R +

+
+# gapminder[,-1:4]
+gapminder[,-c(1:4)]
+
+
  1. Extract the rows where the life expectancy is longer than 80 +years
  2. +
+

R +

+
+# gapminder[gapminder$lifeExp > 80]
+gapminder[gapminder$lifeExp > 80,]
+
+
  1. Extract the first row, and the fourth and fifth columns +(continent and lifeExp).
  2. +
+

R +

+
+# gapminder[1, 4, 5]
+gapminder[1, c(4, 5)]
+
+
  1. Advanced: extract rows that contain information for the years 2002 +and 2007
  2. +
+

R +

+
+# gapminder[gapminder$year == 2002 | 2007,]
+gapminder[gapminder$year == 2002 | gapminder$year == 2007,]
+gapminder[gapminder$year %in% c(2002, 2007),]
+
+
+
+
+
+
+
+ +
+
+

Challenge 8 +

+
+
  1. Why does gapminder[1:20] return an error? How does +it differ from gapminder[1:20, ]?

  2. +
  3. Create a new data.frame called +gapminder_small that only contains rows 1 through 9 and 19 +through 23. You can do this in one or two steps.

  4. +
+
+
+
+
+ +
+
+
  1. gapminder is a data.frame, so [ with a single argument selects columns, as it does for a list. gapminder has only 6 columns, so asking for columns 1 to 20 fails. gapminder[1:20, ] subsets on two dimensions to give the first 20 rows and all columns.

  2. +
  3. +
  4. +
+

R +

+
+gapminder_small <- gapminder[c(1:9, 19:23),]
+
+
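The difference between one and two indices can be explored with any data frame. This sketch uses the built-in mtcars data as a stand-in and wraps the failing call in tryCatch() so the script keeps running:

```r
# mtcars ships with R and has 11 columns
ncol(mtcars)

# One index selects columns, so asking for 20 of them fails;
# tryCatch() captures the error message instead of stopping
result <- tryCatch(mtcars[1:20], error = function(e) conditionMessage(e))
result

# Two indices (rows, columns) select the first 20 rows and every column
dim(mtcars[1:20, ])
```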
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
  • Indexing in R starts at 1, not 0.
  • +
  • Access individual values by location using [].
  • +
  • Access slices of data using [low:high].
  • +
  • Access arbitrary sets of data using [c(...)].
  • +
  • Use logical operations and logical vectors to access subsets of +data.
  • +
+
+
+
+
+ + +
+
+
+ +
+
+ + diff --git a/07-control-flow.html b/07-control-flow.html new file mode 100644 index 000000000..590210c89 --- /dev/null +++ b/07-control-flow.html @@ -0,0 +1,1247 @@ + +R for Reproducible Scientific Analysis: Control Flow +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Control Flow

+

Last updated on 2023-10-26

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I make data-dependent choices in R?
  • +
  • How can I repeat operations in R?
  • +
+
+
+
+
+
+

Objectives

+
  • Write conditional statements with if...else statements +and ifelse().
  • +
  • Write and understand for() loops.
  • +
+
+
+
+
+

Often when we’re coding we want to control the flow of our actions. This can be done by setting actions to occur only if a condition or a set of conditions is met. Alternatively, we can also set an action to occur a particular number of times.

+

There are several ways you can control flow in R. For conditional +statements, the most commonly used approaches are the constructs:

+
+

R +

+
# if
+if (condition is true) {
+  perform action
+}
+
+# if ... else
+if (condition is true) {
+  perform action
+} else {  # that is, if the condition is false,
+  perform alternative action
+}
+
+

Say, for example, that we want R to print a message if a variable +x has a particular value:

+
+

R +

+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+}
+
+x
+
+
+

OUTPUT +

+
[1] 8
+
+

The print statement does not appear in the console because x is not greater than or equal to 10. To print a different message for numbers less than 10, we can add an else statement.

+
+

R +

+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else {
+  print("x is less than 10")
+}
+
+
+

OUTPUT +

+
[1] "x is less than 10"
+
+

You can also test multiple conditions by using +else if.

+
+

R +

+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else if (x > 5) {
+  print("x is greater than 5, but less than 10")
+} else {
+  print("x is less than or equal to 5")
+}
+
+
+

OUTPUT +

+
[1] "x is greater than 5, but less than 10"
+
+

Important: when R evaluates the condition inside +if() statements, it is looking for a logical element, i.e., +TRUE or FALSE. This can cause some headaches +for beginners. For example:

+
+

R +

+
+x  <-  4 == 3
+if (x) {
+  "4 equals 3"
+} else {
+  "4 does not equal 3"
+}
+
+
+

OUTPUT +

+
[1] "4 does not equal 3"
+
+

As we can see, the “does not equal” message was printed because the vector x is FALSE.

+
+

R +

+
+x <- 4 == 3
+x
+
+
+

OUTPUT +

+
[1] FALSE
+
+
+
+ +
+
+

Challenge 1 +

+
+

Use an if() statement to print a suitable message +reporting whether there are any records from 2002 in the +gapminder dataset. Now do the same for 2012.

+
+
+
+
+
+ +
+
+

We will first see a solution to Challenge 1 which does not use the any() function. We first use a logical vector describing which elements of gapminder$year are equal to 2002 to select the matching rows of gapminder:

+
+

R +

+
+gapminder[(gapminder$year == 2002),]
+
+

Then, we count the number of rows of the data.frame gapminder that correspond to the year 2002:

+
+

R +

+
+rows2002_number <- nrow(gapminder[(gapminder$year == 2002),])
+
+

The presence of any record for the year 2002 is equivalent to requiring that rows2002_number is one or more:

+
+

R +

+
+rows2002_number >= 1
+
+

Putting it all together, we obtain:

+
+

R +

+
+if(nrow(gapminder[(gapminder$year == 2002),]) >= 1){
+   print("Record(s) for the year 2002 found.")
+}
+
+

All this can be done more quickly with any(). The +logical condition can be expressed as:

+
+

R +

+
+if(any(gapminder$year == 2002)){
+   print("Record(s) for the year 2002 found.")
+}
+
+
+
+
+
+

Did anyone get an error message like this?

+
+

ERROR +

+
Error in if (gapminder$year == 2012) {: the condition has length > 1
+
+

The if() function only accepts singular (of length 1) inputs. In R 4.2.0 and later it returns an error when the condition is a vector with more than one element; older versions of R only issued a warning and evaluated the condition using the first element of the vector. Therefore, to use the if() function, you need to make sure your input is singular (of length 1).

+
+
+ +
+
+

Tip: Built in ifelse() +function +

+
+

R accepts both if() and +else if() statements structured as outlined above, but also +statements using R’s built-in ifelse() +function. This function accepts both singular and vector inputs and is +structured as follows:

+
+

R +

+
# ifelse function
+ifelse(condition is true, perform action, perform alternative action)
+
+

where the first argument is the condition or a set of conditions to be met, the second argument is the value returned where the condition is TRUE, and the third argument is the value returned where the condition is FALSE.

+
+

R +

+
+y <- -3
+ifelse(y < 0, "y is a negative number", "y is either positive or zero")
+
+
+

OUTPUT +

+
[1] "y is a negative number"
+
+
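The example above uses a single value, but ifelse() also works element by element on a vector. The sketch below uses a made-up three-element vector; `y_labels` is a name chosen for illustration:

```r
y <- c(-3, 0, 2)

# The test is applied element by element; one result per input value
y_labels <- ifelse(y < 0, "negative", "positive or zero")
y_labels
```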
+
+
+
+
+ +
+
+

Tip: any() and +all() +

+
+

The any() function will return TRUE if at +least one TRUE value is found within a vector, otherwise it +will return FALSE. This can be used in a similar way to the +%in% operator. The function all(), as the name +suggests, will only return TRUE if all values in the vector +are TRUE.
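A short sketch of both functions, using a made-up vector `years` in place of gapminder$year:

```r
years <- c(1952, 1957, 2002, 2007)

# The comparison gives four logical values; any() reduces them to one
any(years == 2002)
any(years == 2012)

# all() is TRUE only if every comparison is TRUE
all(years > 1950)

# Equivalent membership test with %in%
2002 %in% years
```

Because any() and all() always return a single logical value, their result is safe to use as the condition of an if() statement.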

+
+
+
+

Repeating operations +

+

If you want to iterate over a set of values, when the order of +iteration is important, and perform the same operation on each, a +for() loop will do the job. We saw for() loops +in the shell +lessons earlier. This is the most flexible of looping operations, +but therefore also the hardest to use correctly. In general, the advice +of many R users would be to learn about for() +loops, but to avoid using for() loops unless the order of +iteration is important: i.e. the calculation at each iteration depends +on the results of previous iterations. If the order of iteration is not +important, then you should learn about vectorized alternatives, such as +the purrr package, as they pay off in computational +efficiency.
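To illustrate the difference, the sketch below computes the same result three ways in base R: with a for() loop, with a vectorized function, and with vapply(). The variable names are chosen for illustration, and vapply() is used here as a base R counterpart to the purrr functions mentioned above:

```r
values <- c(1, 4, 9, 16)

# Loop version: fill a pre-sized result one element at a time
roots_loop <- numeric(length(values))
for (i in seq_along(values)) {
  roots_loop[i] <- sqrt(values[i])
}

# Vectorized version: sqrt() already works on whole vectors
roots_vec <- sqrt(values)

# vapply() applies a function to each element and checks the result type
roots_apply <- vapply(values, sqrt, numeric(1))

identical(roots_loop, roots_vec)
```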

+

The basic structure of a for() loop is:

+
+

R +

+
for (iterator in set of values) {
+  do a thing
+}
+
+

For example:

+
+

R +

+
+for (i in 1:10) {
+  print(i)
+}
+
+
+

OUTPUT +

+
[1] 1
+[1] 2
+[1] 3
+[1] 4
+[1] 5
+[1] 6
+[1] 7
+[1] 8
+[1] 9
+[1] 10
+
+

The 1:10 bit creates a vector on the fly; you can +iterate over any other vector as well.

+

We can use a for() loop nested within another +for() loop to iterate over two things at once.

+
+

R +

+
+for (i in 1:5) {
+  for (j in c('a', 'b', 'c', 'd', 'e')) {
+    print(paste(i,j))
+  }
+}
+
+
+

OUTPUT +

+
[1] "1 a"
+[1] "1 b"
+[1] "1 c"
+[1] "1 d"
+[1] "1 e"
+[1] "2 a"
+[1] "2 b"
+[1] "2 c"
+[1] "2 d"
+[1] "2 e"
+[1] "3 a"
+[1] "3 b"
+[1] "3 c"
+[1] "3 d"
+[1] "3 e"
+[1] "4 a"
+[1] "4 b"
+[1] "4 c"
+[1] "4 d"
+[1] "4 e"
+[1] "5 a"
+[1] "5 b"
+[1] "5 c"
+[1] "5 d"
+[1] "5 e"
+
+

We notice in the output that when the first index (i) is +set to 1, the second index (j) iterates through its full +set of indices. Once the indices of j have been iterated +through, then i is incremented. This process continues +until the last index has been used for each for() loop.

+

Rather than printing the results, we could write the loop output to a +new object.

+
+

R +

+
+output_vector <- c()
+for (i in 1:5) {
+  for (j in c('a', 'b', 'c', 'd', 'e')) {
+    temp_output <- paste(i, j)
+    output_vector <- c(output_vector, temp_output)
+  }
+}
+output_vector
+
+
+

OUTPUT +

+
 [1] "1 a" "1 b" "1 c" "1 d" "1 e" "2 a" "2 b" "2 c" "2 d" "2 e" "3 a" "3 b"
+[13] "3 c" "3 d" "3 e" "4 a" "4 b" "4 c" "4 d" "4 e" "5 a" "5 b" "5 c" "5 d"
+[25] "5 e"
+
+

This approach can be useful, but ‘growing your results’ (building the +result object incrementally) is computationally inefficient, so avoid it +when you are iterating through a lot of values.

+
+
+ +
+
+

Tip: don’t grow your results +

+
+

One of the biggest things that trips up novices and experienced R users alike is building a results object (vector, list, matrix, data frame) as your for loop progresses. Computers are very bad at handling this, so your calculations can very quickly slow to a crawl. It’s much better to define an empty results object of appropriate dimensions beforehand, rather than initializing an empty object without dimensions. So if you know the end result will be stored in a matrix, create an empty matrix with 5 rows and 5 columns, then at each iteration store the results in the appropriate location.

+
+
+
+

A better way is to define your (empty) output object before filling +in the values. For this example, it looks more involved, but is still +more efficient.

+
+

R +

+
+output_matrix <- matrix(nrow = 5, ncol = 5)
+j_vector <- c('a', 'b', 'c', 'd', 'e')
+for (i in 1:5) {
+  for (j in 1:5) {
+    temp_j_value <- j_vector[j]
+    temp_output <- paste(i, temp_j_value)
+    output_matrix[i, j] <- temp_output
+  }
+}
+output_vector2 <- as.vector(output_matrix)
+output_vector2
+
+
+

OUTPUT +

+
 [1] "1 a" "2 a" "3 a" "4 a" "5 a" "1 b" "2 b" "3 b" "4 b" "5 b" "1 c" "2 c"
+[13] "3 c" "4 c" "5 c" "1 d" "2 d" "3 d" "4 d" "5 d" "1 e" "2 e" "3 e" "4 e"
+[25] "5 e"
+
+
+
+ +
+
+

Tip: While loops +

+
+

Sometimes you will find yourself needing to repeat an operation as +long as a certain condition is met. You can do this with a +while() loop.

+
+

R +

+
while(this condition is true){
+  do a thing
+}
+
+

R will interpret a condition being met as “TRUE”.

+

As an example, here’s a while loop that generates random numbers from +a uniform distribution (the runif() function) between 0 and +1 until it gets one that’s less than 0.1.

+
+

R +

+
+z <- 1
+while(z > 0.1){
+  z <- runif(1)
+  cat(z, "\n")
+}
+
+

while() loops will not always be appropriate. You have +to be particularly careful that you don’t end up stuck in an infinite +loop because your condition is always met and hence the while statement +never terminates.
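One common safeguard, sketched below, is to add an iteration counter with an upper limit. The names `max_iter` and `n_iter` and the limit of 1000 are assumptions made for this example:

```r
set.seed(42)      # make the random draws reproducible
max_iter <- 1000  # hypothetical safety limit
n_iter <- 0
z <- 1

# Stop when the target is reached or when the limit is hit
while (z > 0.1 && n_iter < max_iter) {
  z <- runif(1)
  n_iter <- n_iter + 1
}

n_iter
```

Here && is appropriate because both sides of the condition are single logical values.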

+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

Compare the objects output_vector and +output_vector2. Are they the same? If not, why not? How +would you change the last block of code to make +output_vector2 the same as output_vector?

+
+
+
+
+
+ +
+
+

We can check whether the two vectors are identical using the +all() function:

+
+

R +

+
+all(output_vector == output_vector2)
+
+

However, all the elements of output_vector can be found +in output_vector2:

+
+

R +

+
+all(output_vector %in% output_vector2)
+
+

and vice versa:

+
+

R +

+
+all(output_vector2 %in% output_vector)
+
+

Therefore, the elements in output_vector and output_vector2 are just sorted in a different order. This is because as.vector() outputs the elements of an input matrix column by column. Taking a look at output_matrix, we can see that we want its elements by rows. The solution is to transpose the output_matrix. We can do it either by calling the transpose function t() or by inputting the elements in the right order. The first solution requires changing the original

+
+

R +

+
+output_vector2 <- as.vector(output_matrix)
+
+

into

+
+

R +

+
+output_vector2 <- as.vector(t(output_matrix))
+
+

The second solution requires changing

+
+

R +

+
+output_matrix[i, j] <- temp_output
+
+

into

+
+

R +

+
+output_matrix[j, i] <- temp_output
+
+
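The effect of transposing before flattening is easier to see on a small numeric matrix. The matrix `m` below is made up for this sketch:

```r
m <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m

# as.vector() reads down each column in turn
as.vector(m)

# Transposing first makes it read along the original rows
as.vector(t(m))
```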
+
+
+
+
+
+ +
+
+

Challenge 3 +

+
+

Write a script that loops through the gapminder data by +continent and prints out whether the mean life expectancy is smaller or +larger than 50 years.

+
+
+
+
+
+ +
+
+

Step 1: We want to make sure we can extract all the unique values of the continent vector.

+
+

R +

+
+gapminder <- read.csv("data/gapminder_data.csv")
+unique(gapminder$continent)
+
+

Step 2: We also need to loop over each of these +continents and calculate the average life expectancy for each +subset of data. We can do that as follows:

+
  1. Loop over each of the unique values of ‘continent’
  2. For each value of continent, create a temporary variable storing that subset
  3. Return the calculated life expectancy to the user by printing the output:
+

R +

+
+for (iContinent in unique(gapminder$continent)) {
+  tmp <- gapminder[gapminder$continent == iContinent, ]
+  cat(iContinent, mean(tmp$lifeExp, na.rm = TRUE), "\n")
+  rm(tmp)
+}
+
+
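The same subset-then-summarise pattern can be tried without the CSV file. This is only a sketch on a made-up data frame: toy, its values, and continent_means are assumptions for illustration and do not come from the gapminder data:

```r
# Hypothetical toy data with the same column names as gapminder.
toy <- data.frame(
  continent = c("Asia", "Asia", "Europe", "Europe"),
  lifeExp = c(40, 50, 70, 80)
)

# Collect one mean per continent instead of printing it.
continent_means <- c()
for (iContinent in unique(toy$continent)) {
  tmp <- toy[toy$continent == iContinent, ]   # rows for this continent only
  continent_means[iContinent] <- mean(tmp$lifeExp, na.rm = TRUE)
}
continent_means
```

Storing the results in a named vector rather than printing them makes it easier to check the loop before running it on the full dataset.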

Step 3: The exercise asks us to report whether the average life expectancy is smaller or larger than 50 years. So we need to add an if() condition before printing, which evaluates whether the calculated average life expectancy is above or below a threshold, and prints an output conditional on the result. We need to amend (3) from above:

+

3a. If the calculated life expectancy is less than some threshold (50 +years), return the continent and a statement that life expectancy is +less than threshold, otherwise return the continent and a statement that +life expectancy is greater than threshold:

+
+

R +

+
+thresholdValue <- 50
+
+for (iContinent in unique(gapminder$continent)) {
+   tmp <- mean(gapminder[gapminder$continent == iContinent, "lifeExp"])
+
+   if (tmp < thresholdValue){
+       cat("Average Life Expectancy in", iContinent, "is less than", thresholdValue, "\n")
+   } else {
+       cat("Average Life Expectancy in", iContinent, "is greater than", thresholdValue, "\n")
+   } # end if else condition
+   rm(tmp)
+} # end for loop
+
+
+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

Modify the script from Challenge 3 to loop over each country. This +time print out whether the life expectancy is smaller than 50, between +50 and 70, or greater than 70.

+
+
+
+
+
+ +
+
+

We modify our solution to Challenge 3 by now adding two thresholds, +lowerThreshold and upperThreshold and +extending our if-else statements:

+
+

R +

+
+ lowerThreshold <- 50
+ upperThreshold <- 70
+
+for (iCountry in unique(gapminder$country)) {
+    tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"])
+
+    if(tmp < lowerThreshold) {
+        cat("Average Life Expectancy in", iCountry, "is less than", lowerThreshold, "\n")
+    } else if(tmp >= lowerThreshold && tmp < upperThreshold) {
+        cat("Average Life Expectancy in", iCountry, "is between", lowerThreshold, "and", upperThreshold, "\n")
+    } else {
+        cat("Average Life Expectancy in", iCountry, "is greater than", upperThreshold, "\n")
+    }
+    rm(tmp)
+}
+
+
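The three-way branching can also be checked in isolation. The helper below is a hypothetical sketch (classify_life_exp and its labels are not part of the lesson); it assumes that a value exactly on the lower threshold counts as "middle" and a value exactly on the upper threshold counts as "high":

```r
# Hypothetical helper: classify one mean life expectancy value.
classify_life_exp <- function(value, lower = 50, upper = 70) {
  if (value < lower) {
    "low"
  } else if (value < upper) {
    # Reached only when value >= lower, so no second comparison is needed.
    "middle"
  } else {
    "high"
  }
}

classify_life_exp(45)
classify_life_exp(50)
classify_life_exp(70)
```

Because each else if branch is only reached when the earlier conditions were FALSE, the lower bound does not have to be tested again in the second branch.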
+
+
+
+
+
+ +
+
+

Challenge 5 - Advanced +

+
+

Write a script that loops over each country in the +gapminder dataset, tests whether the country starts with a +‘B’, and graphs life expectancy against time as a line graph if the mean +life expectancy is under 50 years.

+
+
+
+
+
+ +
+
+

We will use the grep() command that was introduced in the Unix Shell lesson to find countries that start with “B.” Let’s understand how to do this first. Following from the Unix shell section we may be tempted to try the following:

+
+

R +

+
+grep("^B", unique(gapminder$country))
+
+

But when we evaluate this command it returns the indices of the unique country values that start with “B.” To get the values themselves, we must add the value=TRUE option to the grep() command:

+
+

R +

+
+grep("^B", unique(gapminder$country), value = TRUE)
+
+
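The difference between the two calls is easy to see on a short vector. This sketch uses a made-up vector, toy_countries, rather than the gapminder data:

```r
# Hypothetical vector of names; only the first two start with "B".
toy_countries <- c("Belgium", "Brazil", "Cuba", "Gabon")

# Default: grep() returns the positions of the matching elements.
match_index <- grep("^B", toy_countries)

# value = TRUE: grep() returns the matching elements themselves.
match_value <- grep("^B", toy_countries, value = TRUE)

match_index
match_value
```

The "^" anchors the pattern to the start of each string, so "Gabon" is not matched even though it contains a "b".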

We will now store these countries in a variable called +candidateCountries, and then loop over each entry in the variable. +Inside the loop, we evaluate the average life expectancy for each +country, and if the average life expectancy is less than 50 we use +base-plot to plot the evolution of average life expectancy using +with() and subset():

+
+

R +

+
+thresholdValue <- 50
+candidateCountries <- grep("^B", unique(gapminder$country), value = TRUE)
+
+for (iCountry in candidateCountries) {
+    tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"])
+
+    if (tmp < thresholdValue) {
+        cat("Average Life Expectancy in", iCountry, "is less than", thresholdValue, "plotting life expectancy graph... \n")
+
+        with(subset(gapminder, country == iCountry),
+                plot(year, lifeExp,
+                     type = "o",
+                     main = paste("Life Expectancy in", iCountry, "over time"),
+                     ylab = "Life Expectancy",
+                     xlab = "Year"
+                     ) # end plot
+             ) # end with
+    } # end if
+    rm(tmp)
+} # end for loop
+
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
  • Use if and else to make choices.
  • +
  • Use for to repeat operations.
  • +
+
+
+
+
+ + +
+
+
+ +
+
+ + diff --git a/08-plot-ggplot2.html b/08-plot-ggplot2.html new file mode 100644 index 000000000..c9592d5af --- /dev/null +++ b/08-plot-ggplot2.html @@ -0,0 +1,1105 @@ + +R for Reproducible Scientific Analysis: Creating Publication-Quality Graphics with ggplot2 +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Creating Publication-Quality Graphics with ggplot2

+

Last updated on 2023-10-26

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I create publication-quality graphics in R?
  • +
+
+
+
+
+
+

Objectives

+
  • To be able to use ggplot2 to generate publication-quality +graphics.
  • +
  • To apply geometry, aesthetic, and statistics layers to a ggplot +plot.
  • +
  • To manipulate the aesthetics of a plot using different colors, +shapes, and lines.
  • +
  • To improve data visualization through transforming scales and +paneling by group.
  • +
  • To save a plot created with ggplot to disk.
  • +
+
+
+
+
+

Plotting our data is one of the best ways to quickly explore it and +the various relationships between variables.

+

There are three main plotting systems in R, the base plotting +system, the lattice +package, and the ggplot2 +package.

+

Today we’ll be learning about the ggplot2 package, because it is the +most effective for creating publication-quality graphics.

+

ggplot2 is built on the grammar of graphics, the idea that any plot +can be built from the same set of components: a data +set, mapping aesthetics, and graphical +layers:

+
  • Data sets are the data that you, the user, +provide.

  • +
  • Mapping aesthetics are what connect the data to +the graphics. They tell ggplot2 how to use your data to affect how the +graph looks, such as changing what is plotted on the X or Y axis, or the +size or color of different data points.

  • +
  • Layers are the actual graphical output from +ggplot2. Layers determine what kinds of plot are shown (scatterplot, +histogram, etc.), the coordinate system used (rectangular, polar, +others), and other important aspects of the plot. The idea of layers of +graphics may be familiar to you if you have used image editing programs +like Photoshop, Illustrator, or Inkscape.

  • +

Let’s start off building an example using the gapminder data from +earlier. The most basic function is ggplot, which lets R +know that we’re creating a new plot. Any of the arguments we give the +ggplot function are the global options for the +plot: they apply to all layers on the plot.

+
+

R +

+
+library("ggplot2")
+ggplot(data = gapminder)
+
+
Blank plot, before adding any mapping aesthetics to ggplot().

Here we called ggplot and told it what data we want to +show on our figure. This is not enough information for +ggplot to actually draw anything. It only creates a blank +slate for other elements to be added to.

+

Now we’re going to add in the mapping aesthetics +using the aes function. aes tells +ggplot how variables in the data map to +aesthetic properties of the figure, such as which columns of +the data should be used for the x and +y locations.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
+
+
Plotting area with axes for a scatter plot of life expectancy vs GDP, with no data points visible.

Here we told ggplot we want to plot the “gdpPercap” column of the gapminder data frame on the x-axis, and the “lifeExp” column on the y-axis. Notice that we didn’t need to explicitly pass aes these columns (e.g. x = gapminder[, "gdpPercap"]). This is because ggplot is smart enough to know to look in the data for that column!

+

The final part of making our plot is to tell ggplot how +we want to visually represent the data. We do this by adding a new +layer to the plot using one of the +geom functions.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point()
+
+
Scatter plot of life expectancy vs GDP per capita, now showing the data points.

Here we used geom_point, which tells ggplot +we want to visually represent the relationship between +x and y as a scatterplot of +points.

+
+
+ +
+
+

Challenge 1 +

+
+

Modify the example so that the figure shows how life expectancy has +changed over time:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point()
+
+

Hint: the gapminder dataset has a column called “year”, which should +appear on the x-axis.

+
+
+
+
+
+ +
+
+

Here is one possible solution:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) + geom_point()
+
+
Binned scatterplot of life expectancy versus year showing how life expectancy has increased over time
+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

In the previous examples and challenge we’ve used the +aes function to tell the scatterplot geom +about the x and y locations of each +point. Another aesthetic property we can modify is the point +color. Modify the code from the previous challenge to +color the points by the “continent” column. What trends +do you see in the data? Are they what you expected?

+
+
+
+
+
+ +
+
+

The solution presented below adds color=continent to the +call of the aes function. The general trend seems to +indicate an increased life expectancy over the years. On continents with +stronger economies we find a longer life expectancy.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_point()
+
+
Binned scatterplot of life expectancy vs year with color-coded continents showing value of 'aes' function
+
+
+
+

Layers +

+

Using a scatterplot probably isn’t the best for visualizing change +over time. Instead, let’s tell ggplot to visualize the data +as a line plot:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, color=continent)) +
+  geom_line()
+
+

Instead of adding a geom_point layer, we’ve added a +geom_line layer.

+

However, the result doesn’t look quite as we might have expected: it +seems to be jumping around a lot in each continent. Let’s try to +separate the data by country, plotting one line for each country:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
+  geom_line()
+
+

We’ve added the group aesthetic, which +tells ggplot to draw a line for each country.

+

But what if we want to visualize both lines and points on the plot? +We can add another layer to the plot:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
+  geom_line() + geom_point()
+
+

It’s important to note that each layer is drawn on top of the +previous layer. In this example, the points have been drawn on top +of the lines. Here’s a demonstration:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
+  geom_line(mapping = aes(color=continent)) + geom_point()
+
+

In this example, the aesthetic mapping of +color has been moved from the global plot options in +ggplot to the geom_line layer so it no longer +applies to the points. Now we can clearly see that the points are drawn +on top of the lines.

+
+
+ +
+
+

Tip: Setting an aesthetic to a value instead +of a mapping +

+
+

So far, we’ve seen how to use an aesthetic (such as +color) as a mapping to a variable in the data. +For example, when we use +geom_line(mapping = aes(color=continent)), ggplot will give +a different color to each continent. But what if we want to change the +color of all lines to blue? You may think that +geom_line(mapping = aes(color="blue")) should work, but it +doesn’t. Since we don’t want to create a mapping to a specific variable, +we can move the color specification outside of the aes() +function, like this: geom_line(color="blue").
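A minimal sketch of the two cases side by side, assuming the ggplot2 package is installed. The data frame toy and the object names p_mapped and p_set are made up for illustration:

```r
library(ggplot2)

# Hypothetical toy data: two groups measured in three years.
toy <- data.frame(
  year = rep(c(2000, 2005, 2010), times = 2),
  value = c(1, 2, 3, 2, 3, 5),
  group = rep(c("a", "b"), each = 3)
)

# Mapping: color depends on a column, so each group gets its own color.
p_mapped <- ggplot(toy, aes(x = year, y = value, group = group)) +
  geom_line(mapping = aes(color = group))

# Setting: color is a fixed value outside aes(), so every line is blue.
p_set <- ggplot(toy, aes(x = year, y = value, group = group)) +
  geom_line(color = "blue")
```

Printing p_mapped should show a legend for group, while p_set has no color legend because no variable is mapped to color.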

+
+
+
+
+
+ +
+
+

Challenge 3 +

+
+

Switch the order of the point and line layers from the previous +example. What happened?

+
+
+
+
+
+ +
+
+

The lines now get drawn over the points!

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
+ geom_point() + geom_line(mapping = aes(color=continent))
+
+
Line plot of life expectancy vs year with one line per country, colored by continent. The lines are now drawn on top of the black points.
+
+
+
+

Transformations and statistics +

+

ggplot2 also makes it easy to overlay statistical models over the +data. To demonstrate we’ll go back to our first example:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point()
+
+

Currently it’s hard to see the relationship between the points due to some strong outliers in GDP per capita. We can change the scale of units on the x axis using the scale functions. These control the mapping between the data values and visual values of an aesthetic. We can also modify the transparency of the points, using the alpha argument, which is especially helpful when you have a large amount of data which is very clustered.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10()
+
+
Scatterplot of GDP vs life expectancy showing logarithmic x-axis data spread

The scale_x_log10 function applied a log10 transformation to the x-axis scale, so that each multiple of 10 is evenly spaced from left to right. For example, a GDP per capita of 1,000 is the same horizontal distance away from a value of 10,000 as the 10,000 value is from 100,000. This helps to visualize the spread of the data along the x-axis.
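The even spacing can be verified with plain arithmetic. This sketch uses only base R; gdp_values and log_positions are illustrative names:

```r
# Three GDP per capita values, each ten times the previous one.
gdp_values <- c(1000, 10000, 100000)

# Positions of those values on a log10 axis.
log_positions <- log10(gdp_values)

# Distance between neighbouring positions: the same for both steps.
spacing <- diff(log_positions)

log_positions
spacing
```

On the log10 scale the values sit at 3, 4 and 5, so each tenfold increase moves a point by exactly one unit.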

+
+
+ +
+
+

Tip Reminder: Setting an aesthetic to a value +instead of a mapping +

+
+

Notice that we used geom_point(alpha = 0.5). As the +previous tip mentioned, using a setting outside of the +aes() function will cause this value to be used for all +points, which is what we want in this case. But just like any other +aesthetic setting, alpha can also be mapped to a variable in +the data. For example, we can give a different transparency to each +continent with +geom_point(mapping = aes(alpha = continent)).

+
+
+
+

We can fit a simple relationship to the data by adding another layer, +geom_smooth:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm")
+
+
+

OUTPUT +

+
`geom_smooth()` using formula = 'y ~ x'
+
+
Scatter plot of life expectancy vs GDP per capita with a blue trend line summarising the relationship between variables, and gray shaded area indicating 95% confidence intervals for that trend line.

We can make the line thicker by setting the +size aesthetic in the geom_smooth +layer:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm", size=1.5)
+
+
+

WARNING +

+
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
+ℹ Please use `linewidth` instead.
+This warning is displayed once every 8 hours.
+Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
+generated.
+
+
+

OUTPUT +

+
`geom_smooth()` using formula = 'y ~ x'
+
+
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The blue trend line is slightly thicker than in the previous figure.

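There are two ways an aesthetic can be specified. Here we set the size aesthetic by passing it as an argument to geom_smooth. (As the warning above indicates, from ggplot2 3.4.0 onward the preferred argument for line thickness is linewidth; size still works for lines but is deprecated.) Previously in the lesson we’ve used the aes function to define a mapping between data variables and their visual representation.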

+
+
+ +
+
+

Challenge 4a +

+
+

Modify the color and size of the points on the point layer in the +previous example.

+

Hint: do not use the aes function.

+
+
+
+
+
+ +
+
+

Here is a possible solution: Notice that the color argument is supplied outside of the aes() function. This means that it applies to all data points on the graph and is not related to a specific variable.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+ geom_point(size=3, color="orange") + scale_x_log10() +
+ geom_smooth(method="lm", size=1.5)
+
+
+

OUTPUT +

+
`geom_smooth()` using formula = 'y ~ x'
+
+
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency.
+
+
+
+
+
+ +
+
+

Challenge 4b +

+
+

Modify your solution to Challenge 4a so that the points are now a +different shape and are colored by continent with new trendlines. Hint: +The color argument can be used inside the aesthetic.

+
+
+
+
+
+ +
+
+

Here is a possible solution: Notice that supplying the color argument inside the aes() function enables you to connect it to a certain variable. The shape argument, as you can see, modifies all data points the same way (it is outside the aes() call), while the color argument, which is placed inside the aes() call, modifies a point’s color based on its continent value.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
+ geom_point(size=3, shape=17) + scale_x_log10() +
+ geom_smooth(method="lm", size=1.5)
+
+
+

OUTPUT +

+
`geom_smooth()` using formula = 'y ~ x'
+
+
+
+
+
+

Multi-panel figures +

+

Earlier we visualized the change in life expectancy over time across +all countries in one plot. Alternatively, we can split this out over +multiple panels by adding a layer of facet panels.

+
+
+ +
+
+

Tip +

+
+

We start by making a subset of data including only countries located +in the Americas. This includes 25 countries, which will begin to clutter +the figure. Note that we apply a “theme” definition to rotate the x-axis +labels to maintain readability. Nearly everything in ggplot2 is +customizable.

+
+
+
+
+

R +

+
+americas <- gapminder[gapminder$continent == "Americas",]
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))
+
+

The facet_wrap layer took a “formula” as its argument, denoted by the tilde (~). This tells R to draw a panel for each unique value in the country column of the americas data frame.

+

Modifying text +

+

To clean this figure up for a publication we need to change some of +the text elements. The x-axis is too cluttered, and the y axis should +read “Life expectancy”, rather than the column name in the data +frame.

+

We can do this by adding a couple of different layers. The +theme layer controls the axis text, and overall text +size. Labels for the axes, plot title and any legend can be set using +the labs function. Legend titles are set using the same +names we used in the aes specification. Thus below the +color legend title is set using color = "Continent", while +the title of a fill legend would be set using +fill = "MyTitle".

+
+

R +

+
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_line() + facet_wrap( ~ country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+
+

Exporting the plot +

+

The ggsave() function allows you to export a plot +created with ggplot. You can specify the dimension and resolution of +your plot by adjusting the appropriate arguments (width, +height and dpi) to create high quality +graphics for publication. In order to save the plot from above, we first +assign it to a variable lifeExp_plot, then tell +ggsave to save that plot in png format to a +directory called results. (Make sure you have a +results/ folder in your working directory.)

+
+

R +

+
+lifeExp_plot <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_line() + facet_wrap( ~ country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+
+ggsave(filename = "results/lifeExp.png", plot = lifeExp_plot, width = 12, height = 10, dpi = 300, units = "cm")
+
+

There are two nice things about ggsave. First, it +defaults to the last plot, so if you omit the plot argument +it will automatically save the last plot you created with +ggplot. Secondly, it tries to determine the format you want +to save your plot in from the file extension you provide for the +filename (for example .png or .pdf). If you +need to, you can specify the format explicitly in the +device argument.

+

This is a taste of what you can do with ggplot2. RStudio provides a +really useful cheat +sheet of the different layers available, and more extensive +documentation is available on the ggplot2 website. All +RStudio cheat sheets can be found here. Finally, +if you have no idea how to change something, a quick Google search will +usually send you to a relevant question and answer on Stack Overflow +with reusable code to modify!

+
+
+ +
+
+

Challenge 5 +

+
+

Generate boxplots to compare life expectancy between the different +continents during the available years.

+

Advanced:

+
  • Rename y axis as Life Expectancy.
  • +
  • Remove x axis labels.
  • +
+
+
+
+
+ +
+
+

Here is a possible solution: xlab() and ylab() set labels for the x and y axes, respectively. The axis title, text and ticks are attributes of the theme and must be modified within a theme() call.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = continent, y = lifeExp, fill = continent)) +
+ geom_boxplot() + facet_wrap(~year) +
+ ylab("Life Expectancy") +
+ theme(axis.title.x=element_blank(),
+       axis.text.x = element_blank(),
+       axis.ticks.x = element_blank())
+
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
  • Use ggplot2 to create plots.
  • +
  • Think about graphics in layers: aesthetics, geometry, statistics, +scale transformation, and grouping.
  • +
+
+
+
+
+ + +
+
+
+ +
+
+ + diff --git a/09-vectorization.html b/09-vectorization.html new file mode 100644 index 000000000..663ee4ba0 --- /dev/null +++ b/09-vectorization.html @@ -0,0 +1,1020 @@ + +R for Reproducible Scientific Analysis: Vectorization +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Vectorization

+

Last updated on 2023-10-26

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I operate on all the elements of a vector at once?
  • +
+
+
+
+
+
+

Objectives

+
  • To understand vectorized operations in R.
  • +
+
+
+
+
+

Most of R’s functions are vectorized, meaning that the function will operate on all elements of a vector without needing to loop through and act on each element one at a time. This makes code more concise, easier to read, and less error-prone.

+
+

R +

+
+x <- 1:4
+x * 2
+
+
+

OUTPUT +

+
[1] 2 4 6 8
+
+

The multiplication happened to each element of the vector.

+

We can also add two vectors together:

+
+

R +

+
+y <- 6:9
+x + y
+
+
+

OUTPUT +

+
[1]  7  9 11 13
+
+

Each element of x was added to its corresponding element +of y:

+
+

R +

+
x:  1  2  3  4
+    +  +  +  +
+y:  6  7  8  9
+---------------
+    7  9 11 13
+
+

Here is how we would add two vectors together using a for loop:

+
+

R +

+
+output_vector <- c()
+for (i in 1:4) {
+  output_vector[i] <- x[i] + y[i]
+}
+output_vector
+
+
+

OUTPUT +

+
[1]  7  9 11 13
+
+

Compare this to the output using vectorised operations.

+
+

R +

+
+sum_xy <- x + y
+sum_xy
+
+
+

OUTPUT +

+
[1]  7  9 11 13
+
+
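The two approaches can be compared directly. This sketch repeats the definitions of x and y so that it runs on its own; loop_result and vectorized_result are illustrative names:

```r
x <- 1:4
y <- 6:9

# Element-by-element addition with an explicit loop.
loop_result <- c()
for (i in seq_along(x)) {
  loop_result[i] <- x[i] + y[i]
}

# The same addition as a single vectorized expression.
vectorized_result <- x + y

# TRUE when both vectors hold the same values with the same type.
identical(loop_result, vectorized_result)
```

The vectorized form gives the same answer in one line and leaves no loop index to get wrong.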
+
+ +
+
+

Challenge 1 +

+
+

Let’s try this on the pop column of the +gapminder dataset.

+

Make a new column in the gapminder data frame that +contains population in units of millions of people. Check the head or +tail of the data frame to make sure it worked.

+
+
+
+
+
+ +
+
+


+
+

R +

+
+gapminder$pop_millions <- gapminder$pop / 1e6
+head(gapminder)
+
+
+

OUTPUT +

+
      country year      pop continent lifeExp gdpPercap pop_millions
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453     8.425333
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530     9.240934
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007    10.267083
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971    11.537966
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811    13.079460
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134    14.880372
+
+
+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

On a single graph, plot population, in millions, against year, for +all countries. Do not worry about identifying which country is +which.

+

Repeat the exercise, graphing only for China, India, and Indonesia. +Again, do not worry about which is which.

+
+
+
+
+
+ +
+
+

Refresh your plotting skills by plotting population in millions +against year.

+
+

R +

+
+ggplot(gapminder, aes(x = year, y = pop_millions)) +
+ geom_point()
+
+
Scatter plot showing populations in the millions against the year for all countries, countries are not labeled.
+

R +

+
+countryset <- c("China","India","Indonesia")
+ggplot(gapminder[gapminder$country %in% countryset,],
+       aes(x = year, y = pop_millions)) +
+  geom_point()
+
+
Scatter plot showing populations in the millions against the year for China, India, and Indonesia, countries are not labeled.
+
+
+
+

Comparison operators, logical operators, and many functions are also +vectorized:

+

Comparison operators

+
+

R +

+
+x > 2
+
+
+

OUTPUT +

+
[1] FALSE FALSE  TRUE  TRUE
+
+

Logical operators

+
+

R +

+
+a <- x > 3  # or, for clarity, a <- (x > 3)
+a
+
+
+

OUTPUT +

+
[1] FALSE FALSE FALSE  TRUE
+
+
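The logical operators & (and), | (or) and ! (not) also work element by element. A small sketch, with in_middle, at_edges and not_large as illustrative names:

```r
x <- 1:4

# TRUE where an element is greater than 1 AND less than 4.
in_middle <- (x > 1) & (x < 4)

# TRUE where an element is less than 2 OR greater than 3.
at_edges <- (x < 2) | (x > 3)

# ! flips every element of a logical vector.
not_large <- !(x > 3)

in_middle
at_edges
not_large
```

Note that these are the single-character forms; && and || are meant for single TRUE/FALSE values, as in the if() conditions used earlier.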
+
+ +
+
+

Tip: some useful functions for logical +vectors +

+
+

any() will return TRUE if any +element of a vector is TRUE.
all() will return TRUE if all +elements of a vector are TRUE.
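A short sketch of both functions applied to comparisons on the same vector; the result names are illustrative:

```r
x <- 1:4

# Is at least one element greater than 3?
any_above_three <- any(x > 3)

# Are all elements greater than 3?
all_above_three <- all(x > 3)

# Are all elements greater than 0?
all_positive <- all(x > 0)

any_above_three
all_above_three
all_positive
```

Each call collapses a logical vector into a single TRUE or FALSE, which makes any() and all() convenient inside if() conditions.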

+
+
+
+

Most functions also operate element-wise on vectors:

+

Functions

+
+

R +

+
+x <- 1:4
+log(x)
+
+
+

OUTPUT +

+
[1] 0.0000000 0.6931472 1.0986123 1.3862944
+
+

Vectorized operations work element-wise on matrices:

+
+

R +

+
+m <- matrix(1:12, nrow=3, ncol=4)
+m * -1
+
+
+

OUTPUT +

+
     [,1] [,2] [,3] [,4]
+[1,]   -1   -4   -7  -10
+[2,]   -2   -5   -8  -11
+[3,]   -3   -6   -9  -12
+
+
+
+ +
+
+

Tip: element-wise vs. matrix +multiplication +

+
+

Very important: the operator * gives you element-wise +multiplication! To do matrix multiplication, we need to use the +%*% operator:

+
+

R +

+
+m %*% matrix(1, nrow=4, ncol=1)
+
+
+

OUTPUT +

+
     [,1]
+[1,]   22
+[2,]   26
+[3,]   30
+
+
+

R +

+
+matrix(1:4, nrow=1) %*% matrix(1:4, ncol=1)
+
+
+

OUTPUT +

+
     [,1]
+[1,]   30
+
+

For more on matrix algebra, see the Quick-R +reference guide

+
+
+
+
+
+ +
+
+

Challenge 3 +

+
+

Given the following matrix:

+
+

R +

+
+m <- matrix(1:12, nrow=3, ncol=4)
+m
+
+
+

OUTPUT +

+
     [,1] [,2] [,3] [,4]
+[1,]    1    4    7   10
+[2,]    2    5    8   11
+[3,]    3    6    9   12
+
+

Write down what you think will happen when you run:

+
  1. m ^ -1
  2. m * c(1, 0, -1)
  3. m > c(0, 20)
  4. m * c(1, 0, -1, 2)

Did you get the output you expected? If not, ask a helper!

+
+
+
+
+
+ +
+
+

Given the following matrix:

+
+

R +

+
+m <- matrix(1:12, nrow=3, ncol=4)
+m
+
+
+

OUTPUT +

+
     [,1] [,2] [,3] [,4]
+[1,]    1    4    7   10
+[2,]    2    5    8   11
+[3,]    3    6    9   12
+
+

Write down what you think will happen when you run:

+
  1. m ^ -1
+

OUTPUT +

+
          [,1]      [,2]      [,3]       [,4]
+[1,] 1.0000000 0.2500000 0.1428571 0.10000000
+[2,] 0.5000000 0.2000000 0.1250000 0.09090909
+[3,] 0.3333333 0.1666667 0.1111111 0.08333333
+
+
  2. m * c(1, 0, -1)
+

OUTPUT +

+
     [,1] [,2] [,3] [,4]
+[1,]    1    4    7   10
+[2,]    0    0    0    0
+[3,]   -3   -6   -9  -12
+
+
  3. m > c(0, 20)
+

OUTPUT +

+
      [,1]  [,2]  [,3]  [,4]
+[1,]  TRUE FALSE  TRUE FALSE
+[2,] FALSE  TRUE FALSE  TRUE
+[3,]  TRUE FALSE  TRUE FALSE
+
+
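The fourth expression from the challenge, m * c(1, 0, -1, 2), has no output shown here. The sketch below works it out; the names multiplier and result are illustrative. Since the matrix has 12 elements and the vector has 4, the vector should be recycled three times down the columns without a warning:

```r
m <- matrix(1:12, nrow = 3, ncol = 4)

# The vector is recycled down the columns: it lines up with the
# elements 1, 2, 3, 4, then 5, 6, 7, 8, then 9, 10, 11, 12.
multiplier <- c(1, 0, -1, 2)
result <- m * multiplier
result
```

Because the vector length (4) differs from the number of rows (3), the multipliers do not line up with rows as they did in m * c(1, 0, -1). The first column becomes 1, 0, -3 and the second becomes 8, 5, 0.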
+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

We’re interested in looking at the sum of the following sequence of +fractions:

+
+

R +

+
+ x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2)
+
+

This would be tedious to type out, and impossible for high values of +n. Use vectorisation to compute x when n=100. What is the sum when +n=10,000?

+
+
+
+
+
+ +
+
+

We’re interested in looking at the sum of the following sequence of +fractions:

+
+

R +

+
+ x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2)
+
+

This would be tedious to type out, and impossible for high values of +n. Can you use vectorisation to compute x, when n=100? How about when +n=10,000?

+
+

R +

+
+sum(1/(1:100)^2)
+
+
+

OUTPUT +

+
[1] 1.634984
+
+
+

R +

+
+sum(1/(1:1e04)^2)
+
+
+

OUTPUT +

+
[1] 1.644834
+
+
+

R +

+
+n <- 10000
+sum(1/(1:n)^2)
+
+
+

OUTPUT +

+
[1] 1.644834
+
+

We can also obtain the same results using a function:

+
+

R +

+
+inverse_sum_of_squares <- function(n) {
+  sum(1/(1:n)^2)
+}
+inverse_sum_of_squares(100)
+
+
+

OUTPUT +

+
[1] 1.634984
+
+
+

R +

+
+inverse_sum_of_squares(10000)
+
+
+

OUTPUT +

+
[1] 1.644834
+
+
+

R +

+
+n <- 10000
+inverse_sum_of_squares(n)
+
+
+

OUTPUT +

+
[1] 1.644834
+
+
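As a sanity check on these numbers, the infinite version of this sum is known to equal pi^2 / 6 (the Basel problem). That fact comes from outside the lesson. The sketch below compares the partial sum with that limit; partial_sum, limit_value and gap are illustrative names:

```r
n <- 10000

# Vectorized partial sum of 1/k^2 for k = 1, ..., n.
partial_sum <- sum(1 / (1:n)^2)

# Value of the infinite series, used here only for comparison.
limit_value <- pi^2 / 6

# The remaining gap shrinks roughly like 1/n as n grows.
gap <- limit_value - partial_sum
gap
```

The gap is positive and close to 1/n, which suggests that increasing n further changes the sum only in later decimal places.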
+
+
+
+
+
+ +
+
+

Tip: Operations on vectors of unequal +length +

+
+

Operations can also be performed on vectors of unequal length, through a process known as recycling. This process automatically repeats the shorter vector until it matches the length of the longer vector. R will provide a warning if the length of the longer vector is not a multiple of the length of the shorter vector.

+
+

R +

+
+x <- c(1, 2, 3)
+y <- c(1, 2, 3, 4, 5, 6, 7)
+x + y
+
+
+

WARNING +

+
Warning in x + y: longer object length is not a multiple of shorter object
+length
+
+
+

OUTPUT +

+
[1] 2 4 6 5 7 9 8
+
+

Vector x was recycled to match the length of vector y:

+
+

R +

+
x:  1  2  3  1  2  3  1
+    +  +  +  +  +  +  +
+y:  1  2  3  4  5  6  7
+-----------------------
+    2  4  6  5  7  9  8
+
+
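For contrast, here is a sketch of the case where the longer length is an exact multiple of the shorter one, so R recycles silently. The names short_vec, long_vec and recycled_sum are illustrative:

```r
short_vec <- c(1, 2)
long_vec <- c(1, 2, 3, 4, 5, 6)

# short_vec is repeated three times: 1 2 1 2 1 2
recycled_sum <- short_vec + long_vec
recycled_sum
```

No warning is issued here, which is convenient but can also hide mistakes, so it is worth checking vector lengths when a result looks surprising.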
+
+
+
+
+ +
+
+

Keypoints +

+
+
  • Use vectorized operations instead of loops.
  • +
+
+
+ + + +
+
+ + +
+
+
+ +
+
+ + diff --git a/10-functions.html b/10-functions.html new file mode 100644 index 000000000..fb5c0fc70 --- /dev/null +++ b/10-functions.html @@ -0,0 +1,1221 @@ + +R for Reproducible Scientific Analysis: Functions Explained +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Functions Explained

+

Last updated on 2023-10-26

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I write a new function in R?
  • +
+
+
+
+
+
+

Objectives

+
  • Define a function that takes arguments.
  • Return a value from a function.
  • Check argument conditions with stopifnot() in functions.
  • Test a function.
  • Set default values for function arguments.
  • Explain why we should divide programs into small, single-purpose functions.
+
+
+
+
+

If we only had one data set to analyze, it would probably be faster +to load the file into a spreadsheet and use that to plot simple +statistics. However, the gapminder data is updated periodically, and we +may want to pull in that new information later and re-run our analysis +again. We may also obtain similar data from a different source in the +future.

+

In this lesson, we’ll learn how to write a function so that we can +repeat several operations with a single command.

+
+
+ +
+
+

What is a function? +

+
+

Functions gather a sequence of operations into a whole, preserving it +for ongoing use. Functions provide:

+
  • a name we can remember and invoke it by
  • relief from the need to remember the individual operations
  • a defined set of inputs and expected outputs
  • rich connections to the larger programming environment

As the basic building block of most programming languages, +user-defined functions constitute “programming” as much as any single +abstraction can. If you have written a function, you are a computer +programmer.

+
+
+
+

Defining a function +

+

Let’s open a new R script file in the functions/ +directory and call it functions-lesson.R.

+

The general structure of a function is:

+
+

R +

+
+my_function <- function(parameters) {
+  # perform action
+  # return value
+}
+
+

Let’s define a function fahr_to_kelvin() that converts +temperatures from Fahrenheit to Kelvin:

+
+

R +

+
+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+

We define fahr_to_kelvin() by assigning it to the output of function. The list of argument names is contained within parentheses. Next, the body of the function (the statements that are executed when it runs) is contained within curly braces ({}). The statements in the body are indented by two spaces. This makes the code easier to read but does not affect how the code operates.

+

It is useful to think of creating functions like writing a recipe in a cookbook. First you define the “ingredients” that your function needs. In this case, we only need one ingredient to use our function: “temp”. After we list our ingredients, we say what we will do with them. In this case, we take our ingredient and apply a set of mathematical operators to it.

+

When we call the function, the values we pass to it as arguments are +assigned to those variables so that we can use them inside the function. +Inside the function, we use a return statement to send a +result back to whoever asked for it.

+
+
+ +
+
+

Tip +

+
+

In R, the return statement is not required: a function automatically returns the value of the last expression evaluated in its body. For clarity, however, we will write the return statement explicitly.
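
A minimal sketch of a function without an explicit return statement is shown here. The name `fahr_to_kelvin_implicit` is hypothetical and chosen only so that it does not overwrite the lesson's own function.

```r
fahr_to_kelvin_implicit <- function(temp) {
  # No return(): the value of this last expression is what the function returns
  ((temp - 32) * (5 / 9)) + 273.15
}

implicit_result <- fahr_to_kelvin_implicit(32)
implicit_result
```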

+
+
+
+

Let’s try running our function. Calling our own function is no +different from calling any other function:

+
+

R +

+
+# freezing point of water
+fahr_to_kelvin(32)
+
+
+

OUTPUT +

+
[1] 273.15
+
+
+

R +

+
+# boiling point of water
+fahr_to_kelvin(212)
+
+
+

OUTPUT +

+
[1] 373.15
+
+
+
+ +
+
+

Challenge 1 +

+
+

Write a function called kelvin_to_celsius() that takes a +temperature in Kelvin and returns that temperature in Celsius.

+

Hint: To convert from Kelvin to Celsius you subtract 273.15

+
+
+
+
+
+ +
+
+

Write a function called kelvin_to_celsius that takes a +temperature in Kelvin and returns that temperature in Celsius

+
+

R +

+
+kelvin_to_celsius <- function(temp) {
+  celsius <- temp - 273.15
+  return(celsius)
+}
+
+
+
+
+
+

Combining functions +

+

The real power of functions comes from mixing, matching and combining +them into ever-larger chunks to get the effect we want.

+

Let’s define two functions that will convert temperature from +Fahrenheit to Kelvin, and Kelvin to Celsius:

+
+

R +

+
+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+kelvin_to_celsius <- function(temp) {
+  celsius <- temp - 273.15
+  return(celsius)
+}
+
+
+
+ +
+
+

Challenge 2 +

+
+

Define the function to convert directly from Fahrenheit to Celsius, +by reusing the two functions above (or using your own functions if you +prefer).

+
+
+
+
+
+ +
+
+

Define the function to convert directly from Fahrenheit to Celsius, +by reusing these two functions above

+
+

R +

+
+fahr_to_celsius <- function(temp) {
+  temp_k <- fahr_to_kelvin(temp)
+  result <- kelvin_to_celsius(temp_k)
+  return(result)
+}
+
+
+
+
+
+

Interlude: Defensive Programming +

+

Now that we’ve begun to appreciate how writing functions provides an +efficient way to make R code re-usable and modular, we should note that +it is important to ensure that functions only work in their intended +use-cases. Checking function parameters is related to the concept of +defensive programming. Defensive programming encourages us to +frequently check conditions and throw an error if something is wrong. +These checks are referred to as assertion statements because we want to +assert some condition is TRUE before proceeding. They make +it easier to debug because they give us a better idea of where the +errors originate.

+
+

Checking conditions with stopifnot() +

+

Let’s start by re-examining fahr_to_kelvin(), our +function for converting temperatures from Fahrenheit to Kelvin. It was +defined like so:

+
+

R +

+
+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+

For this function to work as intended, the argument temp +must be a numeric value; otherwise, the mathematical +procedure for converting between the two temperature scales will not +work. To create an error, we can use the function stop(). +For example, since the argument temp must be a +numeric vector, we could check for this condition with an +if statement and throw an error if the condition was +violated. We could augment our function above like so:

+
+

R +

+
+fahr_to_kelvin <- function(temp) {
+  if (!is.numeric(temp)) {
+    stop("temp must be a numeric vector.")
+  }
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+

If we had multiple conditions or arguments to check, it would take many lines of code to check all of them. Luckily R provides the convenience function stopifnot(). We can list as many requirements as we like, each of which should evaluate to TRUE; stopifnot() throws an error if it finds one that is FALSE. Listing these conditions also serves a secondary purpose as extra documentation for the function.

+

Let’s try out defensive programming with stopifnot() by +adding assertions to check the input to our function +fahr_to_kelvin().

+

We want to assert the following: temp is a numeric +vector. We may do that like so:

+
+

R +

+
+fahr_to_kelvin <- function(temp) {
+  stopifnot(is.numeric(temp))
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+

It still works when given proper input.

+
+

R +

+
+# freezing point of water
+fahr_to_kelvin(temp = 32)
+
+
+

OUTPUT +

+
[1] 273.15
+
+

But fails instantly if given improper input.

+
+

R +

+
+# temp is a factor instead of numeric
+fahr_to_kelvin(temp = as.factor(32))
+
+
+

ERROR +

+
Error in fahr_to_kelvin(temp = as.factor(32)): is.numeric(temp) is not TRUE
+
+
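
Since stopifnot() accepts several conditions at once, a sketch with two checks might look like the following. The function name `fahr_to_kelvin_strict` and the extra missing-value check are assumptions added for illustration; the lesson itself only checks is.numeric().

```r
fahr_to_kelvin_strict <- function(temp) {
  stopifnot(
    is.numeric(temp),  # must be a numeric vector
    !anyNA(temp)       # must not contain missing values (illustrative extra check)
  )
  ((temp - 32) * (5 / 9)) + 273.15
}

ok_result <- fahr_to_kelvin_strict(c(32, 212))

# tryCatch() turns the error into a value so that the script keeps running
bad_result <- tryCatch(
  fahr_to_kelvin_strict(c(32, NA)),
  error = function(e) "check failed"
)
bad_result
```

The conditions are evaluated in order, and the error message names the first one that is not TRUE.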
+
+ +
+
+

Challenge 3 +

+
+

Use defensive programming to ensure that our +fahr_to_celsius() function throws an error immediately if +the argument temp is specified inappropriately.

+
+
+
+
+
+ +
+
+

Extend our previous definition of the function by adding in an +explicit call to stopifnot(). Since +fahr_to_celsius() is a composition of two other functions, +checking inside here makes adding checks to the two component functions +redundant.

+
+

R +

+
+fahr_to_celsius <- function(temp) {
+  stopifnot(is.numeric(temp))
+  temp_k <- fahr_to_kelvin(temp)
+  result <- kelvin_to_celsius(temp_k)
+  return(result)
+}
+
+
+
+
+
+
+

More on combining functions +

+

Now, we’re going to define a function that calculates the Gross +Domestic Product of a nation from the data available in our dataset:

+
+

R +

+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat) {
+  gdp <- dat$pop * dat$gdpPercap
+  return(gdp)
+}
+
+

We define calcGDP() by assigning it to the output of function. The list of argument names is contained within parentheses. Next, the body of the function (the statements executed when you call the function) is contained within curly braces ({}).

+

We’ve indented the statements in the body by two spaces. This makes +the code easier to read but does not affect how it operates.

+

When we call the function, the values we pass to it are assigned to +the arguments, which become variables inside the body of the +function.

+

Inside the function, we use the return() function to +send back the result. This return() function is optional: R +will automatically return the results of whatever command is executed on +the last line of the function.

+
+

R +

+
+calcGDP(head(gapminder))
+
+
+

OUTPUT +

+
[1]  6567086330  7585448670  8758855797  9648014150  9678553274 11697659231
+
+

That’s not very informative. Let’s add some more arguments so we can +extract that per year and country.

+
+

R +

+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year=NULL, country=NULL) {
+  if(!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country,]
+  }
+  gdp <- dat$pop * dat$gdpPercap
+
+  new <- cbind(dat, gdp=gdp)
+  return(new)
+}
+
+

If you’ve been writing these functions down into a separate R script (a good idea!), you can load the functions into your R session by using the source() function:

+
+

R +

+
+source("functions/functions-lesson.R")
+
+

Ok, so there’s a lot going on in this function now. In plain English, +the function now subsets the provided data by year if the year argument +isn’t empty, then subsets the result by country if the country argument +isn’t empty. Then it calculates the GDP for whatever subset emerges from +the previous two steps. The function then adds the GDP as a new column +to the subsetted data and returns this as the final result. You can see +that the output is much more informative than a vector of numbers.

+

Let’s take a look at what happens when we specify the year:

+
+

R +

+
+head(calcGDP(gapminder, year=2007))
+
+
+

OUTPUT +

+
       country year      pop continent lifeExp  gdpPercap          gdp
+12 Afghanistan 2007 31889923      Asia  43.828   974.5803  31079291949
+24     Albania 2007  3600523    Europe  76.423  5937.0295  21376411360
+36     Algeria 2007 33333216    Africa  72.301  6223.3675 207444851958
+48      Angola 2007 12420476    Africa  42.731  4797.2313  59583895818
+60   Argentina 2007 40301927  Americas  75.320 12779.3796 515033625357
+72   Australia 2007 20434176   Oceania  81.235 34435.3674 703658358894
+
+

Or for a specific country:

+
+

R +

+
+calcGDP(gapminder, country="Australia")
+
+
+

OUTPUT +

+
     country year      pop continent lifeExp gdpPercap          gdp
+61 Australia 1952  8691212   Oceania  69.120  10039.60  87256254102
+62 Australia 1957  9712569   Oceania  70.330  10949.65 106349227169
+63 Australia 1962 10794968   Oceania  70.930  12217.23 131884573002
+64 Australia 1967 11872264   Oceania  71.100  14526.12 172457986742
+65 Australia 1972 13177000   Oceania  71.930  16788.63 221223770658
+66 Australia 1977 14074100   Oceania  73.490  18334.20 258037329175
+67 Australia 1982 15184200   Oceania  74.740  19477.01 295742804309
+68 Australia 1987 16257249   Oceania  76.320  21888.89 355853119294
+69 Australia 1992 17481977   Oceania  77.560  23424.77 409511234952
+70 Australia 1997 18565243   Oceania  78.830  26997.94 501223252921
+71 Australia 2002 19546792   Oceania  80.370  30687.75 599847158654
+72 Australia 2007 20434176   Oceania  81.235  34435.37 703658358894
+
+

Or both:

+
+

R +

+
+calcGDP(gapminder, year=2007, country="Australia")
+
+
+

OUTPUT +

+
     country year      pop continent lifeExp gdpPercap          gdp
+72 Australia 2007 20434176   Oceania  81.235  34435.37 703658358894
+
+

Let’s walk through the body of the function:

+
+

R +

+
calcGDP <- function(dat, year=NULL, country=NULL) {
+
+

Here we’ve added two arguments, year, and +country. We’ve set default arguments for both as +NULL using the = operator in the function +definition. This means that those arguments will take on those values +unless the user specifies otherwise.
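
To see default arguments in isolation, here is a small sketch. The function `raise_to_power` is a hypothetical example, unrelated to the gapminder data.

```r
# Hypothetical helper, used only to illustrate default arguments
raise_to_power <- function(x, exponent = 2) {
  x^exponent
}

default_result <- raise_to_power(3)                 # exponent falls back to 2
override_result <- raise_to_power(3, exponent = 3)  # caller overrides the default
```

Using NULL as the default, as calcGDP() does, is a common way to mean “no filter requested”.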

+
+

R +

+
+  if(!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country,]
+  }
+
+

Here, we check whether each additional argument is NULL. Whenever an argument is not NULL, we overwrite the dataset stored in dat with the subset selected by that argument.

+

Building these conditionals into the function makes it more flexible +for later. Now, we can use it to calculate the GDP for:

+
  • The whole dataset;
  • A single year;
  • A single country;
  • A single combination of year and country.

Because the subsetting uses %in% rather than ==, we can also give multiple years or countries to those arguments.
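
The difference between == and %in% can be sketched with a small made-up vector (the names `years` and `wanted` are illustrative only):

```r
years <- c(1952, 1957, 1962, 1967)
wanted <- c(1957, 1952)

# == compares element by element, recycling wanted as 1957, 1952, 1957, 1952
equal_mask <- years == wanted

# %in% asks, for each year, whether it appears anywhere in wanted
in_mask <- years %in% wanted

equal_mask
in_mask
```

Here == finds no matches at all, because the positions do not line up, while %in% picks out both requested years.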

+
+
+ +
+
+

Tip: Pass by value +

+
+

Functions in R almost always make copies of the data to operate on +inside of a function body. When we modify dat inside the +function we are modifying the copy of the gapminder dataset stored in +dat, not the original variable we gave as the first +argument.

+

This is called “pass-by-value” and it makes writing code much safer: you can always be sure that whatever changes you make within the body of the function stay inside the body of the function.
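
A short sketch of this behaviour, using a hypothetical function `add_flag` and a made-up data frame:

```r
add_flag <- function(dat) {
  # This modifies the function's own copy of the data frame
  dat$flag <- TRUE
  dat
}

original <- data.frame(value = 1:3)
modified <- add_flag(original)

names(original)  # the data frame we passed in is unchanged
names(modified)  # the returned copy carries the new column
```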

+
+
+
+
+
+ +
+
+

Tip: Function scope +

+
+

Another important concept is scoping: any variables (or functions!) +you create or modify inside the body of a function only exist for the +lifetime of the function’s execution. When we call +calcGDP(), the variables dat, gdp +and new only exist inside the body of the function. Even if +we have variables of the same name in our interactive R session, they +are not modified in any way when executing a function.
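
Scoping can be sketched in the same way. The function `compute_inside` is hypothetical, and the variable name gdp is reused deliberately to show that the two do not interfere.

```r
gdp <- "set in the interactive session"

compute_inside <- function() {
  # This gdp exists only while the function is running
  gdp <- 42
  gdp
}

inside_value <- compute_inside()

inside_value  # the value created inside the function
gdp           # the session variable is untouched
```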

+
+
+
+
+

R +

+
  gdp <- dat$pop * dat$gdpPercap
+  new <- cbind(dat, gdp=gdp)
+  return(new)
+}
+
+

Finally, we calculated the GDP on our new subset, and created a new +data frame with that column added. This means when we call the function +later we can see the context for the returned GDP values, which is much +better than in our first attempt where we got a vector of numbers.

+
+
+ +
+
+

Challenge 4 +

+
+

Test out your GDP function by calculating the GDP for New Zealand in +1987. How does this differ from New Zealand’s GDP in 1952?

+
+
+
+
+
+ +
+
+
+

R +

+
+  calcGDP(gapminder, year = c(1952, 1987), country = "New Zealand")
+
+

GDP for New Zealand in 1987: 65050008703

+

GDP for New Zealand in 1952: 21058193787

+
+
+
+
+
+
+ +
+
+

Challenge 5 +

+
+

The paste() function can be used to combine text together, e.g.:

+
+

R +

+
+best_practice <- c("Write", "programs", "for", "people", "not", "computers")
+paste(best_practice, collapse=" ")
+
+
+

OUTPUT +

+
[1] "Write programs for people not computers"
+
+

Write a function called fence() that takes two vectors +as arguments, called text and wrapper, and +prints out the text wrapped with the wrapper:

+
+

R +

+
+fence(text=best_practice, wrapper="***")
+
+

Note: the paste() function has an argument called sep, which specifies the separator between text. The default is a space: " ". The default for paste0() is no space: "".

+
+
+
+
+
+ +
+
+

Write a function called fence() that takes two vectors +as arguments, called text and wrapper, and +prints out the text wrapped with the wrapper:

+
+

R +

+
+fence <- function(text, wrapper){
+  text <- c(wrapper, text, wrapper)
+  result <- paste(text, collapse = " ")
+  return(result)
+}
+best_practice <- c("Write", "programs", "for", "people", "not", "computers")
+fence(text=best_practice, wrapper="***")
+
+
+

OUTPUT +

+
[1] "*** Write programs for people not computers ***"
+
+
+
+
+
+
+
+ +
+
+

Tip +

+
+

R has some unique aspects that can be exploited when performing more +complicated operations. We will not be writing anything that requires +knowledge of these more advanced concepts. In the future when you are +comfortable writing functions in R, you can learn more by reading the R +Language Manual or this chapter from Advanced R Programming by Hadley +Wickham.

+
+
+
+
+
+ +
+
+

Tip: Testing and documenting +

+
+

It’s important to both test functions and document them. Documentation helps you, and others, understand what the purpose of your function is and how to use it, and it’s important to make sure that your function actually does what you think.

+

When you first start out, your workflow will probably look a lot like +this:

+
  1. Write a function
  2. Comment parts of the function to document its behaviour
  3. Load in the source file
  4. Experiment with it in the console to make sure it behaves as you expect
  5. Make any necessary bug fixes
  6. Rinse and repeat.

Formal documentation for functions, written in separate .Rd files, gets turned into the documentation you see in help files. The roxygen2 package allows R coders to write documentation alongside the function code and then process it into the appropriate .Rd files. You will want to switch to this more formal method of writing documentation when you start writing more complicated R projects. In fact, packages are, in essence, bundles of functions with this formal documentation. Loading your own functions through source("functions.R") is similar to loading someone else’s functions (or your own one day!) through library("package").

+

Formal automated tests can be written using the testthat package.
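
As a rough sketch of what roxygen2-style documentation looks like, the comment block below uses the `#'` prefix with the standard `@param`, `@return` and `@examples` tags. The wording of the documentation is an assumption written for this illustration. The final line is a lightweight base R check, not a testthat test; in a formal test suite an expectation such as testthat's expect_equal() would play this role.

```r
#' Convert Fahrenheit to Kelvin
#'
#' @param temp Numeric vector of temperatures in degrees Fahrenheit.
#' @return Numeric vector of temperatures in Kelvin.
#' @examples
#' fahr_to_kelvin(32)
fahr_to_kelvin <- function(temp) {
  stopifnot(is.numeric(temp))
  ((temp - 32) * (5 / 9)) + 273.15
}

# Lightweight check in base R: stops with an error if the result is wrong
stopifnot(isTRUE(all.equal(fahr_to_kelvin(32), 273.15)))
```

When the file is simply sourced, the `#'` lines behave like ordinary comments; they only become help pages once roxygen2 processes them inside a package.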

+
+
+
+
+
+ +
+
+

Keypoints +

+
+
  • Use function to define a new function in R.
  • +
  • Use parameters to pass values into functions.
  • +
  • Use stopifnot() to flexibly check function arguments in +R.
  • +
  • Load functions into programs using source().
  • +
+
+
+
+
+ + +
+
+
+ +
+
+ + diff --git a/11-writing-data.html b/11-writing-data.html new file mode 100644 index 000000000..0aee86219 --- /dev/null +++ b/11-writing-data.html @@ -0,0 +1,687 @@ + +R for Reproducible Scientific Analysis: Writing Data +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Writing Data

+

Last updated on 2023-10-26

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I save plots and data created in R?
  • +
+
+
+
+
+
+

Objectives

+
  • To be able to write out plots and data from R.
  • +
+
+
+
+
+

Saving plots +

+

You have already seen how to save the most recent plot you create in +ggplot2, using the command ggsave. As a +refresher:

+
+

R +

+
+ggsave("My_most_recent_plot.pdf")
+
+

You can save a plot from within RStudio using the ‘Export’ button in +the ‘Plot’ window. This will give you the option of saving as a .pdf or +as .png, .jpg or other image formats.

+

Sometimes you will want to save plots without creating them in the +‘Plot’ window first. Perhaps you want to make a pdf document with +multiple pages: each one a different plot, for example. Or perhaps +you’re looping through multiple subsets of a file, plotting data from +each subset, and you want to save each plot, but obviously can’t stop +the loop to click ‘Export’ for each one.

+

In this case you can use a more flexible approach. The function +pdf creates a new pdf device. You can control the size and +resolution using the arguments to this function.

+
+

R +

+
+pdf("Life_Exp_vs_time.pdf", width=12, height=4)
+ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=country)) +
+  geom_line() +
+  theme(legend.position = "none")
+
+# You then have to make sure to turn off the pdf device!
+
+dev.off()
+
+

Open up this document and have a look.
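
The looping scenario described earlier can be sketched as follows. This is a minimal example under stated assumptions: it uses a small made-up data frame (`toy`) and base R's plot() rather than ggplot2, and it writes to a temporary directory instead of the project folder.

```r
# Made-up data: three groups with four points each
toy <- data.frame(
  group = rep(c("a", "b", "c"), each = 4),
  x = rep(1:4, times = 3),
  y = c(1, 2, 3, 4, 2, 4, 6, 8, 4, 3, 2, 1)
)

out_file <- file.path(tempdir(), "plots_by_group.pdf")

pdf(out_file, width = 6, height = 4)
for (g in unique(toy$group)) {
  piece <- toy[toy$group == g, ]
  # Each call to plot() starts a new page in the open pdf device
  plot(piece$x, piece$y, type = "l", main = paste("Group", g))
}
invisible(dev.off())  # close the device so the file is written completely

file.exists(out_file)
```

If you adapt this to ggplot2, note that plot objects created inside a loop are not displayed automatically; they need to be wrapped in print() to be drawn on the device.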

+
+
+ +
+
+

Challenge 1 +

+
+

Rewrite your ‘pdf’ command to print a second page in the pdf, showing +a facet plot (hint: use facet_grid) of the same data with +one panel per continent.

+
+
+
+
+
+ +
+
+
+

R +

+
+pdf("Life_Exp_vs_time.pdf", width = 12, height = 4)
+p <- ggplot(data = gapminder, aes(x = year, y = lifeExp, colour = country)) +
+  geom_line() +
+  theme(legend.position = "none")
+p
+p + facet_grid(~continent)
+dev.off()
+
+
+
+
+
+

The commands jpeg, png etc. are used +similarly to produce documents in different formats.

+

Writing data +

+

At some point, you’ll also want to write out data from R.

+

We can use the write.table function for this, which is +very similar to read.table from before.

+

Let’s create a data-cleaning script. For this analysis, we only want to focus on the gapminder data for Australia:

+
+

R +

+
+aust_subset <- gapminder[gapminder$country == "Australia",]
+
+write.table(aust_subset,
+  file="cleaned-data/gapminder-aus.csv",
+  sep=","
+)
+
+

Let’s switch back to the shell to take a look at the data to make +sure it looks OK:

+
+

BASH +

+
head cleaned-data/gapminder-aus.csv
+
+
+

OUTPUT +

+
"country","year","pop","continent","lifeExp","gdpPercap"
+"61","Australia",1952,8691212,"Oceania",69.12,10039.59564
+"62","Australia",1957,9712569,"Oceania",70.33,10949.64959
+"63","Australia",1962,10794968,"Oceania",70.93,12217.22686
+"64","Australia",1967,11872264,"Oceania",71.1,14526.12465
+"65","Australia",1972,13177000,"Oceania",71.93,16788.62948
+"66","Australia",1977,14074100,"Oceania",73.49,18334.19751
+"67","Australia",1982,15184200,"Oceania",74.74,19477.00928
+"68","Australia",1987,16257249,"Oceania",76.32,21888.88903
+"69","Australia",1992,17481977,"Oceania",77.56,23424.76683
+
+

Hmm, that’s not quite what we wanted. Where did all these quotation +marks come from? Also the row numbers are meaningless.

+

Let’s look at the help file to work out how to change this +behaviour.

+
+

R +

+
+?write.table
+
+

By default R will wrap character vectors with quotation marks when +writing out to file. It will also write out the row and column +names.

+

Let’s fix this:

+
+

R +

+
+write.table(
+  gapminder[gapminder$country == "Australia",],
+  file="cleaned-data/gapminder-aus.csv",
+  sep=",", quote=FALSE, row.names=FALSE
+)
+
+

Now let’s look at the data again using our shell skills:

+
+

BASH +

+
head cleaned-data/gapminder-aus.csv
+
+
+

OUTPUT +

+
country,year,pop,continent,lifeExp,gdpPercap
+Australia,1952,8691212,Oceania,69.12,10039.59564
+Australia,1957,9712569,Oceania,70.33,10949.64959
+Australia,1962,10794968,Oceania,70.93,12217.22686
+Australia,1967,11872264,Oceania,71.1,14526.12465
+Australia,1972,13177000,Oceania,71.93,16788.62948
+Australia,1977,14074100,Oceania,73.49,18334.19751
+Australia,1982,15184200,Oceania,74.74,19477.00928
+Australia,1987,16257249,Oceania,76.32,21888.88903
+Australia,1992,17481977,Oceania,77.56,23424.76683
+
+

That looks better!
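
Another way to check the result, without leaving R, is to read the file back in. The sketch below uses a tiny made-up data frame (`toy`) and a temporary file so that it runs on its own; the same read.table() call should work on the file written above, assuming the path is adjusted.

```r
toy <- data.frame(
  country = c("Australia", "Australia"),
  year = c(1952L, 1957L),
  stringsAsFactors = FALSE
)

out_file <- file.path(tempdir(), "toy-subset.csv")

# Same options as in the lesson: comma-separated, no quotes, no row names
write.table(toy, file = out_file, sep = ",", quote = FALSE, row.names = FALSE)

# header = TRUE tells read.table() that the first line holds the column names
read_back <- read.table(out_file, header = TRUE, sep = ",",
                        stringsAsFactors = FALSE)
read_back
```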

+
+
+ +
+
+

Challenge 2 +

+
+

Write a data-cleaning script file that subsets the gapminder data to +include only data points collected since 1990.

+

Use this script to write out the new subset to a file in the +cleaned-data/ directory.

+
+
+
+
+
+ +
+
+
+

R +

+
+write.table(
+  gapminder[gapminder$year > 1990, ],
+  file = "cleaned-data/gapminder-after1990.csv",
+  sep = ",", quote = FALSE, row.names = FALSE
+)
+
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
  • Save plots from RStudio using the ‘Export’ button.
  • +
  • Use write.table to save tabular data.
  • +
+
+
+
+
+ + +
+
+
+ +
+
+ + diff --git a/12-plyr.html b/12-plyr.html new file mode 100644 index 000000000..7b00811df --- /dev/null +++ b/12-plyr.html @@ -0,0 +1,1011 @@ + +R for Reproducible Scientific Analysis: Splitting and Combining Data Frames with plyr +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Splitting and Combining Data Frames with plyr

+

Last updated on 2023-10-26

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I do different calculations on different sets of data?
  • +
+
+
+
+
+
+

Objectives

+
  • To be able to use the split-apply-combine strategy for data +analysis.
  • +
+
+
+
+
+

Previously we looked at how you can use functions to simplify your code. We defined the calcGDP function, which takes the gapminder dataset, and multiplies the population and GDP per capita columns. We also defined additional arguments so we could filter by year and country:

+
+

R +

+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year=NULL, country=NULL) {
+  if(!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country,]
+  }
+  gdp <- dat$pop * dat$gdpPercap
+
+  new <- cbind(dat, gdp=gdp)
+  return(new)
+}
+
+

A common task you’ll encounter when working with data is running calculations on different groups within the data. In the above, we were calculating the GDP by multiplying two columns together. But what if we wanted to calculate the mean GDP per continent?

+

We could run calcGDP and then take the mean of each +continent:

+
+

R +

+
+withGDP <- calcGDP(gapminder)
+mean(withGDP[withGDP$continent == "Africa", "gdp"])
+
+
+

OUTPUT +

+
[1] 20904782844
+
+
+

R +

+
+mean(withGDP[withGDP$continent == "Americas", "gdp"])
+
+
+

OUTPUT +

+
[1] 379262350210
+
+
+

R +

+
+mean(withGDP[withGDP$continent == "Asia", "gdp"])
+
+
+

OUTPUT +

+
[1] 227233738153
+
+

But this isn’t very nice. Yes, by using a function, you have +reduced a substantial amount of repetition. That is +nice. But there is still repetition. Repeating yourself will cost you +time, both now and later, and potentially introduce some nasty bugs.

+

We could write a new function that is flexible like +calcGDP, but this also takes a substantial amount of effort +and testing to get right.

+

The abstract problem we’re encountering here is known as “split-apply-combine”:

+
Split apply combine

We want to split our data into groups, in this case +continents, apply some calculations on that group, then +optionally combine the results together afterwards.
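
To make the three steps concrete, here is a minimal sketch in base R on a made-up data frame (`toy`). It is only an illustration of the idea and does not use the gapminder data.

```r
toy <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 10, 20, 30)
)

# Split: one vector of values per group
pieces <- split(toy$value, toy$group)

# Apply and combine: compute the mean of each piece and collect the results
group_means <- sapply(pieces, mean)
group_means
```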

+

The plyr package +

+

For those of you who have used R before, you might be familiar with the apply family of functions. While R’s built-in functions do work, we’re going to introduce you to another method for solving the “split-apply-combine” problem. The plyr package provides a set of functions that we find more user-friendly for solving this problem.

+

We installed this package in an earlier challenge. Let us load it +now:

+
+

R +

+
+library("plyr")
+
+

Plyr has functions for operating on lists, +data.frames and arrays (matrices, or +n-dimensional vectors). Each function performs:

+
  1. A splitting operation.
  2. An application of a function to each split in turn.
  3. A recombination of the output data into a single data object.

The functions are named based on the data structure they expect as +input, and the data structure you want returned as output: [a]rray, +[l]ist, or [d]ata.frame. The first letter corresponds to the input data +structure, the second letter to the output data structure, and then the +rest of the function is named “ply”.

+

This gives us 9 core functions **ply. There are an additional three functions which will only perform the split and apply steps, and not any combine step. They’re named by their input data type and represent null output by a _ (see table).

+

Note here that plyr’s use of “array” is different to R’s: an array in plyr can include a vector or matrix.

+
Full apply suite

Each of the xxply functions (daply, ddply, llply, laply, …) has the same structure and 4 key features:

+
+

R +

+
+xxply(.data, .variables, .fun)
+
+
  • The first letter of the function name gives the input type and the second gives the output type.
  • .data - gives the data object to be processed
  • .variables - identifies the splitting variables
  • .fun - gives the function to be called on each piece

Now we can quickly calculate the mean GDP per continent:

+
+

R +

+
+ddply(
+ .data = calcGDP(gapminder),
+ .variables = "continent",
+ .fun = function(x) mean(x$gdp)
+)
+
+
+

OUTPUT +

+
  continent           V1
+1    Africa  20904782844
+2  Americas 379262350210
+3      Asia 227233738153
+4    Europe 269442085301
+5   Oceania 188187105354
+
+

Let us walk through the previous code:

+
  • The ddply function feeds in a data.frame (function starts with d) and returns another data.frame (2nd letter is a d).
  • The first argument we gave was the data.frame we wanted to operate on: in this case the gapminder data. We called calcGDP on it first so that it would have the additional gdp column added to it.
  • The second argument indicated our split criteria: in this case the “continent” column. Note that we gave the name of the column, not the values of the column like we had done previously with subsetting. Plyr takes care of these implementation details for you.
  • The third argument is the function we want to apply to each grouping of the data. We had to define our own short function here: each subset of the data gets stored in x, the first argument of our function. This is an anonymous function: we haven’t defined it elsewhere, and it has no name. It only exists in the scope of our call to ddply.
+
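
The idea of an anonymous function can be sketched on its own. The example below uses base R's sapply() on a made-up list so that it runs without plyr; the names `pieces` and `double_mean` are hypothetical.

```r
pieces <- list(first = c(1, 2, 3), second = c(10, 20))

# Anonymous function: defined inline, with no name
anonymous_result <- sapply(pieces, function(x) mean(x) * 2)

# The same logic as a named function, defined once and reusable elsewhere
double_mean <- function(x) mean(x) * 2
named_result <- sapply(pieces, double_mean)

identical(anonymous_result, named_result)
```

An inline function is convenient for a one-off calculation; naming it pays off once the same logic is needed in more than one place.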
+ +
+
+

Challenge 1 +

+
+

Calculate the average life expectancy per continent. Which has the +longest? Which has the shortest?

+
+
+
+
+
+ +
+
+
+

R +

+
+ddply(
+ .data = gapminder,
+ .variables = "continent",
+ .fun = function(x) mean(x$lifeExp)
+)
+
+

Oceania has the longest and Africa the shortest.

+
+
+
+
+

What if we want a different type of output data structure?

+
+

R +

+
+dlply(
+ .data = calcGDP(gapminder),
+ .variables = "continent",
+ .fun = function(x) mean(x$gdp)
+)
+
+
+

OUTPUT +

+
$Africa
+[1] 20904782844
+
+$Americas
+[1] 379262350210
+
+$Asia
+[1] 227233738153
+
+$Europe
+[1] 269442085301
+
+$Oceania
+[1] 188187105354
+
+attr(,"split_type")
+[1] "data.frame"
+attr(,"split_labels")
+  continent
+1    Africa
+2  Americas
+3      Asia
+4    Europe
+5   Oceania
+
+

We called the same function again, but changed the second letter to +an l, so the output was returned as a list.

+

We can specify multiple columns to group by:

+
+

R +

+
+ddply(
+ .data = calcGDP(gapminder),
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$gdp)
+)
+
+
+

OUTPUT +

+
   continent year           V1
+1     Africa 1952   5992294608
+2     Africa 1957   7359188796
+3     Africa 1962   8784876958
+4     Africa 1967  11443994101
+5     Africa 1972  15072241974
+6     Africa 1977  18694898732
+7     Africa 1982  22040401045
+8     Africa 1987  24107264108
+9     Africa 1992  26256977719
+10    Africa 1997  30023173824
+11    Africa 2002  35303511424
+12    Africa 2007  45778570846
+13  Americas 1952 117738997171
+14  Americas 1957 140817061264
+15  Americas 1962 169153069442
+16  Americas 1967 217867530844
+17  Americas 1972 268159178814
+18  Americas 1977 324085389022
+19  Americas 1982 363314008350
+20  Americas 1987 439447790357
+21  Americas 1992 489899820623
+22  Americas 1997 582693307146
+23  Americas 2002 661248623419
+24  Americas 2007 776723426068
+25      Asia 1952  34095762661
+26      Asia 1957  47267432088
+27      Asia 1962  60136869012
+28      Asia 1967  84648519224
+29      Asia 1972 124385747313
+30      Asia 1977 159802590186
+31      Asia 1982 194429049919
+32      Asia 1987 241784763369
+33      Asia 1992 307100497486
+34      Asia 1997 387597655323
+35      Asia 2002 458042336179
+36      Asia 2007 627513635079
+37    Europe 1952  84971341466
+38    Europe 1957 109989505140
+39    Europe 1962 138984693095
+40    Europe 1967 173366641137
+41    Europe 1972 218691462733
+42    Europe 1977 255367522034
+43    Europe 1982 279484077072
+44    Europe 1987 316507473546
+45    Europe 1992 342703247405
+46    Europe 1997 383606933833
+47    Europe 2002 436448815097
+48    Europe 2007 493183311052
+49   Oceania 1952  54157223944
+50   Oceania 1957  66826828013
+51   Oceania 1962  82336453245
+52   Oceania 1967 105958863585
+53   Oceania 1972 134112109227
+54   Oceania 1977 154707711162
+55   Oceania 1982 176177151380
+56   Oceania 1987 209451563998
+57   Oceania 1992 236319179826
+58   Oceania 1997 289304255183
+59   Oceania 2002 345236880176
+60   Oceania 2007 403657044512
+
+
+

R +

+
+daply(
+ .data = calcGDP(gapminder),
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$gdp)
+)
+
+
+

OUTPUT +

+
          year
+continent          1952         1957         1962         1967         1972
+  Africa     5992294608   7359188796   8784876958  11443994101  15072241974
+  Americas 117738997171 140817061264 169153069442 217867530844 268159178814
+  Asia      34095762661  47267432088  60136869012  84648519224 124385747313
+  Europe    84971341466 109989505140 138984693095 173366641137 218691462733
+  Oceania   54157223944  66826828013  82336453245 105958863585 134112109227
+          year
+continent          1977         1982         1987         1992         1997
+  Africa    18694898732  22040401045  24107264108  26256977719  30023173824
+  Americas 324085389022 363314008350 439447790357 489899820623 582693307146
+  Asia     159802590186 194429049919 241784763369 307100497486 387597655323
+  Europe   255367522034 279484077072 316507473546 342703247405 383606933833
+  Oceania  154707711162 176177151380 209451563998 236319179826 289304255183
+          year
+continent          2002         2007
+  Africa    35303511424  45778570846
+  Americas 661248623419 776723426068
+  Asia     458042336179 627513635079
+  Europe   436448815097 493183311052
+  Oceania  345236880176 403657044512
+
+

You can use these functions in place of for loops (and +it is usually faster to do so). To replace a for loop, put the code that +was in the body of the for loop inside an anonymous +function.

+
+

R +

+
+d_ply(
+  .data=gapminder,
+  .variables = "continent",
+  .fun = function(x) {
+    meanGDPperCap <- mean(x$gdpPercap)
+    print(paste(
+      "The mean GDP per capita for", unique(x$continent),
+      "is", format(meanGDPperCap, big.mark=",")
+   ))
+  }
+)
+
+
+

OUTPUT +

+
[1] "The mean GDP per capita for Africa is 2,193.755"
+[1] "The mean GDP per capita for Americas is 7,136.11"
+[1] "The mean GDP per capita for Asia is 7,902.15"
+[1] "The mean GDP per capita for Europe is 14,469.48"
+[1] "The mean GDP per capita for Oceania is 18,621.61"
+
+
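For comparison, the same kind of message could be produced with an explicit for loop. This is a sketch in base R on a made-up `toy` data frame (hypothetical values, not gapminder); the body of the loop is what ends up inside the anonymous function when a plyr function is used instead.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  continent = c("Africa", "Africa", "Europe", "Europe"),
  gdpPercap = c(1000, 3000, 20000, 30000)
)

messages <- character(0)
for (this_continent in unique(toy$continent)) {
  # Subset the rows for one group; conceptually this is the piece of data
  # that a plyr function would pass to the anonymous function as x
  x <- toy[toy$continent == this_continent, ]
  meanGDPperCap <- mean(x$gdpPercap)
  messages <- c(messages, paste(
    "The mean GDP per capita for", this_continent,
    "is", format(meanGDPperCap, big.mark = ",")
  ))
}
print(messages)
```

The loop version needs extra bookkeeping (the subsetting and the `messages` vector) that the plyr call handles for you.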
+
+ +
+
+

Tip: printing numbers +

+
+

The format function can be used to make numeric values +“pretty” for printing out in messages.
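As a small illustration, here is a sketch of two commonly used `format()` arguments. The number is an arbitrary value chosen for the example.

```r
# Arbitrary value, chosen only to show the formatting options
x <- 1234567.891

# big.mark inserts a thousands separator
with_commas <- format(x, big.mark = ",")

# nsmall sets the minimum number of digits after the decimal point
two_decimals <- format(x, big.mark = ",", nsmall = 2)

print(with_commas)
print(two_decimals)
```

Note that `format()` returns a character string, so the result is meant for display rather than further arithmetic.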

+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

Calculate the average life expectancy per continent and year. Which had the longest and shortest in 2007? Which had the greatest change between 1952 and 2007?

+
+
+
+
+
+ +
+
+
+

R +

+
+solution <- ddply(
+ .data = gapminder,
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$lifeExp)
+)
+solution_2007 <- solution[solution$year == 2007, ]
+solution_2007
+
+

Oceania had the longest average life expectancy in 2007 and Africa the shortest.

+
+

R +

+
+solution_1952_2007 <- cbind(solution[solution$year == 1952, ], solution_2007)
+difference_1952_2007 <- data.frame(continent = solution_1952_2007$continent,
+                                   year_1952 = solution_1952_2007[[3]],
+                                   year_2007 = solution_1952_2007[[6]],
+                                   difference = solution_1952_2007[[6]] - solution_1952_2007[[3]])
+difference_1952_2007
+
+

Asia had the greatest difference, and Oceania the least.

+
+
+
+
+
+
+ +
+
+

Alternate Challenge +

+
+

Without running them, which of the following will calculate the +average life expectancy per continent:

+
  1. +
+

R +

+
+ddply(
+  .data = gapminder,
+  .variables = gapminder$continent,
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)
+
+
  2. +
+

R +

+
+ddply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = mean(dataGroup$lifeExp)
+)
+
+
  3. +
+

R +

+
+ddply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)
+
+
  4. +
+

R +

+
+adply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)
+
+
+
+
+
+
+ +
+
+

Answer 3 will calculate the average life expectancy per +continent.

+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
  • Use the plyr package to split data, apply functions to +subsets, and combine the results.
  • +
+
+
+
+
+ + +
+
+
+ +
+
+ + diff --git a/13-dplyr.html b/13-dplyr.html new file mode 100644 index 000000000..cdef25e5a --- /dev/null +++ b/13-dplyr.html @@ -0,0 +1,1239 @@ + +R for Reproducible Scientific Analysis: Data Frame Manipulation with dplyr +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Data Frame Manipulation with dplyr

+

Last updated on 2023-10-26

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I manipulate data frames without repeating myself?
  • +
+
+
+
+
+
+

Objectives

+
  • To be able to use the six main data frame manipulation ‘verbs’ with +pipes in dplyr.
  • +
  • To understand how group_by() and +summarize() can be combined to summarize datasets.
  • +
  • Be able to analyze a subset of data using logical filtering.
  • +
+
+
+
+
+

Manipulation of data frames means many things to many researchers: we +often select certain observations (rows) or variables (columns), we +often group the data by a certain variable(s), or we even calculate +summary statistics. We can do these operations using the normal base R +operations:

+
+

R +

+
+mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
+
+
+

OUTPUT +

+
[1] 2193.755
+
+
+

R +

+
+mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])
+
+
+

OUTPUT +

+
[1] 7136.11
+
+
+

R +

+
+mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])
+
+
+

OUTPUT +

+
[1] 7902.15
+
+

But this isn’t very nice because there is a fair bit of +repetition. Repeating yourself will cost you time, both now and later, +and potentially introduce some nasty bugs.

+

The dplyr package +

+

Luckily, the dplyr +package provides a number of very useful functions for manipulating data +frames in a way that will reduce the above repetition, reduce the +probability of making errors, and probably even save you some typing. As +an added bonus, you might even find the dplyr grammar +easier to read.

+
+
+ +
+
+

Tip: Tidyverse +

+
+

The dplyr package belongs to a broader family of opinionated R packages designed for data science called the “Tidyverse”. These packages are specifically designed to work harmoniously together. Some of these packages will be covered in this course, but you can find more complete information here: https://www.tidyverse.org/.

+
+
+
+

Here we’re going to cover 5 of the most commonly used functions as +well as using pipes (%>%) to combine them.

+
  1. select()
  2. filter()
  3. group_by()
  4. summarize()
  5. mutate()

If you have not installed this package earlier, please do so:

+
+

R +

+
+install.packages('dplyr')
+
+

Now let’s load the package:

+
+

R +

+
+library("dplyr")
+
+

Using select() +

+

If, for example, we wanted to move forward with only a few of the +variables in our data frame we could use the select() +function. This will keep only the variables you select.

+
+

R +

+
+year_country_gdp <- select(gapminder, year, country, gdpPercap)
+
+

Diagram illustrating use of select function to select two columns of a data frame

We can also remove a single column from the gapminder data, for example the continent column, by prefixing its name with a minus sign.

+
+

R +

+
+smaller_gapminder_data <- select(gapminder, -continent)
+
+

If we open up year_country_gdp we’ll see that it only +contains the year, country and gdpPercap. Above we used ‘normal’ +grammar, but the strengths of dplyr lie in combining +several functions using pipes. Since the pipes grammar is unlike +anything we’ve seen in R before, let’s repeat what we’ve done above +using pipes.

+
+

R +

+
+year_country_gdp <- gapminder %>% select(year, country, gdpPercap)
+
+

To help you understand why we wrote that in that way, let’s walk through it step by step. First we summon the gapminder data frame and pass it on, using the pipe symbol %>%, to the next step, which is the select() function. In this case we don’t specify which data object we use in the select() function since it gets that from the previous pipe. Fun Fact: There is a good chance you have encountered pipes before in the shell. In R, a pipe symbol is %>% while in the shell it is | but the concept is the same!
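To make the step-by-step description concrete, the sketch below shows that a piped call and an ordinary call produce the same result. It assumes dplyr is installed; the `toy` data frame is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  country = c("A", "B", "C"),
  year = c(2000, 2000, 2005),
  gdpPercap = c(100, 250, 400)
)

# Ordinary call: the data frame is written as the first argument
without_pipe <- select(toy, year, gdpPercap)

# Piped call: %>% passes `toy` in as the first argument of select()
with_pipe <- toy %>% select(year, gdpPercap)

# Both forms give the same data frame
print(identical(without_pipe, with_pipe))
```

The piped form pays off once several steps are chained, because each step reads in the order in which it is applied.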

+
+
+ +
+
+

Tip: Renaming data frame columns in dplyr +

+
+

In Chapter 4 we covered how you can rename columns with base R by assigning a value to the output of the names() function. Just like selecting columns in base R, this is a bit cumbersome, but thankfully dplyr has a rename() function.

+

Within a pipeline, the syntax is +rename(new_name = old_name). For example, we may want to +rename the gdpPercap column name from our select() +statement above.

+
+

R +

+
+tidy_gdp <- year_country_gdp %>% rename(gdp_per_capita = gdpPercap)
+
+head(tidy_gdp)
+
+
+

OUTPUT +

+
  year     country gdp_per_capita
+1 1952 Afghanistan       779.4453
+2 1957 Afghanistan       820.8530
+3 1962 Afghanistan       853.1007
+4 1967 Afghanistan       836.1971
+5 1972 Afghanistan       739.9811
+6 1977 Afghanistan       786.1134
+
+
+
+
+

Using filter() +

+

If we now want to move forward with the above, but only with European countries, we can combine select and filter:

+
+

R +

+
+year_country_gdp_euro <- gapminder %>%
+    filter(continent == "Europe") %>%
+    select(year, country, gdpPercap)
+
+

If we now want to show life expectancy of European countries but only +for a specific year (e.g., 2007), we can do as below.

+
+

R +

+
+europe_lifeExp_2007 <- gapminder %>%
+  filter(continent == "Europe", year == 2007) %>%
+  select(country, lifeExp)
+
+
+
+ +
+
+

Challenge 1 +

+
+

Write a single command (which can span multiple lines and includes +pipes) that will produce a data frame that has the African values for +lifeExp, country and year, but +not for other Continents. How many rows does your data frame have and +why?

+
+
+
+
+
+ +
+
+
+

R +

+
+year_country_lifeExp_Africa <- gapminder %>%
+                           filter(continent == "Africa") %>%
+                           select(year, country, lifeExp)
+
+
+
+
+
+

As with last time, first we pass the gapminder data frame to the +filter() function, then we pass the filtered version of the +gapminder data frame to the select() function. +Note: The order of operations is very important in this +case. If we used ‘select’ first, filter would not be able to find the +variable continent since we would have removed it in the previous +step.
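The effect of the ordering can be checked on a small example. This is a sketch assuming dplyr is installed and that no other object called `continent` exists in the session; the `toy` data frame is hypothetical, and `tryCatch()` is used only so that the script keeps running after the expected error.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  country = c("A", "B", "C"),
  continent = c("Africa", "Europe", "Africa"),
  lifeExp = c(50, 75, 55)
)

# filter() first: `continent` still exists when the rows are chosen
right_order <- toy %>%
  filter(continent == "Africa") %>%
  select(country, lifeExp)

# select() first: `continent` has already been dropped,
# so filter() raises an error, which is caught here
wrong_order <- tryCatch(
  toy %>%
    select(country, lifeExp) %>%
    filter(continent == "Africa"),
  error = function(e) "error: filter() could not find continent"
)

print(right_order)
print(wrong_order)
```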

+

Using group_by() +

+

Now, we were supposed to be reducing the error-prone repetitiveness of what can be done with base R, but up to now we haven’t done that since we would have to repeat the above for each continent. Instead of filter(), which will only pass observations that meet your criteria (in the above: continent=="Europe"), we can use group_by(), which will essentially use every unique value that you could have used as a criterion in filter().

+
+

R +

+
+str(gapminder)
+
+
+

OUTPUT +

+
'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...
+
+
+

R +

+
+str(gapminder %>% group_by(continent))
+
+
+

OUTPUT +

+
gropd_df [1,704 × 6] (S3: grouped_df/tbl_df/tbl/data.frame)
+ $ country  : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num [1:1704] 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr [1:1704] "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
+ - attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
+  ..$ continent: chr [1:5] "Africa" "Americas" "Asia" "Europe" ...
+  ..$ .rows    : list<int> [1:5] 
+  .. ..$ : int [1:624] 25 26 27 28 29 30 31 32 33 34 ...
+  .. ..$ : int [1:300] 49 50 51 52 53 54 55 56 57 58 ...
+  .. ..$ : int [1:396] 1 2 3 4 5 6 7 8 9 10 ...
+  .. ..$ : int [1:360] 13 14 15 16 17 18 19 20 21 22 ...
+  .. ..$ : int [1:24] 61 62 63 64 65 66 67 68 69 70 ...
+  .. ..@ ptype: int(0) 
+  ..- attr(*, ".drop")= logi TRUE
+
+

You will notice that the structure of the data frame where we used group_by() (grouped_df) is not the same as the original gapminder (data.frame). A grouped_df can be thought of as a list where each item in the list is a data.frame which contains only the rows that correspond to a particular value of continent (at least in the example above).

+
Diagram illustrating how the group by function organizes a data frame into groups
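The "list of data frames" picture can be made explicit with a few dplyr helper functions. This is a sketch on a made-up `toy` data frame, assuming a reasonably recent dplyr; a grouped data frame itself stores row indices for each group rather than separate copies, so `group_split()` is shown here only as a way to look at the pieces.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  continent = c("Africa", "Europe", "Africa", "Europe"),
  lifeExp = c(50, 75, 55, 80)
)

grouped <- toy %>% group_by(continent)

# One row per group: the values that define each group
keys <- group_keys(grouped)

# A list with one data frame per group
pieces <- group_split(grouped)

print(n_groups(grouped))
print(keys)
print(pieces[[1]])
```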

Using summarize() +

+

The above was a bit on the uneventful side but +group_by() is much more exciting in conjunction with +summarize(). This will allow us to create new variable(s) +by using functions that repeat for each of the continent-specific data +frames. That is to say, using the group_by() function, we +split our original data frame into multiple pieces, then we can run +functions (e.g. mean() or sd()) within +summarize().

+
+

R +

+
+gdp_bycontinents <- gapminder %>%
+    group_by(continent) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap))
+
+
Diagram illustrating the use of group by and summarize together to create a new variable
+

OUTPUT +

+
continent mean_gdpPercap
+     <fctr>          <dbl>
+1    Africa       2193.755
+2  Americas       7136.110
+3      Asia       7902.150
+4    Europe      14469.476
+5   Oceania      18621.609
+
+

That allowed us to calculate the mean gdpPercap for each continent, +but it gets even better.

+
+
+ +
+
+

Challenge 2 +

+
+

Calculate the average life expectancy per country. Which has the +longest average life expectancy and which has the shortest average life +expectancy?

+
+
+
+
+
+ +
+
+
+

R +

+
+lifeExp_bycountry <- gapminder %>%
+   group_by(country) %>%
+   summarize(mean_lifeExp = mean(lifeExp))
+lifeExp_bycountry %>%
+   filter(mean_lifeExp == min(mean_lifeExp) | mean_lifeExp == max(mean_lifeExp))
+
+
+

OUTPUT +

+
# A tibble: 2 × 2
+  country      mean_lifeExp
+  <chr>               <dbl>
+1 Iceland              76.5
+2 Sierra Leone         36.8
+
+

Another way to do this is to use the dplyr function +arrange(), which arranges the rows in a data frame +according to the order of one or more variables from the data frame. It +has similar syntax to other functions from the dplyr +package. You can use desc() inside arrange() +to sort in descending order.

+
+

R +

+
+lifeExp_bycountry %>%
+   arrange(mean_lifeExp) %>%
+   head(1)
+
+
+

OUTPUT +

+
# A tibble: 1 × 2
+  country      mean_lifeExp
+  <chr>               <dbl>
+1 Sierra Leone         36.8
+
+
+

R +

+
+lifeExp_bycountry %>%
+   arrange(desc(mean_lifeExp)) %>%
+   head(1)
+
+
+

OUTPUT +

+
# A tibble: 1 × 2
+  country mean_lifeExp
+  <chr>          <dbl>
+1 Iceland         76.5
+
+

Alphabetical order works too:

+
+

R +

+
+lifeExp_bycountry %>%
+   arrange(desc(country)) %>%
+   head(1)
+
+
+

OUTPUT +

+
# A tibble: 1 × 2
+  country  mean_lifeExp
+  <chr>           <dbl>
+1 Zimbabwe         52.7
+
+
+
+
+
+

The function group_by() allows us to group by multiple +variables. Let’s group by year and +continent.

+
+

R +

+
+gdp_bycontinents_byyear <- gapminder %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
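The message above refers to the `.groups` argument of `summarize()`. As a sketch, assuming dplyr 1.0 or later and a made-up `toy` data frame, setting `.groups = "drop"` returns an ungrouped result and the message is no longer shown.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  continent = c("Africa", "Africa", "Europe", "Europe"),
  year = c(2002, 2007, 2002, 2007),
  gdpPercap = c(100, 120, 300, 330)
)

# .groups = "drop" removes all grouping from the summarized result
by_continent_year <- toy %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap), .groups = "drop")

print(is_grouped_df(by_continent_year))
print(by_continent_year)
```

Leaving the default in place is also fine; the message is informational, and the result is still grouped by `continent`.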

That is already quite powerful, but it gets even better! You’re not +limited to defining 1 new variable in summarize().

+
+

R +

+
+gdp_pop_bycontinents_byyear <- gapminder %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+

count() and n() +

+

A very common operation is to count the number of observations for +each group. The dplyr package comes with two related +functions that help with this.

+

For instance, if we wanted to check the number of countries included +in the dataset for the year 2002, we can use the count() +function. It takes the name of one or more columns that contain the +groups we are interested in, and we can optionally sort the results in +descending order by adding sort=TRUE:

+
+

R +

+
+gapminder %>%
+    filter(year == 2002) %>%
+    count(continent, sort = TRUE)
+
+
+

OUTPUT +

+
  continent  n
+1    Africa 52
+2      Asia 33
+3    Europe 30
+4  Americas 25
+5   Oceania  2
+
+

If we need to use the number of observations in calculations, the n() function is useful. It will return the total number of observations in the current group rather than counting the number of observations in each group within a specific column. For instance, if we wanted to get the standard error of the life expectancy per continent:

+
+

R +

+
+gapminder %>%
+    group_by(continent) %>%
+    summarize(se_le = sd(lifeExp)/sqrt(n()))
+
+
+

OUTPUT +

+
# A tibble: 5 × 2
+  continent se_le
+  <chr>     <dbl>
+1 Africa    0.366
+2 Americas  0.540
+3 Asia      0.596
+4 Europe    0.286
+5 Oceania   0.775
+
+
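The relationship between the two functions can be sketched on a small example: counting with `count()` gives the same numbers as grouping and then calling `n()` inside `summarize()`. The `toy` data frame is hypothetical and dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  country = c("A", "B", "C", "D", "E"),
  continent = c("Africa", "Africa", "Africa", "Europe", "Europe")
)

# Shortcut: count rows per continent
via_count <- toy %>% count(continent)

# Longer form: group, then ask for the size of each group with n()
via_summarize <- toy %>%
  group_by(continent) %>%
  summarize(n = n())

print(via_count)
print(via_summarize)
```

Use `count()` when the count is all that is needed, and `n()` when the group size feeds into another calculation.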

You can also chain together several summary operations; in this case +calculating the minimum, maximum, +mean and se of each continent’s per-country +life-expectancy:

+
+

R +

+
+gapminder %>%
+    group_by(continent) %>%
+    summarize(
+      mean_le = mean(lifeExp),
+      min_le = min(lifeExp),
+      max_le = max(lifeExp),
+      se_le = sd(lifeExp)/sqrt(n()))
+
+
+

OUTPUT +

+
# A tibble: 5 × 5
+  continent mean_le min_le max_le se_le
+  <chr>       <dbl>  <dbl>  <dbl> <dbl>
+1 Africa       48.9   23.6   76.4 0.366
+2 Americas     64.7   37.6   80.7 0.540
+3 Asia         60.1   28.8   82.6 0.596
+4 Europe       71.9   43.6   81.8 0.286
+5 Oceania      74.3   69.1   81.2 0.775
+
+

Using mutate() +

+

We can also create new variables prior to (or even after) summarizing +information using mutate().

+
+

R +

+
+gdp_pop_bycontinents_byyear <- gapminder %>%
+    mutate(gdp_billion = gdpPercap*pop/10^9) %>%
+    group_by(continent,year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop),
+              mean_gdp_billion = mean(gdp_billion),
+              sd_gdp_billion = sd(gdp_billion))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+

Connect mutate with logical filtering: ifelse +

+

When creating new variables, we can combine this with a logical condition. A simple combination of mutate() and ifelse() facilitates filtering right where it is needed: at the moment of creating something new. This easy-to-read statement is a fast and powerful way of discarding certain data (even though the overall dimension of the data frame will not change) or of updating values depending on the given condition.

+
+

R +

+
+## keeping all data but "filtering" after a certain condition
+# calculate GDP only for countries with a life expectancy above 25
+gdp_pop_bycontinents_byyear_above25 <- gapminder %>%
+    mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop),
+              mean_gdp_billion = mean(gdp_billion),
+              sd_gdp_billion = sd(gdp_billion))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
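One consequence of filling values with `NA` is worth checking: `mean()` and `sd()` return `NA` for any group that contains a missing value unless `na.rm = TRUE` is given. The sketch below shows this on a made-up `toy` data frame (hypothetical values, dplyr assumed to be installed).

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  continent = c("Africa", "Africa", "Europe", "Europe"),
  lifeExp = c(24, 50, 70, 75),
  gdpPercap = c(500, 800, 20000, 25000),
  pop = c(1e6, 2e6, 5e6, 6e6)
)

result <- toy %>%
  mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>%
  group_by(continent) %>%
  summarize(
    # NA propagates: a group containing an NA gets an NA mean
    mean_with_na = mean(gdp_billion),
    # na.rm = TRUE drops the NA values before averaging
    mean_without_na = mean(gdp_billion, na.rm = TRUE)
  )

print(result)
```

Whether dropping the missing values is appropriate depends on the analysis, so it is a choice to make deliberately rather than a default.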
+

R +

+
+## updating only if certain condition is fulfilled
+# for life expectancies above 40 years, the GDP to be expected in the future is scaled
+gdp_future_bycontinents_byyear_high_lifeExp <- gapminder %>%
+    mutate(gdp_futureExpectation = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              mean_gdpPercap_expected = mean(gdp_futureExpectation))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+

Combining dplyr and ggplot2 +

+

First install and load ggplot2:

+
+

R +

+
+install.packages('ggplot2')
+
+
+

R +

+
+library("ggplot2")
+
+

In the plotting lesson we looked at how to make a multi-panel figure +by adding a layer of facet panels using ggplot2. Here is +the code we used (with some extra comments):

+
+

R +

+
+# Filter countries located in the Americas
+americas <- gapminder[gapminder$continent == "Americas", ]
+# Make the plot
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))
+
+

This code makes the right plot but it also creates an intermediate variable (americas) that we might not have any other uses for. Just as we used %>% to pipe data along a chain of dplyr functions we can use it to pass data to ggplot(). Because %>% supplies the first argument of a function, we don’t need to specify the data = argument in the ggplot() function. By combining dplyr and ggplot2 functions we can make the same figure without creating any new variables or modifying the data.

+
+

R +

+
+gapminder %>%
+  # Filter countries located in the Americas
+  filter(continent == "Americas") %>%
+  # Make the plot
+  ggplot(mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))
+
+

More examples of using the function mutate() and the +ggplot2 package.

+
+

R +

+
+gapminder %>%
+  # extract first letter of country name into new column
+  mutate(startsWith = substr(country, 1, 1)) %>%
+  # only keep countries starting with A or Z
+  filter(startsWith %in% c("A", "Z")) %>%
+  # plot lifeExp into facets
+  ggplot(aes(x = year, y = lifeExp, colour = continent)) +
+  geom_line() +
+  facet_wrap(vars(country)) +
+  theme_minimal()
+
+
+
+ +
+
+

Advanced Challenge +

+
+

Calculate the average life expectancy in 2002 of 2 randomly selected +countries for each continent. Then arrange the continent names in +reverse order. Hint: Use the dplyr +functions arrange() and sample_n(), they have +similar syntax to other dplyr functions.

+
+
+
+
+
+ +
+
+
+

R +

+
+lifeExp_2countries_bycontinents <- gapminder %>%
+   filter(year==2002) %>%
+   group_by(continent) %>%
+   sample_n(2) %>%
+   summarize(mean_lifeExp=mean(lifeExp)) %>%
+   arrange(desc(continent))
+
+
+
+
+
+

Other great resources +

+
+
+ +
+
+

Keypoints +

+
+
  • Use the dplyr package to manipulate data frames.
  • +
  • Use select() to choose variables from a data +frame.
  • +
  • Use filter() to choose data based on values.
  • +
  • Use group_by() and summarize() to work +with subsets of data.
  • +
  • Use mutate() to create new variables.
  • +
+
+
+
+
+ + +
+
+
+ +
+
+ + diff --git a/14-tidyr.html b/14-tidyr.html new file mode 100644 index 000000000..74127b3b2 --- /dev/null +++ b/14-tidyr.html @@ -0,0 +1,1160 @@ + +R for Reproducible Scientific Analysis: Data Frame Manipulation with tidyr +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Data Frame Manipulation with tidyr

+

Last updated on 2023-10-26

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I change the layout of a data frame?
  • +
+
+
+
+
+
+

Objectives

+
  • To understand the concepts of ‘longer’ and ‘wider’ data frame +formats and be able to convert between them with +tidyr.
  • +
+
+
+
+
+

Researchers often want to reshape their data frames from ‘wide’ to +‘longer’ layouts, or vice-versa. The ‘long’ layout or format is +where:

+
  • each column is a variable
  • +
  • each row is an observation
  • +

In the purely ‘long’ (or ‘longest’) format, you usually have 1 column +for the observed variable and the other columns are ID variables.

+

For the ‘wide’ format each row is often a site/subject/patient and you have multiple observation variables containing the same type of data. These can be either repeated observations over time, or observation of multiple variables (or a mix of both). You may find data input may be simpler or some other applications may prefer the ‘wide’ format. However, many of R’s functions have been designed assuming you have ‘longer’ formatted data. This tutorial will help you efficiently transform your data shape regardless of original format.

+
Diagram illustrating the difference between a wide versus long layout of a data frame

Long and wide data frame layouts mainly affect readability. For +humans, the wide format is often more intuitive since we can often see +more of the data on the screen due to its shape. However, the long +format is more machine readable and is closer to the formatting of +databases. The ID variables in our data frames are similar to the fields +in a database and observed variables are like the database values.

+

Getting started +

+

First install the packages if you haven’t already done so (you +probably installed dplyr in the previous lesson):

+
+

R +

+
+#install.packages("tidyr")
+#install.packages("dplyr")
+
+

Load the packages

+
+

R +

+
+library("tidyr")
+library("dplyr")
+
+

First, let’s look at the structure of our original gapminder data frame:

+
+

R +

+
+str(gapminder)
+
+
+

OUTPUT +

+
'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...
+
+
+
+ +
+
+

Challenge 1 +

+
+

Is gapminder a purely long, purely wide, or some intermediate +format?

+
+
+
+
+
+ +
+
+

The original gapminder data.frame is in an intermediate format. It is not purely long since it has multiple observation variables (pop, lifeExp, gdpPercap).

+
+
+
+
+

Sometimes, as with the gapminder dataset, we have multiple types of +observed data. It is somewhere in between the purely ‘long’ and ‘wide’ +data formats. We have 3 “ID variables” (continent, +country, year) and 3 “Observation variables” +(pop,lifeExp,gdpPercap). This +intermediate format can be preferred despite not having ALL observations +in 1 column given that all 3 observation variables have different units. +There are few operations that would need us to make this data frame any +longer (i.e. 4 ID variables and 1 Observation variable).

+

While using many of the functions in R, which are often vector based, you usually do not want to do mathematical operations on values with different units. For example, using the purely long format, a single mean for all of the values of population, life expectancy, and GDP would not be meaningful since it would return the mean of values with 3 incompatible units. The solution is that we first manipulate the data either by grouping (see the lesson on dplyr), or we change the structure of the data frame. Note: Some plotting functions in R actually work better with wide format data.

+

From wide to long format with pivot_longer() +

+

Until now, we’ve been using the nicely formatted original gapminder +dataset, but ‘real’ data (i.e. our own research data) will never be so +well organized. Here let’s start with the wide formatted version of the +gapminder dataset.

+
+

Download the wide version of the gapminder data from here and save it in your data +folder.

+
+

We’ll load the data file and look at it. Note: we don’t want our +continent and country columns to be factors, so we use the +stringsAsFactors argument for read.csv() to disable +that.

+
+

R +

+
+gap_wide <- read.csv("data/gapminder_wide.csv", stringsAsFactors = FALSE)
+str(gap_wide)
+
+
+

OUTPUT +

+
'data.frame':	142 obs. of  38 variables:
+ $ continent     : chr  "Africa" "Africa" "Africa" "Africa" ...
+ $ country       : chr  "Algeria" "Angola" "Benin" "Botswana" ...
+ $ gdpPercap_1952: num  2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num  3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num  2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num  3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num  4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num  4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num  5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num  5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num  5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num  4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num  5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num  6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num  43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num  45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num  48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num  51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num  54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num  58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num  61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num  65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num  67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num  69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num  71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num  72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num  9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num  10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num  11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num  12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num  14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num  17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num  20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num  23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num  26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num  29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : int  31287142 10866106 7026113 1630347 12251209 7021078 15929988 4048013 8835739 614382 ...
+ $ pop_2007      : int  33333216 12420476 8078314 1639131 14326203 8390505 17696293 4369038 10238807 710960 ...
+
+
Diagram illustrating the wide format of the gapminder data frame

To change this very wide data frame layout back to our nice, +intermediate (or longer) layout, we will use one of the two available +pivot functions from the tidyr package. To +convert from wide to a longer format, we will use the +pivot_longer() function. pivot_longer() makes +datasets longer by increasing the number of rows and decreasing the +number of columns, or ‘lengthening’ your observation variables into a +single variable.

+
Diagram illustrating how pivot longer reorganizes a data frame from a wide to long format
+

R +

+
+gap_long <- gap_wide %>%
+  pivot_longer(
+    cols = c(starts_with('pop'), starts_with('lifeExp'), starts_with('gdpPercap')),
+    names_to = "obstype_year", values_to = "obs_values"
+  )
+str(gap_long)
+
+
+

OUTPUT +

+
tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ continent   : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
+ $ country     : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
+ $ obstype_year: chr [1:5112] "pop_1952" "pop_1957" "pop_1962" "pop_1967" ...
+ $ obs_values  : num [1:5112] 9279525 10270856 11000948 12760499 14760787 ...
+
+

Here we have used piping syntax which is similar to what we were +doing in the previous lesson with dplyr. In fact, these are compatible +and you can use a mix of tidyr and dplyr functions by piping them +together.

+

We first provide to pivot_longer() a vector of column names that will be pivoted into longer format. We could type out all the observation variables, but as in the select() function (see the dplyr lesson), we can use the starts_with() helper to select all variables that start with the desired character string. pivot_longer() also accepts the alternative syntax of using the - symbol to identify which variables are not to be pivoted (i.e. the ID variables).

+

The next arguments to pivot_longer() are names_to, for naming the column that will contain the new ID variable (obstype_year), and values_to, for naming the new amalgamated observation variable (obs_values). We supply these new column names as strings.

+
Diagram illustrating the long format of the gapminder data
+

R +

+
+gap_long <- gap_wide %>%
+  pivot_longer(
+    cols = c(-continent, -country),
+    names_to = "obstype_year", values_to = "obs_values"
+  )
+str(gap_long)
+
+
+

OUTPUT +

+
tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ continent   : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
+ $ country     : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
+ $ obstype_year: chr [1:5112] "gdpPercap_1952" "gdpPercap_1957" "gdpPercap_1962" "gdpPercap_1967" ...
+ $ obs_values  : num [1:5112] 2449 3014 2551 3247 4183 ...
+
+

That may seem trivial with this particular data frame, but sometimes +you have 1 ID variable and 40 observation variables with irregular +variable names. The flexibility is a huge time saver!
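To make the point about irregular names concrete, here is a minimal sketch on a made-up data frame. The object `toy_wide` and its column names are assumptions for illustration only and are not part of the gapminder data.

```r
library(tidyr)

# Hypothetical toy data: one ID column and irregularly named measurements
toy_wide <- tibble::tibble(
  site      = c("A", "B"),
  temp.C    = c(21.5, 19.0),
  Rain_mm   = c(3.2, 0.0),
  windSpeed = c(4.1, 6.3)
)

# Name the single ID column with `-` instead of listing every measurement
toy_long <- toy_wide %>%
  pivot_longer(cols = -site, names_to = "measurement", values_to = "value")

toy_long
```

No shared prefix exists for starts_with() to match here, so excluding the ID column is the shorter way to say "pivot everything else".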

+

Now obstype_year actually contains 2 pieces of information: the observation type (pop, lifeExp, or gdpPercap) and the year. We can use the separate() function to split the character strings into multiple variables.

+
+

R +

+
+gap_long <- gap_long %>% separate(obstype_year, into = c('obs_type', 'year'), sep = "_")
+gap_long$year <- as.integer(gap_long$year)
+
+
+
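As an alternative sketch, and assuming a reasonably recent version of tidyr, the split can also be done inside pivot_longer() itself with the names_sep and names_transform arguments. The toy data frame below is made up for illustration; only the column-naming pattern mirrors gap_wide.

```r
library(tidyr)

# Hypothetical miniature of the wide layout
toy_wide <- tibble::tibble(
  country      = c("X", "Y"),
  pop_1952     = c(100, 200),
  pop_1957     = c(110, 210),
  lifeExp_1952 = c(50.1, 60.2),
  lifeExp_1957 = c(51.3, 61.0)
)

# Split the old column names at "_" while pivoting, and store year as integer
toy_long <- toy_wide %>%
  pivot_longer(
    cols = -country,
    names_to = c("obs_type", "year"),
    names_sep = "_",
    names_transform = list(year = as.integer),
    values_to = "obs_values"
  )

str(toy_long)
```

This gives the same layout as pivoting first and calling separate() afterwards; which form reads better is largely a matter of taste.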
+ +
+
+

Challenge 2 +

+
+

Using gap_long, calculate the mean life expectancy, population, and gdpPercap for each continent. Hint: use the group_by() and summarize() functions we learned in the dplyr lesson.

+
+
+
+
+
+ +
+
+
+

R +

+
+gap_long %>% group_by(continent, obs_type) %>%
+   summarize(means=mean(obs_values))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
+

OUTPUT +

+
# A tibble: 15 × 3
+# Groups:   continent [5]
+   continent obs_type       means
+   <chr>     <chr>          <dbl>
+ 1 Africa    gdpPercap     2194. 
+ 2 Africa    lifeExp         48.9
+ 3 Africa    pop        9916003. 
+ 4 Americas  gdpPercap     7136. 
+ 5 Americas  lifeExp         64.7
+ 6 Americas  pop       24504795. 
+ 7 Asia      gdpPercap     7902. 
+ 8 Asia      lifeExp         60.1
+ 9 Asia      pop       77038722. 
+10 Europe    gdpPercap    14469. 
+11 Europe    lifeExp         71.9
+12 Europe    pop       17169765. 
+13 Oceania   gdpPercap    18622. 
+14 Oceania   lifeExp         74.3
+15 Oceania   pop        8874672. 
+
+
+
+
+
+
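The message printed in the Challenge 2 solution mentions the `.groups` argument. A minimal sketch of using it, on a made-up data frame (`toy_long` and its values are assumptions for illustration), could look like this:

```r
library(dplyr)

# Hypothetical long-format data
toy_long <- tibble::tibble(
  continent  = c("A", "A", "A", "A", "B", "B"),
  obs_type   = c("pop", "pop", "lifeExp", "lifeExp", "pop", "lifeExp"),
  obs_values = c(10, 20, 50, 60, 30, 70)
)

# .groups = "drop" returns an ungrouped result and silences the message
toy_means <- toy_long %>%
  group_by(continent, obs_type) %>%
  summarize(means = mean(obs_values), .groups = "drop")

toy_means
```

Dropping the grouping explicitly avoids surprises when the summary is piped into further dplyr verbs.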

From long to intermediate format with pivot_wider() +

+

It is always good to check your work. So, let’s use the second pivot function, pivot_wider(), to ‘widen’ our observation variables back out. pivot_wider() is the opposite of pivot_longer(), making a dataset wider by increasing the number of columns and decreasing the number of rows. We can use pivot_wider() to pivot or reshape our gap_long to the original intermediate format or the widest format. Let’s start with the intermediate format.

+

The pivot_wider() function takes names_from +and values_from arguments.

+

To names_from we supply the column name whose contents +will be pivoted into new output columns in the widened data frame. The +corresponding values will be added from the column named in the +values_from argument.
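A small sketch on made-up data shows what the two arguments do. The object `toy_long` is an assumption for illustration, small enough that the whole result can be read at a glance.

```r
library(tidyr)

# Hypothetical long-format data: one row per country and observation type
toy_long <- tibble::tibble(
  country    = c("X", "X", "Y", "Y"),
  obs_type   = c("pop", "lifeExp", "pop", "lifeExp"),
  obs_values = c(100, 50.1, 200, 60.2)
)

# Each distinct value of obs_type becomes a column, filled from obs_values
toy_wider <- toy_long %>%
  pivot_wider(names_from = obs_type, values_from = obs_values)

toy_wider
```

The four rows collapse to one row per country, with a `pop` and a `lifeExp` column.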

+
+

R +

+
+gap_normal <- gap_long %>%
+  pivot_wider(names_from = obs_type, values_from = obs_values)
+dim(gap_normal)
+
+
+

OUTPUT +

+
[1] 1704    6
+
+
+

R +

+
+dim(gapminder)
+
+
+

OUTPUT +

+
[1] 1704    6
+
+
+

R +

+
+names(gap_normal)
+
+
+

OUTPUT +

+
[1] "continent" "country"   "year"      "gdpPercap" "lifeExp"   "pop"      
+
+
+

R +

+
+names(gapminder)
+
+
+

OUTPUT +

+
[1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"
+
+

Now we’ve got an intermediate data frame gap_normal with +the same dimensions as the original gapminder, but the +order of the variables is different. Let’s fix that before checking if +they are all.equal().

+
+

R +

+
+gap_normal <- gap_normal[, names(gapminder)]
+all.equal(gap_normal, gapminder)
+
+
+

OUTPUT +

+
[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                
+[3] "Component \"country\": 1704 string mismatches"                                         
+[4] "Component \"pop\": Mean relative difference: 1.634504"                                 
+[5] "Component \"continent\": 1212 string mismatches"                                       
+[6] "Component \"lifeExp\": Mean relative difference: 0.203822"                             
+[7] "Component \"gdpPercap\": Mean relative difference: 1.162302"                           
+
+
+

R +

+
+head(gap_normal)
+
+
+

OUTPUT +

+
# A tibble: 6 × 6
+  country  year      pop continent lifeExp gdpPercap
+  <chr>   <int>    <dbl> <chr>       <dbl>     <dbl>
+1 Algeria  1952  9279525 Africa       43.1     2449.
+2 Algeria  1957 10270856 Africa       45.7     3014.
+3 Algeria  1962 11000948 Africa       48.3     2551.
+4 Algeria  1967 12760499 Africa       51.4     3247.
+5 Algeria  1972 14760787 Africa       54.5     4183.
+6 Algeria  1977 17152804 Africa       58.0     4910.
+
+
+

R +

+
+head(gapminder)
+
+
+

OUTPUT +

+
      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134
+
+

We’re almost there. The original was sorted by country, then year.

+
+

R +

+
+gap_normal <- gap_normal %>% arrange(country, year)
+all.equal(gap_normal, gapminder)
+
+
+

OUTPUT +

+
[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                
+
+

That’s great! We’ve gone from the longest format back to the +intermediate and we didn’t introduce any errors in our code.
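The two lines that all.equal() still reports concern only the class attribute: gap_normal is a tibble (three classes) while gapminder is a plain data frame (one class). A minimal sketch of that difference, using made-up objects `df` and `tb` rather than the lesson data:

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

class(df)   # "data.frame"
class(tb)   # "tbl_df" "tbl" "data.frame"

# Comparing like with like removes the class difference
all.equal(as.data.frame(tb), df)
```

Converting one side with as.data.frame() before comparing is one way to check that only the class, and not the data, differs.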

+

Now let’s convert the long format all the way back to the wide. In the wide format, we will keep country and continent as ID variables and pivot the observations across the 3 metrics (pop, lifeExp, gdpPercap) and time (year). First we need to create appropriate labels for all our new variables (time*metric combinations) and we also need to unify our ID variables to simplify the process of defining gap_wide.

+
+

R +

+
+gap_temp <- gap_long %>% unite(var_ID, continent, country, sep = "_")
+str(gap_temp)
+
+
+

OUTPUT +

+
tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ var_ID    : chr [1:5112] "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" ...
+ $ obs_type  : chr [1:5112] "gdpPercap" "gdpPercap" "gdpPercap" "gdpPercap" ...
+ $ year      : int [1:5112] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ obs_values: num [1:5112] 2449 3014 2551 3247 4183 ...
+
+
+

R +

+
+gap_temp <- gap_long %>%
+    unite(ID_var, continent, country, sep = "_") %>%
+    unite(var_names, obs_type, year, sep = "_")
+str(gap_temp)
+
+
+

OUTPUT +

+
tibble [5,112 × 3] (S3: tbl_df/tbl/data.frame)
+ $ ID_var    : chr [1:5112] "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" ...
+ $ var_names : chr [1:5112] "gdpPercap_1952" "gdpPercap_1957" "gdpPercap_1962" "gdpPercap_1967" ...
+ $ obs_values: num [1:5112] 2449 3014 2551 3247 4183 ...
+
+

Using unite() we now have a single ID variable which is a combination of continent and country, and we have defined variable names. We’re now ready to pipe in pivot_wider().

+
+

R +

+
+gap_wide_new <- gap_long %>%
+  unite(ID_var, continent, country, sep = "_") %>%
+  unite(var_names, obs_type, year, sep = "_") %>%
+  pivot_wider(names_from = var_names, values_from = obs_values)
+str(gap_wide_new)
+
+
+

OUTPUT +

+
tibble [142 × 37] (S3: tbl_df/tbl/data.frame)
+ $ ID_var        : chr [1:142] "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
+ $ gdpPercap_1952: num [1:142] 2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num [1:142] 3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num [1:142] 2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num [1:142] 3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num [1:142] 4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num [1:142] 4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num [1:142] 5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num [1:142] 5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num [1:142] 5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num [1:142] 4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num [1:142] 5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num [1:142] 6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num [1:142] 43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num [1:142] 45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num [1:142] 48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num [1:142] 51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num [1:142] 54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num [1:142] 58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num [1:142] 61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num [1:142] 65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num [1:142] 67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num [1:142] 69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num [1:142] 71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num [1:142] 72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num [1:142] 9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num [1:142] 10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num [1:142] 11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num [1:142] 12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num [1:142] 14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num [1:142] 17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num [1:142] 20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num [1:142] 23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num [1:142] 26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num [1:142] 29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : num [1:142] 31287142 10866106 7026113 1630347 12251209 ...
+ $ pop_2007      : num [1:142] 33333216 12420476 8078314 1639131 14326203 ...
+
+
+
+ +
+
+

Challenge 3 +

+
+

Take this one step further and create a gap_ludicrously_wide format data frame by pivoting over countries, year and the 3 metrics. Hint: this new data frame should only have 5 rows.

+
+
+
+
+
+ +
+
+
+

R +

+
+gap_ludicrously_wide <- gap_long %>%
+   unite(var_names, obs_type, year, country, sep = "_") %>%
+   pivot_wider(names_from = var_names, values_from = obs_values)
+
+
+
+
+
+

Now we have a great ‘wide’ format data frame, but the ID_var could be more usable. Let’s separate it into 2 variables with separate().

+
+

R +

+
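+# Option 1: split ID_var of the wide data frame we already built
+gap_wide_betterID <- separate(gap_wide_new, ID_var, c("continent", "country"), sep="_")
+# Option 2: the same result in a single pipeline starting from gap_long
+gap_wide_betterID <- gap_long %>%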
+    unite(ID_var, continent, country, sep = "_") %>%
+    unite(var_names, obs_type, year, sep = "_") %>%
+    pivot_wider(names_from = var_names, values_from = obs_values) %>%
+    separate(ID_var, c("continent","country"), sep = "_")
+str(gap_wide_betterID)
+
+
+

OUTPUT +

+
tibble [142 × 38] (S3: tbl_df/tbl/data.frame)
+ $ continent     : chr [1:142] "Africa" "Africa" "Africa" "Africa" ...
+ $ country       : chr [1:142] "Algeria" "Angola" "Benin" "Botswana" ...
+ $ gdpPercap_1952: num [1:142] 2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num [1:142] 3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num [1:142] 2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num [1:142] 3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num [1:142] 4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num [1:142] 4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num [1:142] 5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num [1:142] 5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num [1:142] 5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num [1:142] 4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num [1:142] 5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num [1:142] 6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num [1:142] 43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num [1:142] 45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num [1:142] 48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num [1:142] 51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num [1:142] 54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num [1:142] 58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num [1:142] 61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num [1:142] 65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num [1:142] 67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num [1:142] 69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num [1:142] 71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num [1:142] 72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num [1:142] 9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num [1:142] 10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num [1:142] 11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num [1:142] 12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num [1:142] 14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num [1:142] 17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num [1:142] 20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num [1:142] 23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num [1:142] 26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num [1:142] 29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : num [1:142] 31287142 10866106 7026113 1630347 12251209 ...
+ $ pop_2007      : num [1:142] 33333216 12420476 8078314 1639131 14326203 ...
+
+
+

R +

+
+all.equal(gap_wide, gap_wide_betterID)
+
+
+

OUTPUT +

+
[1] "Attributes: < Component \"class\": Lengths (1, 3) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                
+
+

There and back again!

+

Other great resources +

+
+
+ +
+
+

Keypoints +

+
+
  • Use the tidyr package to change the layout of data frames.
  • Use pivot_longer() to go from wide to longer layout.
  • Use pivot_wider() to go from long to wider layout.
+
+
+
+
+ + +
+
+
+ +
+
+ + diff --git a/15-knitr-markdown.html b/15-knitr-markdown.html new file mode 100644 index 000000000..b8c0f399d --- /dev/null +++ b/15-knitr-markdown.html @@ -0,0 +1,939 @@ + +R for Reproducible Scientific Analysis: Producing Reports With knitr +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Producing Reports With knitr

+

Last updated on 2023-10-26

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I integrate software and reports?
+
+
+
+
+
+

Objectives

+
  • Understand the value of writing reproducible reports
  • Learn how to recognise and compile the basic components of an R Markdown file
  • Become familiar with R code chunks, and understand their purpose, structure and options
  • Demonstrate the use of inline chunks for weaving R outputs into text blocks, for example when discussing the results of some calculations
  • Be aware of alternative output formats to which an R Markdown file can be exported
+
+
+
+
+

Data analysis reports +

+

Data analysts tend to write a lot of reports, describing their +analyses and results, for their collaborators or to document their work +for future reference.

+

Many new users begin by first writing a single R script containing +all of their work, and then share the analysis by emailing the script +and various graphs as attachments. But this can be cumbersome, requiring +a lengthy discussion to explain which attachment was which result.

+

Writing formal reports with Word or LaTeX can simplify this +process by incorporating both the analysis report and output graphs into +a single document. But tweaking formatting to make figures look correct +and fixing obnoxious page breaks can be tedious and lead to a lengthy +“whack-a-mole” game of fixing new mistakes resulting from a single +formatting change.

+

Creating a report as a web page (which is an html file) using R Markdown makes things easier. The report can be one long stream, so tall figures that wouldn’t ordinarily fit on one page can be kept at full size and remain easy to read, since the reader can simply keep scrolling. Additionally, the formatting of an R Markdown document is simple and easy to modify, allowing you to spend more time on your analyses instead of writing reports.

+

Literate programming +

+

Ideally, such analysis reports are reproducible documents: +If an error is discovered, or if some additional subjects are added to +the data, you can just re-compile the report and get the new or +corrected results rather than having to reconstruct figures, paste them +into a Word document, and hand-edit various detailed results.

+

The key R package here is knitr. It allows you +to create a document that is a mixture of text and chunks of code. When +the document is processed by knitr, chunks of code will be +executed, and graphs or other results will be inserted into the final +document.

+

This sort of idea has been called “literate programming”.

+

knitr allows you to mix basically any type of text with +code from different programming languages, but we recommend that you use +R Markdown, which mixes Markdown with R. Markdown is a light-weight +mark-up language for creating web pages.

+

Creating an R Markdown file +

+

Within RStudio, click File → New File → R Markdown and you’ll get a +dialog box like this:

+
Screenshot of the New R Markdown file dialogue box in RStudio

You can stick with the default (HTML output), but give it a +title.

+

Basic components of R Markdown +

+

The initial chunk of text (header) contains instructions for R to +specify what kind of document will be created, and the options chosen. +You can use the header to give your document a title, author, date, and +tell it what type of output you want to produce. In this case, we’re +creating an html document.

+
---
+title: "Initial R Markdown document"
+author: "Karl Broman"
+date: "April 23, 2015"
+output: html_document
+---
+

You can delete any of those fields if you don’t want them included. +The double-quotes aren’t strictly necessary in this case. +They’re mostly needed if you want to include a colon in the title.

+

RStudio creates the document with some example text to get you +started. Note below that there are chunks like

+
+```{r}
+summary(cars)
+```
+
+

These are chunks of R code that will be executed by +knitr and replaced by their results. More on this +later.

+

Markdown +

+

Markdown is a system for writing web pages by marking up the text +much as you would in an email rather than writing html code. The +marked-up text gets converted to html, replacing the marks with +the proper html code.

+

For now, let’s delete all of the stuff that’s there and write a bit +of markdown.

+

You make things bold using two asterisks, like this: +**bold**, and you make things italics by using +underscores, like this: _italics_.

+

You can make a bulleted list by writing a list with hyphens or asterisks, leaving a blank line between the list and the preceding text, like this:

+
A list:
+
+* bold with double-asterisks
+* italics with underscores
+* code-type font with backticks
+

or like this:

+
A second list:
+
+- bold with double-asterisks
+- italics with underscores
+- code-type font with backticks
+

Each will appear as:

+
  • bold with double-asterisks
  • italics with underscores
  • code-type font with backticks

You can use whatever method you prefer, but be consistent. +This maintains the readability of your code.

+

You can make a numbered list by just using numbers. You can even use +the same number over and over if you want:

+
1. bold with double-asterisks
+1. italics with underscores
+1. code-type font with backticks
+

This will appear as:

+
  1. bold with double-asterisks
  2. italics with underscores
  3. code-type font with backticks

You can make section headers of different sizes by initiating a line +with some number of # symbols:

+
# Title
+## Main section
+### Sub-section
+#### Sub-sub section
+

You compile the R Markdown document to an html webpage by +clicking the “Knit” button in the upper-left.

+
+
+ +
+
+

Challenge 1 +

+
+

Create a new R Markdown document. Delete all of the R code chunks and +write a bit of Markdown (some sections, some italicized text, and an +itemized list).

+

Convert the document to a webpage.

+
+
+
+
+
+ +
+
+

In RStudio, select File > New file > R Markdown…

+

Delete the placeholder text and add the following:

+
# Introduction
+
+## Background on Data
+
+This report uses the *gapminder* dataset, which has columns that include:
+
+* country
+* continent
+* year
+* lifeExp
+* pop
+* gdpPercap
+
+## Background on Methods
+
+

Then click the ‘Knit’ button on the toolbar to generate an html +document (webpage).

+
+
+
+
+

A bit more Markdown +

+

You can make a hyperlink like this: +[Carpentries Home Page](https://carpentries.org/).

+

You can include an image file like this: +![The Carpentries Logo](https://carpentries.org/assets/img/TheCarpentries.svg)

+

You can do subscripts (e.g., F₂) with F~2~ and superscripts (e.g., F²) with F^2^.

+

If you know how to write equations in LaTeX, you can use +$ $ and $$ $$ to insert math equations, like +$E = mc^2$ and

+
$$y = \mu + \sum_{i=1}^p \beta_i x_i + \epsilon$$
+

You can review Markdown syntax by navigating to the “Markdown Quick +Reference” under the “Help” field in the toolbar at the top of +RStudio.

+

R code chunks +

+

The real power of Markdown comes from mixing markdown with chunks of code. This is R Markdown. When processed, the R code will be executed; if it produces figures, the figures will be inserted in the final document.

+

The main code chunks look like this:

+
+```{r load_data}
+gapminder
+```
+

That is, you place a chunk of R code between ```{r +chunk_name} and ```. You should give each chunk a +unique name, as they will help you to fix errors and, if any graphs are +produced, the file names are based on the name of the code chunk that +produced them. You can create code chunks quickly in RStudio using the +shortcuts Ctrl+Alt+I on Windows and +Linux, or Cmd+Option+I on Mac.

+
+
+ +
+
+

Challenge 2 +

+
+

Add code chunks to:

+
  • Load the ggplot2 package
  • Read the gapminder data
  • Create a plot
+
+
+
+
+ +
+
+
+```{r load-ggplot2}
+library("ggplot2")
+```
+
+
+```{r read-gapminder-data}
+gapminder
+```
+
+```{r make-plot}
+plot(lifeExp ~ year, data = gapminder)
+```
+
+
+
+
+
+
+

How things get compiled +

+

When you press the “Knit” button, the R Markdown document is +processed by knitr +and a plain Markdown document is produced (as well as, potentially, a +set of figure files): the R code is executed and replaced by both the +input and the output; if figures are produced, links to those figures +are included.

+

The Markdown and figure documents are then processed by the tool pandoc, which converts the +Markdown file into an html file, with the figures embedded.
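The two-step process can also be triggered from the R console rather than the “Knit” button. The following is a minimal sketch, assuming the rmarkdown package and pandoc are installed (RStudio normally bundles pandoc); the file name and document content are made up for illustration.

```r
library(rmarkdown)

# Build a tiny R Markdown file in a temporary directory (hypothetical content)
fence <- strrep("`", 3)  # three backticks, assembled here for readability
rmd_lines <- c(
  "---",
  "title: \"Compile demo\"",
  "output: html_document",
  "---",
  "",
  "Some text before the chunk.",
  "",
  paste0(fence, "{r summary-cars}"),
  "summary(cars)",
  fence
)
rmd_file <- file.path(tempdir(), "compile-demo.Rmd")
writeLines(rmd_lines, rmd_file)

# render() runs knitr first and then pandoc, and returns the output file path
html_file <- render(rmd_file, quiet = TRUE)
file.exists(html_file)
```

Calling render() from a script can be handy when a report has to be rebuilt automatically, for example after the data are updated.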

+

Chunk options +

+

There are a variety of options to affect how the code chunks are +treated. Here are some examples:

+
  • Use echo=FALSE to avoid having the code itself shown.
  • Use results="hide" to avoid having any results printed.
  • Use eval=FALSE to have the code shown but not evaluated.
  • Use warning=FALSE and message=FALSE to hide any warnings or messages produced.
  • Use fig.height and fig.width to control the size of the figures produced (in inches).

So you might write:

+
+```{r load_libraries, echo=FALSE, message=FALSE}
+library("dplyr")
+library("ggplot2")
+```
+
+

Often there will be particular options that you’ll want to use +repeatedly; for this, you can set global chunk options, like +so:

+
+```{r global_options, echo=FALSE}
+knitr::opts_chunk$set(fig.path="Figs/", message=FALSE, warning=FALSE,
+                      echo=FALSE, results="hide", fig.width=11)
+```
+
+

The fig.path option defines where the figures will be +saved. The / here is really important; without it, the +figures would be saved in the standard place but just with names that +begin with Figs.

+

If you have multiple R Markdown files in a common directory, you +might want to use fig.path to define separate prefixes for +the figure file names, like fig.path="Figs/cleaning-" and +fig.path="Figs/analysis-".
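To see why the trailing `/` matters, it can help to look at how a figure file name is put together. The sketch below sets a prefix with knitr and then imitates the naming with paste0(); the chunk label `make-plot` and the `.png` extension are assumptions for illustration, and the paste0() calls only approximate what knitr does internally.

```r
library(knitr)

# Set a per-document prefix for figure files (prefix name is illustrative)
opts_chunk$set(fig.path = "Figs/cleaning-")
opts_chunk$get("fig.path")

# Rough illustration: prefix + chunk label + running figure number
fig_file <- paste0(opts_chunk$get("fig.path"), "make-plot", "-1.png")
fig_file

# Without the trailing slash, "Figs" is no longer a folder,
# it simply becomes the start of the file name
no_slash <- paste0("Figs", "make-plot", "-1.png")
no_slash
```

With the prefix `Figs/cleaning-` every figure from this document lands in the `Figs` folder and can be told apart from figures produced by other documents.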

+
+
+ +
+
+

Challenge 3 +

+
+

Use chunk options to control the size of a figure and to hide the +code.

+
+
+
+
+
+ +
+
+
+```{r echo = FALSE, fig.width = 3}
+plot(faithful)
+```
+
+
+
+
+
+

You can review all of the R chunk options by navigating +to the “R Markdown Cheat Sheet” under the “Cheatsheets” section of the +“Help” field in the toolbar at the top of RStudio.

+

Inline R code +

+

You can make every number in your report reproducible. Use +`r and ` for an in-line code chunk, like so: +`r round(some_value, 2)`. The code will be executed and +replaced with the value of the result.

+

Don’t let these in-line chunks get split across lines.

+

Perhaps precede the paragraph with a larger code chunk that does +calculations and defines variables, with include=FALSE for +that larger chunk (which is the same as echo=FALSE and +results="hide").

+

Rounding can produce differences in output in such situations. You +may want 2.0, but round(2.03, 1) will give +just 2.

+

The myround +function in the R/broman +package handles this.
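If you would rather not depend on an extra package, base R can produce the same fixed number of decimals. This is a minimal sketch of two common options, not a description of how myround itself is implemented; `x` is just an example value.

```r
x <- 2.03

round(x, 1)                       # prints 2: the trailing zero is dropped
sprintf("%.1f", x)                # "2.0", a character string with one decimal
format(round(x, 1), nsmall = 1)   # "2.0" as well
```

Both alternatives return character strings, which is fine for in-line code because the value is only being inserted into text.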

+
+
+ +
+
+

Challenge 4 +

+
+

Try out a bit of in-line R code.

+
+
+
+
+
+ +
+
+

Here’s some inline code to determine that 2 + 2 = 4.

+
+
+
+
+

Other output options +

+

You can also convert R Markdown to a PDF or a Word document. Click +the little triangle next to the “Knit” button to get a drop-down menu. +Or you could put pdf_document or word_document +in the initial header of the file.

+
+
+ +
+
+

Tip: Creating PDF documents +

+
+

Creating .pdf documents may require installation of some extra +software. The R package tinytex provides some tools to help +make this process easier for R users. With tinytex +installed, run tinytex::install_tinytex() to install the +required software (you’ll only need to do this once) and then when you +knit to pdf tinytex will automatically detect and install +any additional LaTeX packages that are needed to produce the pdf +document. Visit the tinytex +website for more information.

+
+
+
+
+
+ +
+
+

Tip: Visual markdown editing in RStudio +

+
+

RStudio versions 1.4 and later include visual markdown editing mode. +In visual editing mode, markdown expressions (like +**bold words**) are transformed to the formatted appearance +(bold words) as you type. This mode also includes a +toolbar at the top with basic formatting buttons, similar to what you +might see in common word processing software programs. You can turn +visual editing on and off by pressing the button in the top right corner of your +R Markdown document.

+
+
+
+

Resources +

+
+
+ +
+
+

Keypoints +

+
+
  • Mix reporting written in R Markdown with software written in R.
  • Specify chunk options to control formatting.
  • Use knitr to convert these documents into PDF and other formats.
+
+
+
+
+ + +
+
+
+ +
+
+ + diff --git a/16-wrap-up.html b/16-wrap-up.html new file mode 100644 index 000000000..9bed07855 --- /dev/null +++ b/16-wrap-up.html @@ -0,0 +1,587 @@ + +R for Reproducible Scientific Analysis: Writing Good Software +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Writing Good Software

+

Last updated on 2023-10-26

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I write software that other people can use?
+
+
+
+
+
+

Objectives

+
  • Describe best practices for writing R and explain the justification +for each.
+
+
+
+
+

Structure your project folder +

+

Keep your project folder structured, organized and tidy, by creating +subfolders for your code files, manuals, data, binaries, output plots, +etc. It can be done completely manually, or with the help of RStudio’s +New Project functionality, or a designated package, such as +ProjectTemplate.
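Doing it "completely manually" can itself be scripted. Here is a minimal base R sketch that creates a project skeleton; the folder names and the use of a temporary directory are assumptions for illustration, so adapt them to your own project.

```r
# Create a project skeleton under a temporary directory (names are illustrative)
project_root <- file.path(tempdir(), "my_project")
subfolders <- c("data", "src", "doc", "bin", "results")

for (folder in subfolders) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

list.files(project_root)
```

Keeping the list of folders in one vector makes it easy to reuse the same layout for every new project.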

+
+
+ +
+
+

Tip: ProjectTemplate - a possible +solution +

+
+

One way to automate the management of projects is to install the +third-party package, ProjectTemplate. This package will set +up an ideal directory structure for project management. This is very +useful as it enables you to have your analysis pipeline/workflow +organised and structured. Together with the default RStudio project +functionality and Git you will be able to keep track of your work as +well as be able to share your work with collaborators.

+
  1. Install ProjectTemplate.
  2. Load the library
  3. Initialise the project:
+

R +

+
+install.packages("ProjectTemplate")
+library("ProjectTemplate")
+create.project("../my_project_2", merge.strategy = "allow.non.conflict")
+
+

For more information on ProjectTemplate and its functionality, visit the ProjectTemplate home page.

+
+
+
+

Make code readable +

+

The most important part of writing code is making it readable and understandable. You want someone else to be able to pick up your code and be able to understand what it does: more often than not this someone will be you 6 months down the line, who will otherwise be cursing your past self.

+

Documentation: tell us what and why, not how +

+

When you first start out, your comments will often describe what a +command does, since you’re still learning yourself and it can help to +clarify concepts and remind you later. However, these comments aren’t +particularly useful later on when you don’t remember what problem your +code is trying to solve. Try to also include comments that tell you +why you’re solving a problem, and what problem that +is. The how can come after that: it’s an implementation detail +you ideally shouldn’t have to worry about.

+

Keep your code modular +

+

Our recommendation is that you should separate your functions from +your analysis scripts, and store them in a separate file that you +source when you open the R session in your project. This +approach is nice because it leaves you with an uncluttered analysis +script, and a repository of useful functions that can be loaded into any +analysis script in your project. It also lets you group related +functions together easily.
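As a minimal sketch of this workflow, the example below writes one function to a helper file and then loads it with source(). The file name functions.R and the function celsius_to_kelvin() are illustrative assumptions, and a temporary directory stands in for your project folder so that the example runs anywhere.

```r
# Hypothetical helper file; in a real project it might live at "src/functions.R".
# A temporary directory is used here so the sketch runs anywhere.
functions_file <- file.path(tempdir(), "functions.R")

# Write one small function into the helper file.
writeLines(c(
  "celsius_to_kelvin <- function(temp_c) {",
  "  temp_c + 273.15",
  "}"
), functions_file)

# In the analysis script, load every function defined in the helper file.
source(functions_file)

# The function is now available to the analysis code.
boiling_point_k <- celsius_to_kelvin(100)
boiling_point_k
```

The analysis script stays short because it only calls the functions; their definitions live in one place and can be sourced by any other script in the project.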

+

Break down problems into bite-size pieces +

+

When you first start out, problem solving and function writing can be +daunting tasks, and hard to separate from code inexperience. Try to +break down your problem into digestible chunks and worry about the +implementation details later: keep breaking down the problem into +smaller and smaller functions until you reach a point where you can code +a solution, and build back up from there.
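One way to picture this, assuming the task is to standardize a numeric vector, is to split it into two small steps and then combine them. The function names below are made up for illustration.

```r
# Step 1: shift values so that their mean is zero.
center <- function(v) {
  v - mean(v)
}

# Step 2: rescale values so that their standard deviation is one.
scale_to_unit_sd <- function(v) {
  v / sd(v)
}

# Build back up: the full solution just combines the small pieces.
standardize <- function(v) {
  scale_to_unit_sd(center(v))
}

standardized <- standardize(c(2, 4, 6))
standardized
```

Each piece is short enough to check by eye, and the combined function reads almost like a description of the problem.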

+

Know that your code is doing the right thing +

+

Make sure to test your functions!
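A lightweight way to do this in base R, shown here as a sketch rather than a full testing setup, is to call stopifnot() with inputs whose answers you already know. The function kelvin_to_celsius() is an illustrative assumption.

```r
# Function under test (hypothetical example).
kelvin_to_celsius <- function(temp_k) {
  temp_k - 273.15
}

# Each check stops with an error if the function returns something unexpected.
# all.equal() is used because the results are floating point numbers.
stopifnot(isTRUE(all.equal(kelvin_to_celsius(273.15), 0)))
stopifnot(isTRUE(all.equal(kelvin_to_celsius(373.15), 100)))

# Record that all checks ran without stopping.
tests_passed <- TRUE
```

Because the checks live in a script, they can be re-run every time the function changes.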

+

Don’t repeat yourself +

+

Functions enable easy reuse within a project. If you see blocks of +similar lines of code through your project, those are usually candidates +for being moved into functions.

+

If your calculations are performed through a series of functions, then the project becomes more modular and easier to change. This is especially the case for functions for which a particular input always gives a particular output.
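As an illustration, suppose the same rescaling arithmetic appears for several variables. A sketch of moving it into one function (the name rescale01() and the data are invented for this example) could look like this:

```r
# Rescale a numeric vector so that its smallest value becomes 0 and its largest 1.
rescale01 <- function(v) {
  rng <- range(v, na.rm = TRUE)
  (v - rng[1]) / (rng[2] - rng[1])
}

# Made-up example data.
heights <- c(150, 160, 170)
weights <- c(50, 75, 100)

# The same function replaces two copies of the same arithmetic.
heights_scaled <- rescale01(heights)
weights_scaled <- rescale01(weights)
heights_scaled
```

If the rescaling rule ever changes, it only needs to be edited in one place.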

+

Remember to be stylish +

+

Apply consistent style to your code.

+
+
+ +
+
+

Keypoints +

+
+
  • Keep your project folder structured, organized and tidy.
  • +
  • Document what and why, not how.
  • +
  • Break programs into short single-purpose functions.
  • +
  • Write re-runnable tests.
  • +
  • Don’t repeat yourself.
  • +
  • Be consistent in naming, indentation, and other aspects of +style.
  • +
+
+
+
+
+ + +
+
+
+ +
+
+ + diff --git a/404.html b/404.html new file mode 100644 index 000000000..2c0bde5ad --- /dev/null +++ b/404.html @@ -0,0 +1,451 @@ + +R for Reproducible Scientific Analysis: Page not found +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Page not found

+ +

Our apologies! +

+

We cannot seem to find the page you are looking for. Here are some +tips that may help:

+
  1. try going back to the previous page or
  2. navigate to any other page using the navigation bar on the left.
  3. if the URL ends with /index.html, try removing that.
  4. head over to the home page of this lesson

If you came here from a link in this lesson, please contact the +lesson maintainers using the links at the foot of this page.

+
+
+ + +
+
+
+ +
+
+ + diff --git a/CODE_OF_CONDUCT.html b/CODE_OF_CONDUCT.html new file mode 100644 index 000000000..f2d43ce19 --- /dev/null +++ b/CODE_OF_CONDUCT.html @@ -0,0 +1,450 @@ + +R for Reproducible Scientific Analysis: Contributor Code of Conduct +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Contributor Code of Conduct

+

Last updated on 2023-10-26

+ + + +
+ +
+ + + +

As contributors and maintainers of this project, we pledge to follow +the The +Carpentries Code of Conduct.

+

Instances of abusive, harassing, or otherwise unacceptable behavior +may be reported by following our reporting +guidelines.

+ + + +
+
+ + +
+
+
+ +
+
+ + diff --git a/LICENSE.html b/LICENSE.html new file mode 100644 index 000000000..fd0be828c --- /dev/null +++ b/LICENSE.html @@ -0,0 +1,501 @@ + +R for Reproducible Scientific Analysis: Licenses +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Licenses

+

Last updated on 2023-10-26

+ + + +
+ +
+ + + +

Instructional Material +

+

All Carpentries (Software Carpentry, Data Carpentry, and Library +Carpentry) instructional material is made available under the Creative Commons +Attribution license. The following is a human-readable summary of +(and not a substitute for) the full legal +text of the CC BY 4.0 license.

+

You are free:

+
  • to Share—copy and redistribute the material in any +medium or format
  • +
  • to Adapt—remix, transform, and build upon the +material
  • +

for any purpose, even commercially.

+

The licensor cannot revoke these freedoms as long as you follow the +license terms.

+

Under the following terms:

+
  • Attribution—You must give appropriate credit +(mentioning that your work is derived from work that is Copyright (c) +The Carpentries and, where practical, linking to https://carpentries.org/), provide a link to the +license, and indicate if changes were made. You may do so in any +reasonable manner, but not in any way that suggests the licensor +endorses you or your use.

  • +
  • No additional restrictions—You may not apply +legal terms or technological measures that legally restrict others from +doing anything the license permits. With the understanding +that:

  • +

Notices:

+
  • You do not have to comply with the license for elements of the +material in the public domain or where your use is permitted by an +applicable exception or limitation.
  • +
  • No warranties are given. The license may not give you all of the +permissions necessary for your intended use. For example, other rights +such as publicity, privacy, or moral rights may limit how you use the +material.
  • +

Software +

+

Except where otherwise noted, the example programs and other software +provided by The Carpentries are made available under the OSI-approved MIT +license.

+

Permission is hereby granted, free of charge, to any person obtaining +a copy of this software and associated documentation files (the +“Software”), to deal in the Software without restriction, including +without limitation the rights to use, copy, modify, merge, publish, +distribute, sublicense, and/or sell copies of the Software, and to +permit persons to whom the Software is furnished to do so, subject to +the following conditions:

+

The above copyright notice and this permission notice shall be +included in all copies or substantial portions of the Software.

+

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, +EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. +IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY +CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, +TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE +SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

+

Trademark +

+

“The Carpentries”, “Software Carpentry”, “Data Carpentry”, and +“Library Carpentry” and their respective logos are registered trademarks +of Community Initiatives.

+
+
+ + +
+
+
+ +
+
+ + diff --git a/aio.html b/aio.html new file mode 100644 index 000000000..ea9c08cd6 --- /dev/null +++ b/aio.html @@ -0,0 +1,12657 @@ + + + + + +R for Reproducible Scientific Analysis: All in One View + + + + + + + + + + + +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + + +
+
+ + +

Content from Introduction to R and RStudio

+
+

Last updated on 2023-10-26

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How to find your way around RStudio?
  • +
  • How to interact with R?
  • +
  • How to manage your environment?
  • +
  • How to install packages?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • Describe the purpose and use of each pane in the RStudio IDE
  • +
  • Locate buttons and options in the RStudio IDE
  • +
  • Define a variable
  • +
  • Assign data to a variable
  • +
  • Manage a workspace in an interactive R session
  • +
  • Use mathematical and comparison operators
  • +
  • Call functions
  • +
  • Manage packages
  • +
+
+
+
+
+
+

Motivation +

+
+

Science is a multi-step process: once you’ve designed an experiment +and collected data, the real fun begins! This lesson will teach you how +to start this process using R and RStudio. We will begin with raw data, +perform exploratory analyses, and learn how to plot results graphically. +This example starts with a dataset from gapminder.org containing population +information for many countries through time. Can you read the data into +R? Can you plot the population for Senegal? Can you calculate the +average income for countries on the continent of Asia? By the end of +these lessons you will be able to do things like plot the populations +for all of these countries in under a minute!

+

Before Starting The Workshop +

+
+

Please ensure you have the latest version of R and RStudio installed +on your machine. This is important, as some packages used in the +workshop may not install correctly (or at all) if R is not up to +date.

+

Introduction to RStudio +

+
+

Welcome to the R portion of the Software Carpentry workshop.

+

Throughout this lesson, we’re going to teach you some of the +fundamentals of the R language as well as some best practices for +organizing code for scientific projects that will make your life +easier.

+

We’ll be using RStudio: a free, open-source R Integrated Development +Environment (IDE). It provides a built-in editor, works on all platforms +(including on servers) and provides many advantages such as integration +with version control and project management.

+

Basic layout

+

When you first open RStudio, you will be greeted by three panels:

+
    +
  • The interactive R console/Terminal (entire left)
  • +
  • Environment/History/Connections (tabbed in upper right)
  • +
  • Files/Plots/Packages/Help/Viewer (tabbed in lower right)
  • +
+
RStudio layout

Once you open files, such as R scripts, an editor panel will also +open in the top left.

+
RStudio layout with .R file open
+
+ +
+
+

R scripts +

+
+

Any commands that you write in the R console can be saved to a file to be re-run later. Files containing R code to be run in this way are called R scripts. R scripts have .R at the end of their names to let you know what they are.

+
+
+
+

Workflow within RStudio +

+
+

There are two main ways one can work within RStudio:

+
    +
  1. Test and play within the interactive R console, then copy code into a .R file to run later.
     • This works well when doing small tests and initially starting off.
     • It quickly becomes laborious.
  2. Start writing in a .R file and use RStudio’s shortcut keys for the Run command to push the current line, selected lines or modified lines to the interactive R console.
     • This is a great way to start; all your code is saved for later.
     • You will be able to run the file you create from within RStudio or using R’s source() function.
+
+
+ +
+
+

Tip: Running segments of your code +

+
+

RStudio offers you great flexibility in running code from within the +editor window. There are buttons, menu choices, and keyboard shortcuts. +To run the current line, you can

+
    +
  1. click on the Run button above the editor panel, or
  2. select “Run Lines” from the “Code” menu, or
  3. hit Ctrl+Return in Windows or Linux or ⌘+Return on OS X. (This shortcut can also be seen by hovering the mouse over the button.)

To run a block of code, select it and then Run. If you have modified a line of code within a block of code you have just run, there is no need to reselect the section and Run; you can use the next button along, Re-run the previous region. This will run the previous code block including the modifications you have made.
+
+
+
+

Introduction to R +

+
+

Much of your time in R will be spent in the R interactive console. +This is where you will run all of your code, and can be a useful +environment to try out ideas before adding them to an R script file. +This console in RStudio is the same as the one you would get if you +typed in R in your command-line environment.

+

The first thing you will see in the R interactive session is a bunch +of information, followed by a “>” and a blinking cursor. In many ways +this is similar to the shell environment you learned about during the +shell lessons: it operates on the same idea of a “Read, evaluate, print +loop”: you type in commands, R tries to execute them, and then returns a +result.

+

Using R as a calculator +

+
+

The simplest thing you could do with R is to do arithmetic:

+
+

R +

+
+1 + 100
+
+
+

OUTPUT +

+
[1] 101
+
+

And R will print out the answer, with a preceding “[1]”. [1] is the +index of the first element of the line being printed in the console. For +more information on indexing vectors, see Episode +6: Subsetting Data.

+

If you type in an incomplete command, R will wait for you to complete it. If you are familiar with Unix Shell’s bash, you may recognize this behavior from bash.

+
+

R +

+
> 1 +
+
+
+

OUTPUT +

+
+
+
+

Any time you hit return and the R session shows a “+” instead of a +“>”, it means it’s waiting for you to complete the command. If you +want to cancel a command you can hit Esc and RStudio will +give you back the “>” prompt.

+
+
+ +
+
+

Tip: Canceling commands +

+
+

If you’re using R from the command line instead of from within +RStudio, you need to use Ctrl+C instead of +Esc to cancel the command. This applies to Mac users as +well!

+

Canceling a command isn’t only useful for killing incomplete +commands: you can also use it to tell R to stop running code (for +example if it’s taking much longer than you expect), or to get rid of +the code you’re currently writing.

+
+
+
+

When using R as a calculator, the order of operations is the same as +you would have learned back in school.

+

From highest to lowest precedence (multiplication and division share one level, as do addition and subtraction):

+
    +
  • Parentheses: (, )
  • Exponents: ^ or **
  • Multiply: *
  • Divide: /
  • Add: +
  • Subtract: -
+
+

R +

+
+3 + 5 * 2
+
+
+

OUTPUT +

+
[1] 13
+
+

Use parentheses to group operations in order to force the order of +evaluation if it differs from the default, or to make clear what you +intend.

+
+

R +

+
+(3 + 5) * 2
+
+
+

OUTPUT +

+
[1] 16
+
+

This can get unwieldy when not needed, but clarifies your intentions. +Remember that others may later read your code.

+
+

R +

+
+(3 + (5 * (2 ^ 2))) # hard to read
+3 + 5 * 2 ^ 2       # clear, if you remember the rules
+3 + 5 * (2 ^ 2)     # if you forget some rules, this might help
+
+

The text after each line of code is called a “comment”. Anything that +follows after the hash (or octothorpe) symbol # is ignored +by R when it executes code.
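Returning to the order of operations, a few cases are easy to get wrong. The short sketch below uses comments to spell out how R groups each expression: operators on the same level are evaluated left to right, while repeated exponents group right to left.

```r
8 / 4 * 2    # same level, left to right: (8 / 4) * 2 = 4
10 - 4 + 1   # same level, left to right: (10 - 4) + 1 = 7
2 ^ 3 ^ 2    # exponents group right to left: 2 ^ (3 ^ 2) = 512
-2 ^ 2       # the exponent binds tighter than the minus sign: -(2 ^ 2) = -4
```

When in doubt, adding parentheses costs nothing and makes the intended grouping explicit.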

+

Really small or large numbers get a scientific notation:

+
+

R +

+
+2/10000
+
+
+

OUTPUT +

+
[1] 2e-04
+
+

The e is shorthand for “multiplied by 10^XX”. So 2e-4 is shorthand for 2 * 10^(-4).

+

You can write numbers in scientific notation too:

+
+

R +

+
+5e3  # Note the lack of minus here
+
+
+

OUTPUT +

+
[1] 5000
+
+

Mathematical functions +

+
+

R has many built-in mathematical functions. To call a function, we can type its name, followed by open and closing parentheses. Functions take arguments as inputs; anything we type inside the parentheses of a function is considered an argument. Depending on the function, the number of arguments can vary from none to multiple. For example:

+
+

R +

+
+getwd() #returns an absolute filepath
+
+

doesn’t require an argument, whereas for the next set of mathematical +functions we will need to supply the function a value in order to +compute the result.

+
+

R +

+
+sin(1)  # trigonometry functions
+
+
+

OUTPUT +

+
[1] 0.841471
+
+
+

R +

+
+log(1)  # natural logarithm
+
+
+

OUTPUT +

+
[1] 0
+
+
+

R +

+
+log10(10) # base-10 logarithm
+
+
+

OUTPUT +

+
[1] 1
+
+
+

R +

+
+exp(0.5) # e^(1/2)
+
+
+

OUTPUT +

+
[1] 1.648721
+
+

Don’t worry about trying to remember every function in R. You can +look them up on Google, or if you can remember the start of the +function’s name, use the tab completion in RStudio.

+

This is one advantage that RStudio has over R on its own: it has auto-completion abilities that allow you to more easily look up functions, their arguments, and the values that they take.

+

Typing a ? before the name of a command will open the help page for that command. When using RStudio, this will open the ‘Help’ pane; if using R in the terminal, the help page will open in the terminal’s pager or in your browser, depending on your platform and settings. The help page will include a detailed description of the command and how it works. Scrolling to the bottom of the help page will usually show a collection of code examples which illustrate command usage. We’ll go through an example later.
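As a small sketch of looking things up from code, the example below uses help(), which is the function behind the ? shortcut, and args(), which shows only a function’s argument list. The variable names log_help and log_args are chosen here for illustration.

```r
# Look up the help page for log(); storing the result avoids displaying it here.
# Typing ?log at the console is the usual shorthand for displaying the page.
log_help <- help("log")

# Show only the argument list of log(), without the full help page.
log_args <- args(log)
log_args
```

args() is handy when you remember what a function does but not what its arguments are called.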

+

Comparing things +

+
+

We can also do comparisons in R:

+
+

R +

+
+1 == 1  # equality (note two equals signs, read as "is equal to")
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+

R +

+
+1 != 2  # inequality (read as "is not equal to")
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+

R +

+
+1 < 2  # less than
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+

R +

+
+1 <= 1  # less than or equal to
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+

R +

+
+1 > 0  # greater than
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+

R +

+
+1 >= -9 # greater than or equal to
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+
+ +
+
+

Tip: Comparing Numbers +

+
+

A word of warning about comparing numbers: you should never use +== to compare two numbers unless they are integers (a data +type which can specifically represent only whole numbers).

+

Computers may only represent decimal numbers with a certain degree of +precision, so two numbers which look the same when printed out by R, may +actually have different underlying representations and therefore be +different by a small margin of error (called Machine numeric +tolerance).

+

Instead you should use the all.equal function.

+

Further reading: http://floating-point-gui.de/
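A classic illustration of the problem, given here as a minimal sketch, is that 0.1 + 0.2 is not stored as exactly 0.3. Wrapping all.equal() in isTRUE() gives a plain TRUE or FALSE that is safe to use in conditions.

```r
a <- 0.1 + 0.2
b <- 0.3

# Exact comparison fails because the two stored values differ very slightly.
exactly_equal <- a == b
exactly_equal

# all.equal() compares within a small tolerance.
nearly_equal <- isTRUE(all.equal(a, b))
nearly_equal
```

Here exactly_equal is FALSE while nearly_equal is TRUE, even though both numbers print as 0.3.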

+
+
+
+

Variables and assignment +

+
+

We can store values in variables using the assignment operator +<-, like this:

+
+

R +

+
+x <- 1/40
+
+

Notice that assignment does not print a value. Instead, we stored it +for later in something called a variable. +x now contains the value +0.025:

+
+

R +

+
+x
+
+
+

OUTPUT +

+
[1] 0.025
+
+

More precisely, the stored value is a decimal approximation +of this fraction called a floating point +number.

+

Look for the Environment tab in the top right panel of +RStudio, and you will see that x and its value have +appeared. Our variable x can be used in place of a number +in any calculation that expects a number:

+
+

R +

+
+log(x)
+
+
+

OUTPUT +

+
[1] -3.688879
+
+

Notice also that variables can be reassigned:

+
+

R +

+
+x <- 100
+
+

x used to contain the value 0.025 and now it has the +value 100.

+

Assignment values can contain the variable being assigned to:

+
+

R +

+
+x <- x + 1 #notice how RStudio updates its description of x on the top right tab
+y <- x * 2
+
+

The right hand side of the assignment can be any valid R expression. +The right hand side is fully evaluated before the assignment +occurs.

+

Variable names can contain letters, numbers, underscores and periods but no spaces. They must start with a letter or a period followed by a letter (they cannot start with a number nor an underscore). Variables beginning with a period are hidden variables. Different people use different conventions for long variable names; these include

+
    +
  • periods.between.words
  • +
  • underscores_between_words
  • +
  • camelCaseToSeparateWords
  • +
+

What you use is up to you, but be consistent.

+

It is also possible to use the = operator for +assignment:

+
+

R +

+
+x = 1/40
+
+

But this is much less common among R users. The most important thing +is to be consistent with the operator you use. There +are occasionally places where it is less confusing to use +<- than =, and it is the most common symbol +used in the community. So the recommendation is to use +<-.
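One place where the two operators behave differently is inside a function call. The sketch below uses median() and an invented variable name, demo_values, to show the difference.

```r
# Inside a call, "=" names an argument: here it sets the argument x of median().
m1 <- median(x = c(1, 5, 9))

# Inside a call, "<-" is still an assignment: it first creates demo_values
# in your workspace, and only then passes the values on to median().
m2 <- median(demo_values <- c(1, 5, 9))

demo_values
```

Both calls return the same median, but only the second one leaves a new variable behind, which is rarely what you want.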

+
+
+ +
+
+

Challenge 1 +

+
+

Which of the following are valid R variable names?

+
+

R +

+
min_height
+max.height
+_age
+.mass
+MaxLength
+min-length
+2widths
+celsius2kelvin
+
+
+
+
+
+
+ +
+
+

The following can be used as R variables:

+
+

R +

+
+min_height
+max.height
+MaxLength
+celsius2kelvin
+
+

The following creates a hidden variable:

+
+

R +

+
+.mass
+
+

The following will not be able to be used to create a variable

+
+

R +

+
_age
+min-length
+2widths
+
+
+
+
+
+

Vectorization +

+
+

One final thing to be aware of is that R is vectorized, meaning that variables and functions can have vectors as values. In contrast to physics and mathematics, a vector in R describes an ordered set of values that all share the same data type. For example:

+
+

R +

+
+1:5
+
+
+

OUTPUT +

+
[1] 1 2 3 4 5
+
+
+

R +

+
+2^(1:5)
+
+
+

OUTPUT +

+
[1]  2  4  8 16 32
+
+
+

R +

+
+x <- 1:5
+2^x
+
+
+

OUTPUT +

+
[1]  2  4  8 16 32
+
+

This is incredibly powerful; we will discuss this further in an +upcoming lesson.
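As a brief sketch of what this means in practice, arithmetic and comparisons also work element by element when applied to vectors. The variable names below are illustrative.

```r
x <- 1:5
y <- c(10, 20, 30, 40, 50)

# Arithmetic between two vectors of the same length pairs up their elements.
total <- x + y
total

# Comparisons are vectorized too, giving one TRUE or FALSE per element.
above_two <- x > 2
above_two
```

No loop is needed: the operation is applied to every element at once.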

+

Managing your environment +

+
+

There are a few useful commands you can use to interact with the R +session.

+

ls will list all of the variables and functions stored +in the global environment (your working R session):

+
+

R +

+
+ls()
+
+
+

OUTPUT +

+
[1] "x" "y"
+
+
+
+ +
+
+

Tip: hidden objects +

+
+

Like in the shell, ls will hide any variables or functions starting with a “.” by default. To list all objects, type ls(all.names=TRUE) instead.
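To see the difference without cluttering your own workspace, the sketch below creates a separate, throwaway environment (demo_env, an illustrative name) containing one ordinary and one hidden object.

```r
# A scratch environment, so that the global environment is left untouched.
demo_env <- new.env()
assign("visible", 1, envir = demo_env)
assign(".hidden", 2, envir = demo_env)

# By default, names starting with "." are left out.
default_listing <- ls(demo_env)
default_listing

# With all.names = TRUE, the hidden object is listed as well.
full_listing <- ls(demo_env, all.names = TRUE)
full_listing
```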

+
+
+
+

Note here that we didn’t give any arguments to ls, but +we still needed to give the parentheses to tell R to call the +function.

+

If we type ls by itself, R prints a bunch of code +instead of a listing of objects.

+
+

R +

+
+ls
+
+
+

OUTPUT +

+
function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE, 
+    pattern, sorted = TRUE) 
+{
+    if (!missing(name)) {
+        pos <- tryCatch(name, error = function(e) e)
+        if (inherits(pos, "error")) {
+            name <- substitute(name)
+            if (!is.character(name)) 
+                name <- deparse(name)
+            warning(gettextf("%s converted to character string", 
+                sQuote(name)), domain = NA)
+            pos <- name
+        }
+    }
+    all.names <- .Internal(ls(envir, all.names, sorted))
+    if (!missing(pattern)) {
+        if ((ll <- length(grep("[", pattern, fixed = TRUE))) && 
+            ll != length(grep("]", pattern, fixed = TRUE))) {
+            if (pattern == "[") {
+                pattern <- "\\["
+                warning("replaced regular expression pattern '[' by  '\\\\['")
+            }
+            else if (length(grep("[^\\\\]\\[<-", pattern))) {
+                pattern <- sub("\\[<-", "\\\\\\[<-", pattern)
+                warning("replaced '[<-' by '\\\\[<-' in regular expression pattern")
+            }
+        }
+        grep(pattern, all.names, value = TRUE)
+    }
+    else all.names
+}
+<bytecode: 0x557b0600c360>
+<environment: namespace:base>
+
+

What’s going on here?

+

Like everything in R, ls is the name of an object, and +entering the name of an object by itself prints the contents of the +object. The object x that we created earlier contains 1, 2, +3, 4, 5:

+
+

R +

+
+x
+
+
+

OUTPUT +

+
[1] 1 2 3 4 5
+
+

The object ls contains the R code that makes the +ls function work! We’ll talk more about how functions work +and start writing our own later.

+

You can use rm to delete objects you no longer need:

+
+

R +

+
+rm(x)
+
+

If you have lots of things in your environment and want to delete all +of them, you can pass the results of ls to the +rm function:

+
+

R +

+
+rm(list = ls())
+
+

In this case we’ve combined the two. Like the order of operations, +anything inside the innermost parentheses is evaluated first, and so +on.

+

In this case we’ve specified that the results of ls +should be used for the list argument in rm. +When assigning values to arguments by name, you must use the += operator!!

+

If instead we use <-, there will be unintended side +effects, or you may get an error message:

+
+

R +

+
+rm(list <- ls())
+
+
+

ERROR +

+
Error in rm(list <- ls()): ... must contain names or character strings
+
+
+
+ +
+
+

Tip: Warnings vs. Errors +

+
+

Pay attention when R does something unexpected! Errors, like above, +are thrown when R cannot proceed with a calculation. Warnings on the +other hand usually mean that the function has run, but it probably +hasn’t worked as expected.

+

In both cases, the message that R prints out usually gives you clues about how to fix the problem.
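The difference can be seen in a small sketch. Converting text that is not a number produces a warning but still returns a value, whereas adding a number to text is an error and returns nothing. tryCatch() is used here only to capture the messages so the example runs to completion.

```r
# A warning: the call still returns a value (NA), but R flags the problem.
converted <- suppressWarnings(as.numeric("abc"))
converted

# Capture the warning message instead of letting R print it.
warning_text <- tryCatch(as.numeric("abc"),
                         warning = function(w) conditionMessage(w))

# An error: the calculation cannot proceed, so no value is returned.
error_text <- tryCatch(1 + "a",
                       error = function(e) conditionMessage(e))

warning_text
error_text
```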

+
+
+
+

R Packages +

+
+

It is possible to add functions to R by writing a package, or by +obtaining a package written by someone else. As of this writing, there +are over 10,000 packages available on CRAN (the comprehensive R archive +network). R and RStudio have functionality for managing packages:

+
    +
  • You can see what packages are installed by typing +installed.packages() +
  • +
  • You can install packages by typing +install.packages("packagename"), where +packagename is the package name, in quotes.
  • +
  • You can update installed packages by typing +update.packages() +
  • +
  • You can remove a package with +remove.packages("packagename") +
  • +
  • You can make a package available for use with +library(packagename) +
  • +
+

Packages can also be viewed, loaded, and detached in the Packages tab +of the lower right panel in RStudio. Clicking on this tab will display +all of the installed packages with a checkbox next to them. If the box +next to a package name is checked, the package is loaded and if it is +empty, the package is not loaded. Click an empty box to load that +package and click a checked box to detach that package.

+

Packages can be installed and updated from the Package tab with the +Install and Update buttons at the top of the tab.
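Before loading a package in a script, it can help to check whether it is installed. The following is a sketch of one way to do that with requireNamespace(); ggplot2 is used only as an example package name, and nothing is installed by this code.

```r
pkg <- "ggplot2"

# TRUE if the package can be found and loaded, FALSE otherwise.
# Unlike library(), this does not attach the package.
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (is_installed) {
  message(pkg, " is installed; make it available with library(", pkg, ")")
} else {
  message(pkg, " is missing; install it with install.packages(\"", pkg, "\")")
}
```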

+
+
+ +
+
+

Challenge 2 +

+
+

What will be the value of each variable after each statement in the +following program?

+
+

R +

+
+mass <- 47.5
+age <- 122
+mass <- mass * 2.3
+age <- age - 20
+
+
+
+
+
+
+ +
+
+
+

R +

+
+mass <- 47.5
+
+

This will give a value of 47.5 for the variable mass

+
+

R +

+
+age <- 122
+
+

This will give a value of 122 for the variable age

+
+

R +

+
+mass <- mass * 2.3
+
+

This will multiply the existing value of 47.5 by 2.3 to give a new +value of 109.25 to the variable mass.

+
+

R +

+
+age <- age - 20
+
+

This will subtract 20 from the existing value of 122 to give a new +value of 102 to the variable age.

+
+
+
+
+
+
+ +
+
+

Challenge 3 +

+
+

Run the code from the previous challenge, and write a command to +compare mass to age. Is mass larger than age?

+
+
+
+
+
+ +
+
+

One way of answering this question in R is to use the +> to set up the following:

+
+

R +

+
+mass > age
+
+
+

OUTPUT +

+
[1] TRUE
+
+

This should yield a boolean value of TRUE since 109.25 is greater +than 102.

+
+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

Clean up your working environment by deleting the mass and age +variables.

+
+
+
+
+
+ +
+
+

We can use the rm command to accomplish this task

+
+

R +

+
+rm(age, mass)
+
+
+
+
+
+
+
+ +
+
+

Challenge 5 +

+
+

Install the following packages: ggplot2, +plyr, gapminder

+
+
+
+
+
+ +
+
+

We can use the install.packages() command to install the +required packages.

+
+

R +

+
+install.packages("ggplot2")
+install.packages("plyr")
+install.packages("gapminder")
+
+

An alternate solution, to install multiple packages with a single +install.packages() command is:

+
+

R +

+
+install.packages(c("ggplot2", "plyr", "gapminder"))
+
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
    +
  • Use RStudio to write and run R programs.
  • +
  • R has the usual arithmetic operators and mathematical +functions.
  • +
  • Use <- to assign values to variables.
  • +
  • Use ls() to list the variables in a program.
  • +
  • Use rm() to delete objects in a program.
  • +
  • Use install.packages() to install packages +(libraries).
  • +
+
+
+
+

Content from Project Management With RStudio

+
+

Last updated on 2023-10-26

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I manage my projects in R?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • Create self-contained projects in RStudio
  • +
+
+
+
+
+
+

Introduction +

+
+

The scientific process is naturally incremental, and many projects +start life as random notes, some code, then a manuscript, and eventually +everything is a bit mixed together.

+ +

Most people tend to organize their projects like this:

+
Screenshot of file manager demonstrating bad project organisation

There are many reasons why we should ALWAYS avoid this:

+
    +
  1. It is really hard to tell which version of your data is the original and which is the modified;
  2. It gets really messy because it mixes files with various extensions together;
  3. It probably takes you a lot of time to actually find things, and relate the correct figures to the exact code that has been used to generate them.
+

A good project layout will ultimately make your life easier:

+
    +
  • It will help ensure the integrity of your data;
  • +
  • It makes it simpler to share your code with someone else (a +lab-mate, collaborator, or supervisor);
  • +
  • It allows you to easily upload your code with your manuscript +submission;
  • +
  • It makes it easier to pick the project back up after a break.
  • +

A possible solution +

+
+

Fortunately, there are tools and packages which can help you manage +your work effectively.

+

One of the most powerful and useful aspects of RStudio is its project +management functionality. We’ll be using this today to create a +self-contained, reproducible project.

+
+
+ +
+
+

Challenge 1: Creating a self-contained +project +

+
+

We’re going to create a new project in RStudio:

+
    +
  1. Click the “File” menu button, then “New Project”.
  2. Click “New Directory”.
  3. Click “New Project”.
  4. Type in the name of the directory to store your project, e.g. “my_project”.
  5. If available, select the checkbox for “Create a git repository.”
  6. Click the “Create Project” button.
+
+
+
+

The simplest way to open an RStudio project once it has been created +is to click through your file system to get to the directory where it +was saved and double click on the .Rproj file. This will +open RStudio and start your R session in the same directory as the +.Rproj file. All your data, plots and scripts will now be +relative to the project directory. RStudio projects have the added +benefit of allowing you to open multiple projects at the same time each +open to its own project directory. This allows you to keep multiple +projects open without them interfering with each other.
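In practice this means your scripts can refer to files by paths relative to the project directory. The sketch below assumes a hypothetical file data/gapminder_data.csv; substitute whatever files your project actually contains.

```r
# The working directory; this is the project directory when the
# project was opened through its .Rproj file.
getwd()

# Build a path relative to the project directory (hypothetical file name).
data_path <- file.path("data", "gapminder_data.csv")
data_path

# Check that the file is there before trying to read it.
file.exists(data_path)
```

Because no absolute path appears in the script, the same code keeps working when the project folder is moved or shared.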

+
+
+ +
+
+

Challenge 2: Opening an RStudio project +through the file system +

+
+
    +
  1. Exit RStudio.
  2. Navigate to the directory where you created a project in Challenge 1.
  3. Double click on the .Rproj file in that directory.
+
+
+
+

Best practices for project organization +

+
+

Although there is no “best” way to lay out a project, there are some +general principles to adhere to that will make project management +easier:

+
+

Treat data as read only +

+

This is probably the most important goal of setting up a project. Data is typically time consuming and/or expensive to collect. Working with it interactively (e.g., in Excel), where it can be modified, means you are never sure of where the data came from, or how it has been modified since collection. It is therefore a good idea to treat your data as “read-only”.

+
+
+

Data Cleaning +

+

In many cases your data will be “dirty”: they will need significant preprocessing to get into a format R (or any other programming language) will find useful. This task is sometimes called “data munging”. Storing the data-cleaning scripts in a separate folder, and creating a second “read-only” data folder to hold the “cleaned” data sets, can prevent confusion between the two sets.

+
+
+

Treat generated output as disposable +

+

Anything generated by your scripts should be treated as disposable: +it should all be able to be regenerated from your scripts.

+

There are lots of different ways to manage this output. Having an output folder with different sub-directories for each separate analysis makes it easier later, since many analyses are exploratory and don’t end up being used in the final project, and some analyses get shared between projects.

+
+
+ +
+
+

Tip: Good Enough Practices for Scientific +Computing +

+
+

Good +Enough Practices for Scientific Computing gives the following +recommendations for project organization:

+
    +
  1. Put each project in its own directory, which is named after the +project.
  2. +
  3. Put text documents associated with the project in the +doc directory.
  4. +
  5. Put raw data and metadata in the data directory, and +files generated during cleanup and analysis in a results +directory.
  6. +
  7. Put source for the project’s scripts and programs in the +src directory, and programs brought in from elsewhere or +compiled locally in the bin directory.
  8. +
  9. Name all files to reflect their content or function.
  10. +
+
+
+
+
+
+
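The layout recommended above can also be set up from the R console. The following is a minimal sketch, assuming a hypothetical project called my_project that is created under a temporary directory so that none of your real files are touched; the folder names follow the list above.

```r
# Hypothetical project root, placed in a temporary directory for illustration
project_dir <- file.path(tempdir(), "my_project")

# Sub-directories named in the recommendations above
sub_dirs <- c("doc", "data", "results", "src", "bin")

for (sub_dir in sub_dirs) {
  # recursive = TRUE also creates project_dir itself if it is missing
  dir.create(file.path(project_dir, sub_dir),
             recursive = TRUE, showWarnings = FALSE)
}

# List what was created
list.files(project_dir)
```

In a real project you would replace the temporary location with the directory that holds your .Rproj file.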

Separate function definition and application +

+

One of the more effective ways to work with R is to start by writing +the code you want to run directly in a .R script, and then running the +selected lines (either using the keyboard shortcuts in RStudio or +clicking the “Run” button) in the interactive R console.

+

When your project is in its early stages, the initial .R script file usually contains many lines of directly executed code. As it matures, reusable chunks get pulled into their own functions. It’s a good idea to keep function definitions and analysis scripts in two separate folders: one to store useful functions that you’ll reuse across analyses and projects, and one to store the analysis scripts.
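As a rough sketch of this separation, a file of function definitions can be loaded into an analysis script with source(). The file name functions.R and the helper kg_to_g() below are hypothetical, and the file is written to a temporary directory only to keep the example self-contained.

```r
# Hypothetical file holding reusable function definitions
functions_file <- file.path(tempdir(), "functions.R")
writeLines(c(
  "# Convert a weight in kilograms to grams",
  "kg_to_g <- function(kg) {",
  "  kg * 1000",
  "}"
), functions_file)

# In the analysis script: load the definitions first, then apply them
source(functions_file)
kg_to_g(2.1)
```

Keeping the definitions in their own file means several analysis scripts can source() the same functions without copying them.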

+
+
+

Save the data in the data directory +

+

Now that we have a good directory structure, we will save the data file in the data/ directory.

+
+
+ +
+
+

Challenge 3 +

+
+

Download the gapminder data from here.

+
    +
  1. Download the file (right mouse click on the link above -> “Save +link as” / “Save file as”, or click on the link and after the page +loads, press Ctrl+S or choose File -> “Save +page as”)
  2. +
  3. Make sure it’s saved under the name +gapminder_data.csv +
  4. +
  5. Save the file in the data/ folder within your +project.
  6. +
+

We will load and inspect these data later.

+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

It is useful to get some general idea about the dataset, directly +from the command line, before loading it into R. Understanding the +dataset better will come in handy when making decisions on how to load +it in R. Use the command-line shell to answer the following +questions:

+
    +
  1. What is the size of the file?
  2. +
  3. How many rows of data does it contain?
  4. +
  5. What kinds of values are stored in this file?
  6. +
+
+
+
+
+
+ +
+
+

By running these commands in the shell:

+
+

SH +

+
ls -lh data/gapminder_data.csv
+
+
+

OUTPUT +

+
-rw-r--r-- 1 runner docker 80K Oct 26 09:54 data/gapminder_data.csv
+
+

The file size is 80K.

+
+

SH +

+
wc -l data/gapminder_data.csv
+
+
+

OUTPUT +

+
1705 data/gapminder_data.csv
+
+

There are 1705 lines: one header line followed by 1704 rows of data. The data looks like:

+
+

SH +

+
head data/gapminder_data.csv
+
+
+

OUTPUT +

+
country,year,pop,continent,lifeExp,gdpPercap
+Afghanistan,1952,8425333,Asia,28.801,779.4453145
+Afghanistan,1957,9240934,Asia,30.332,820.8530296
+Afghanistan,1962,10267083,Asia,31.997,853.10071
+Afghanistan,1967,11537966,Asia,34.02,836.1971382
+Afghanistan,1972,13079460,Asia,36.088,739.9811058
+Afghanistan,1977,14880372,Asia,38.438,786.11336
+Afghanistan,1982,12881816,Asia,39.854,978.0114388
+Afghanistan,1987,13867957,Asia,40.822,852.3959448
+Afghanistan,1992,16317921,Asia,41.674,649.3413952
+
+
+
+
+
+
+
+ +
+
+

Tip: command line in RStudio +

+
+

The Terminal tab in the console pane provides a convenient place within RStudio to interact directly with the command line.

+
+
+
+
+
+

Working directory +

+

Knowing R’s current working directory is important because when you +need to access other files (for example, to import a data file), R will +look for them relative to the current working directory.

+

Each time you create a new RStudio Project, it will create a new +directory for that project. When you open an existing +.Rproj file, it will open that project and set R’s working +directory to the folder that file is in.

+
+
+ +
+
+

Challenge 5 +

+
+

You can check the current working directory with the +getwd() command, or by using the menus in RStudio.

+
    +
  1. In the console, type getwd() (“wd” is short for +“working directory”) and hit Enter.
  2. +
  3. In the Files pane, double click on the data folder to +open it (or navigate to any other folder you wish). To get the Files +pane back to the current working directory, click “More” and then select +“Go To Working Directory”.
  4. +
+

You can change the working directory with setwd(), or by +using RStudio menus.

+
    +
  1. In the console, type setwd("data") and hit Enter. Type +getwd() and hit Enter to see the new working +directory.
  2. +
  3. In the menus at the top of the RStudio window, click the “Session” menu button, and then select “Set Working Directory” and then “Choose Directory”. Next, in the file browser window that opens, navigate back to the project directory, and click “Open”. Note that a setwd command will automatically appear in the console.
  4. +
+
+
+
+
+
+ +
+
+

Tip: File does not exist errors +

+
+

When you’re attempting to reference a file in your R code and you’re +getting errors saying the file doesn’t exist, it’s a good idea to check +your working directory. You need to either provide an absolute path to +the file, or you need to make sure the file is saved in the working +directory (or a subfolder of the working directory) and provide a +relative path.
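A quick way to diagnose such errors from the console is sketched below. The relative path is only an example and assumes the file layout used earlier in this lesson; file.exists() simply reports FALSE if the file is not where R is looking.

```r
# Where is R currently looking for files?
getwd()

# Build a relative path in a platform-independent way (example path)
data_path <- file.path("data", "gapminder_data.csv")

# TRUE if the file can be found relative to the working directory
file.exists(data_path)

# Show the absolute path that the relative path resolves to
normalizePath(data_path, mustWork = FALSE)
```

If file.exists() returns FALSE, compare the output of getwd() with the location where you saved the file.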

+
+
+
+
+
+

Version Control +

+

It is important to use version control with projects. Go here +for a good lesson which describes using Git with RStudio.

+
+
+ +
+
+

Keypoints +

+
+
    +
  • Use RStudio to create and manage projects with consistent +layout.
  • +
  • Treat raw data as read-only.
  • +
  • Treat generated output as disposable.
  • +
  • Separate function definition and application.
  • +
+
+
+
+
+

Content from Seeking Help

+
+

Last updated on 2023-10-26

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I get help in R?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • To be able to read R help files for functions and special +operators.
  • +
  • To be able to use CRAN task views to identify packages to solve a +problem.
  • +
  • To be able to seek help from your peers.
  • +
+
+
+
+
+
+

Reading Help Files +

+
+

R, and every package, provides help files for functions. The general syntax to search for help on any function, “function_name”, that is in a package loaded into your namespace (your interactive R session) is:

+
+

R +

+
+?function_name
+help(function_name)
+
+

For example, take a look at the help file for write.table(); we will be using a similar function in an upcoming episode.

+
+

R +

+
+?write.table()
+
+

This will load up a help page in RStudio (or as plain text in R +itself).

+

Each help page is broken down into sections:

+
    +
  • Description: An extended description of what the function does.
  • +
  • Usage: The arguments of the function and their default values (which +can be changed).
  • +
  • Arguments: An explanation of the data each argument is +expecting.
  • +
  • Details: Any important details to be aware of.
  • +
  • Value: The data the function returns.
  • +
  • See Also: Any related functions you might find useful.
  • +
  • Examples: Some examples for how to use the function.
  • +
+

Different functions might have different sections, but these are the +main ones you should be aware of.
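The information in the Usage section can also be inspected directly from the console. This is a small sketch using the base functions args() and formals(), with write.table() as the example function.

```r
# Print the function signature: argument names and their default values
args(write.table)

# Names of all arguments that write.table accepts
names(formals(write.table))

# Pull out a single default value; the separator defaults to a single space
formals(write.table)$sep
```

This does not replace the help page, which also explains what each argument means, but it is a fast reminder of argument names and defaults.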

+

Notice how related functions might call for the same help file:

+
+

R +

+
+?write.table()
+?write.csv()
+
+

This is because these functions have very similar applicability and +often share the same arguments as inputs to the function, so package +authors often choose to document them together in a single help +file.

+
+
+ +
+
+

Tip: Running Examples +

+
+

From within the function help page, you can highlight code in the +Examples and hit Ctrl+Return to run it in RStudio +console. This gives you a quick way to get a feel for how a function +works.

+
+
+
+
+
+ +
+
+

Tip: Reading Help Files +

+
+

One of the most daunting aspects of R is the large number of functions available. It would be prohibitive, if not impossible, to remember the correct usage for every function you use. Luckily, using the help files means you don’t have to remember that!

+
+
+
+

Special Operators +

+
+

To seek help on special operators, use quotes or backticks:

+
+

R +

+
+?"<-"
+?`<-`
+
+

Getting Help with Packages +

+
+

Many packages come with “vignettes”: tutorials and extended example +documentation. Without any arguments, vignette() will list +all vignettes for all installed packages; +vignette(package="package-name") will list all available +vignettes for package-name, and +vignette("vignette-name") will open the specified +vignette.

+

If a package doesn’t have any vignettes, you can usually find help by +typing help("package-name").

+

RStudio also has a set of excellent cheatsheets for +many packages.

+

When You Remember Part of the Function Name +

+
+

If you’re not sure what package a function is in or how it’s +specifically spelled, you can do a fuzzy search:

+
+

R +

+
+??function_name
+
+

A fuzzy search is when you search for an approximate string match. +For example, you may remember that the function to set your working +directory includes “set” in its name. You can do a fuzzy search to help +you identify the function:

+
+

R +

+
+??set
+
+
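As a complement to the fuzzy help search, the base function apropos() returns the names of objects on the current search path that match a pattern. The sketch below looks for names beginning with “set”; the exact list you see depends on which packages you have attached.

```r
# List objects on the search path whose names start with "set"
apropos("^set")

# Check whether the function we were trying to remember is among them
"setwd" %in% apropos("^set")
```

Unlike the fuzzy search, apropos() only looks at packages that are already attached, so it will not find functions in packages you have not loaded.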

When You Have No Idea Where to Begin +

+
+

If you don’t know what function or package you need to use, CRAN Task Views is a specially maintained list of packages grouped into fields. This can be a good starting point.

+

When Your Code Doesn’t Work: Seeking Help from Your Peers +

+
+

If you’re having trouble using a function, 9 times out of 10, the question you have has already been answered on Stack Overflow. You can search using the [r] tag. Please make sure to see their page on how to ask a good question.

+

If you can’t find the answer, there are a few useful functions to +help you ask your peers:

+
+

R +

+
+?dput
+
+

The dput() function will dump the data you’re working with into a format that can be copied and pasted by others into their own R session.
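As a small illustration of dput(), the sketch below uses a made-up two-row data frame (small_df is a hypothetical name). It prints the text representation and then checks that the representation can be read back with dget().

```r
# A tiny, made-up data frame to share with a colleague
small_df <- data.frame(id = 1:2, name = c("a", "b"))

# Print a text representation that can be pasted into a question or email
dput(small_df)

# Round trip: write the representation to a file and read it back
dput_file <- tempfile(fileext = ".R")
dput(small_df, file = dput_file)
restored <- dget(dput_file)

# Compare the restored object with the original
all.equal(restored, small_df)
```

Whoever receives the printed text can assign it to a variable in their own session and work with the same data.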

+
+

R +

+
+sessionInfo()
+
+
+

OUTPUT +

+
R version 4.3.1 (2023-06-16)
+Platform: x86_64-pc-linux-gnu (64-bit)
+Running under: Ubuntu 22.04.3 LTS
+
+Matrix products: default
+BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
+LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
+
+locale:
+ [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
+ [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
+ [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
+[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
+
+time zone: UTC
+tzcode source: system (glibc)
+
+attached base packages:
+[1] stats     graphics  grDevices utils     datasets  methods   base     
+
+loaded via a namespace (and not attached):
+[1] compiler_4.3.1    tools_4.3.1       rstudioapi_0.15.0 yaml_2.3.7       
+[5] knitr_1.43        xfun_0.40         renv_1.0.3        evaluate_0.21    
+
+

sessionInfo() will print out your current version of R, as well as any packages you have loaded. This can be useful for others to help reproduce and debug your issue.

+
+
+ +
+
+

Challenge 1 +

+
+

Look at the help page for the c function. What kind of +vector do you expect will be created if you evaluate the following:

+
+

R +

+
+c(1, 2, 3)
+c('d', 'e', 'f')
+c(1, 2, 'f')
+
+
+
+
+
+
+ +
+
+

The c() function creates a vector, in which all elements +are of the same type. In the first case, the elements are numeric, in +the second, they are characters, and in the third they are also +characters: the numeric values are “coerced” to be characters.

+
+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

Look at the help for the paste function. You will need +to use it later. What’s the difference between the sep and +collapse arguments?

+
+
+
+
+
+ +
+
+

To look at the help for the paste() function, use:

+
+

R +

+
+help("paste")
+?paste
+
+

The difference between sep and collapse is +a little tricky. The paste function accepts any number of +arguments, each of which can be a vector of any length. The +sep argument specifies the string used between concatenated +terms — by default, a space. The result is a vector as long as the +longest argument supplied to paste. In contrast, +collapse specifies that after concatenation the elements +are collapsed together using the given separator, the result +being a single string.

+

It is important to call the arguments explicitly by typing out the argument name, e.g. sep = ",", so the function understands to use the “,” as a separator and not a term to concatenate. For example:

+
+

R +

+
+paste(c("a","b"), "c")
+
+
+

OUTPUT +

+
[1] "a c" "b c"
+
+
+

R +

+
+paste(c("a","b"), "c", ",")
+
+
+

OUTPUT +

+
[1] "a c ," "b c ,"
+
+
+

R +

+
+paste(c("a","b"), "c", sep = ",")
+
+
+

OUTPUT +

+
[1] "a,c" "b,c"
+
+
+

R +

+
+paste(c("a","b"), "c", collapse = "|")
+
+
+

OUTPUT +

+
[1] "a c|b c"
+
+
+

R +

+
+paste(c("a","b"), "c", sep = ",", collapse = "|")
+
+
+

OUTPUT +

+
[1] "a,c|b,c"
+
+

(For more information, scroll to the bottom of the +?paste help page and look at the examples, or try +example('paste').)

+
+
+
+
+
+
+ +
+
+

Challenge 3 +

+
+

Use help to find a function (and its associated parameters) that you +could use to load data from a tabular file in which columns are +delimited with “\t” (tab) and the decimal point is a “.” (period). This +check for decimal separator is important, especially if you are working +with international colleagues, because different countries have +different conventions for the decimal point (i.e. comma vs period). +Hint: use ??"read table" to look up functions related to +reading in tabular data.

+
+
+
+
+
+ +
+
+

The standard R function for reading tab-delimited files with a period +decimal separator is read.delim(). You can also do this with +read.table(file, sep="\t") (the period is the +default decimal separator for read.table()), +although you may have to change the comment.char argument +as well if your data file contains hash (#) characters.
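The two approaches described in this solution can be compared on a small example. This is a sketch under stated assumptions: the tab-delimited file is made up and written to a temporary location, and header = TRUE is passed to read.table() explicitly because, unlike read.delim(), it does not assume a header row by default.

```r
# Write a tiny, made-up tab-delimited file
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c("coat\tweight", "calico\t2.1", "black\t5.0"), tsv_file)

# Option 1: read.delim() defaults to tab-separated data with "." as decimal
from_delim <- read.delim(tsv_file)

# Option 2: read.table() with the separator and decimal mark spelled out
from_table <- read.table(tsv_file, header = TRUE, sep = "\t", dec = ".")

# Both calls should produce the same data frame for this file
all.equal(from_delim, from_table)
str(from_delim)
```

For files that use a comma as the decimal mark, the dec argument would be changed to dec = ",".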

+
+
+
+
+

Other Resources +

+
+ +
+
+ +
+
+

Keypoints +

+
+
    +
  • Use help() to get online help in R.
  • +
+
+
+
+

Content from Data Structures

+
+

Last updated on 2023-10-26

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I read data in R?
  • +
  • What are the basic data types in R?
  • +
  • How do I represent categorical information in R?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • To be able to identify the 5 main data types.
  • +
  • To begin exploring data frames, and understand how they are related +to vectors and lists.
  • +
  • To be able to ask questions from R about the type, class, and +structure of an object.
  • +
  • To understand the information of the attributes “names”, “class”, +and “dim”.
  • +
+
+
+
+
+
+

One of R’s most powerful features is its ability to deal with tabular +data - such as you may already have in a spreadsheet or a CSV file. +Let’s start by making a toy dataset in your data/ +directory, called feline-data.csv:

+
+

R +

+
+cats <- data.frame(coat = c("calico", "black", "tabby"),
+                    weight = c(2.1, 5.0, 3.2),
+                    likes_string = c(1, 0, 1))
+
+

We can now save cats as a CSV file. It is good practice +to call the argument names explicitly so the function knows what default +values you are changing. Here we are setting +row.names = FALSE. Recall you can use +?write.csv to pull up the help file to check out the +argument names and their default values.

+
+

R +

+
+write.csv(x = cats, file = "data/feline-data.csv", row.names = FALSE)
+
+

The contents of the new file, feline-data.csv:

+
+

R +

+
coat,weight,likes_string
+calico,2.1,1
+black,5.0,0
+tabby,3.2,1
+
+
+
+ +
+
+

Tip: Editing Text files in R +

+
+

Alternatively, you can create data/feline-data.csv using +a text editor (Nano), or within RStudio with the File -> New +File -> Text File menu item.

+
+
+
+

We can load this into R via the following:

+
+

R +

+
+cats <- read.csv(file = "data/feline-data.csv")
+cats
+
+
+

OUTPUT +

+
    coat weight likes_string
+1 calico    2.1            1
+2  black    5.0            0
+3  tabby    3.2            1
+
+

The read.table function is used for reading in tabular data stored in a text file where the columns of data are separated by punctuation characters, as in CSV files (csv = comma-separated values). Tabs and commas are the most common punctuation characters used to separate or delimit data points in such files. For convenience R provides 2 other versions of read.table. These are: read.csv for files where the data are separated with commas and read.delim for files where the data are separated with tabs. Of these three functions read.csv is the most commonly used. If needed it is possible to override the default delimiting punctuation marks for both read.csv and read.delim.
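Overriding the default delimiter is done with the sep argument. The sketch below assumes a made-up file that uses semicolons between columns, written to a temporary location purely for illustration.

```r
# Write a tiny, made-up file that uses ";" as the column separator
semi_file <- tempfile(fileext = ".csv")
writeLines(c("coat;weight", "calico;2.1", "black;5.0"), semi_file)

# Tell read.csv to split columns on ";" instead of the default ","
cats_semi <- read.csv(file = semi_file, sep = ";")
str(cats_semi)
```

Without sep = ";", read.csv would look for commas, find none, and read each line as a single column.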

+
+
+ +
+
+

Check your data for factors +

+
+

In recent R versions, the default way R handles textual data has changed. Previously, R automatically interpreted text data as a format called “factors”. But there is an easier format that is called “character”. We will hear about factors later, and what to use them for. For now, remember that in most cases, they are not needed and only complicate your life, which is why newer R versions read in text as “character”. Check now whether your version of R has automatically created factors, and if so convert them to “character” format:

+
    +
  1. Check the data types of your input by typing +str(cats) +
  2. +
  3. In the output, look at the type codes after the colons: If you see only “num”, “int” and “chr”, you can continue with the lesson and skip this box. If you find “Factor”, continue to step 3.
  4. +
  5. Prevent R from automatically creating “factor” data. That can be +done by the following code: +options(stringsAsFactors = FALSE). Then, re-read the cats +table for the change to take effect.
  6. +
  7. You must set this option every time you restart R. To not forget +this, include it in your analysis script before you read in any data, +for example in one of the first lines.
  8. +
  9. In R 4.0.0 and later, text data is no longer converted to factors by default. So you can install this or a newer version to avoid this problem. If you are working on an institute or company computer, ask your administrator to do it.
  10. +
+
+
+
+
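Besides the global option described in the box above, the same behaviour can be requested for a single call through the stringsAsFactors argument of read.csv, and an existing factor can be converted with as.character(). The sketch below uses a made-up temporary file and hypothetical variable names.

```r
# Write a tiny, made-up CSV file
csv_file <- tempfile(fileext = ".csv")
writeLines(c("coat,weight", "calico,2.1", "black,5.0"), csv_file)

# Ask for character columns explicitly, whatever the R version
cats_chr <- read.csv(file = csv_file, stringsAsFactors = FALSE)
class(cats_chr$coat)

# Convert a factor that already exists back to character
coat_factor <- factor(c("calico", "black"))
as.character(coat_factor)
```

Setting the argument in the call keeps the script's behaviour the same on older and newer R installations.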

We can begin exploring our dataset right away, pulling out columns by +specifying them using the $ operator:

+
+

R +

+
+cats$weight
+
+
+

OUTPUT +

+
[1] 2.1 5.0 3.2
+
+
+

R +

+
+cats$coat
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+

We can do other operations on the columns:

+
+

R +

+
+## Say we discovered that the scale weighs two kg light:
+cats$weight + 2
+
+
+

OUTPUT +

+
[1] 4.1 7.0 5.2
+
+
+

R +

+
+paste("My cat is", cats$coat)
+
+
+

OUTPUT +

+
[1] "My cat is calico" "My cat is black"  "My cat is tabby" 
+
+

But what about

+
+

R +

+
+cats$weight + cats$coat
+
+
+

ERROR +

+
Error in cats$weight + cats$coat: non-numeric argument to binary operator
+
+

Understanding what happened here is key to successfully analyzing +data in R.

+
+

Data Types +

+

If you guessed that the last command will return an error because +2.1 plus "black" is nonsense, you’re right - +and you already have some intuition for an important concept in +programming called data types. We can ask what type of data +something is:

+
+

R +

+
+typeof(cats$weight)
+
+
+

OUTPUT +

+
[1] "double"
+
+

There are 5 main types: double, integer, +complex, logical and character. +For historic reasons, double is also called +numeric.

+
+

R +

+
+typeof(3.14)
+
+
+

OUTPUT +

+
[1] "double"
+
+
+

R +

+
+typeof(1L) # The L suffix forces the number to be an integer, since by default R stores numbers as doubles
+
+
+

OUTPUT +

+
[1] "integer"
+
+
+

R +

+
+typeof(1+1i)
+
+
+

OUTPUT +

+
[1] "complex"
+
+
+

R +

+
+typeof(TRUE)
+
+
+

OUTPUT +

+
[1] "logical"
+
+
+

R +

+
+typeof('banana')
+
+
+

OUTPUT +

+
[1] "character"
+
+
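The remark that double is also called numeric can be seen by comparing typeof() with class(). This is a brief sketch using only base functions.

```r
# typeof() reports the storage type; class() reports "numeric" for a double
typeof(3.14)
class(3.14)

# is.numeric() is TRUE for both doubles and integers
is.numeric(3.14)
is.numeric(1L)

# is.double() is TRUE only for doubles
is.double(1L)
```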

No matter how complicated our analyses become, all data in R is +interpreted as one of these basic data types. This strictness has some +really important consequences.

+

A user has added details of another cat. This information is in the +file data/feline-data_v2.csv.

+
+

R +

+
+file.show("data/feline-data_v2.csv")
+
+
+

R +

+
coat,weight,likes_string
+calico,2.1,1
+black,5.0,0
+tabby,3.2,1
+tabby,2.3 or 2.4,1
+
+

Load the new cats data like before, and check what type of data we +find in the weight column:

+
+

R +

+
+cats <- read.csv(file="data/feline-data_v2.csv")
+typeof(cats$weight)
+
+
+

OUTPUT +

+
[1] "character"
+
+

Oh no, our weights aren’t the double type anymore! If we try to do +the same math we did on them before, we run into trouble:

+
+

R +

+
+cats$weight + 2
+
+
+

ERROR +

+
Error in cats$weight + 2: non-numeric argument to binary operator
+
+

What happened? The cats data we are working with is something called a data frame. Data frames are one of the most common and versatile types of data structures we will work with in R. A given column in a data frame cannot be composed of different data types. In this case, R cannot read everything in the data frame column weight as a double, so the data type of the entire column changes to something that is suitable for everything in the column.
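One way to find the entries that stop a column from being read as numbers is sketched below. The vector weights is a hypothetical stand-in that mirrors the values of the weight column in this example; entries that cannot be converted become NA.

```r
# Stand-in for a column that was read in as character
weights <- c("2.1", "5", "3.2", "2.3 or 2.4")

# Try the conversion; entries that are not valid numbers become NA
# (suppressWarnings hides the "NAs introduced by coercion" warning)
converted <- suppressWarnings(as.numeric(weights))

# Show the original text of the entries that failed to convert
weights[is.na(converted)]
```

In a larger data set this points you to the rows that need cleaning before the column can be treated as numeric.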

+

When R reads a csv file, it reads it in as a data frame. Thus, when we loaded the cats csv file, it was stored as a data frame. We can recognize data frames by the first row that is written by the str() function:

+
+

R +

+
+str(cats)
+
+
+

OUTPUT +

+
'data.frame':	4 obs. of  3 variables:
+ $ coat        : chr  "calico" "black" "tabby" "tabby"
+ $ weight      : chr  "2.1" "5" "3.2" "2.3 or 2.4"
+ $ likes_string: int  1 0 1 1
+
+

Data frames are composed of rows and columns, where each +column has the same number of rows. Different columns in a data frame +can be made up of different data types (this is what makes them so +versatile), but everything in a given column needs to be the same type +(e.g., vector, factor, or list).
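The idea that each column has a single type while different columns may differ can be checked column by column. The sketch below rebuilds the toy cats data under the hypothetical name cats_demo so that it runs on its own.

```r
# Rebuild the toy data set; stringsAsFactors = FALSE keeps text as character
cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
                        weight = c(2.1, 5.0, 3.2),
                        likes_string = c(1, 0, 1),
                        stringsAsFactors = FALSE)

# Apply typeof() to every column: one type per column
sapply(cats_demo, typeof)

# Number of rows and columns
dim(cats_demo)
```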

+

Let’s explore more about different data structures and how they +behave. For now, let’s remove that extra line from our cats data and +reload it, while we investigate this behavior further:

+

feline-data.csv:

+
coat,weight,likes_string
+calico,2.1,1
+black,5.0,0
+tabby,3.2,1
+

And back in RStudio:

+
+

R +

+
+cats <- read.csv(file="data/feline-data.csv")
+
+
+
+

Vectors and Type Coercion +

+

To better understand this behavior, let’s meet another of the data +structures: the vector.

+
+

R +

+
+my_vector <- vector(length = 3)
+my_vector
+
+
+

OUTPUT +

+
[1] FALSE FALSE FALSE
+
+

A vector in R is essentially an ordered list of things, with the +special condition that everything in the vector must be the same +basic data type. If you don’t choose the datatype, it’ll default to +logical; or, you can declare an empty vector of whatever +type you like.

+
+

R +

+
+another_vector <- vector(mode='character', length=3)
+another_vector
+
+
+

OUTPUT +

+
[1] "" "" ""
+
+

You can check if something is a vector:

+
+

R +

+
+str(another_vector)
+
+
+

OUTPUT +

+
 chr [1:3] "" "" ""
+
+

The somewhat cryptic output from this command indicates the basic +data type found in this vector - in this case chr, +character; an indication of the number of things in the vector - +actually, the indexes of the vector, in this case [1:3]; +and a few examples of what’s actually in the vector - in this case empty +character strings. If we similarly do

+
+

R +

+
+str(cats$weight)
+
+
+

OUTPUT +

+
 num [1:3] 2.1 5 3.2
+
+

we see that cats$weight is a vector, too - the +columns of data we load into R data.frames are all vectors, and +that’s the root of why R forces everything in a column to be the same +basic data type.

+
+
+ +
+
+

Discussion 1 +

+
+

Why is R so opinionated about what we put in our columns of data? How +does this help us?

+
+
+ +
+
+

By keeping everything in a column the same, we allow ourselves to +make simple assumptions about our data; if you can interpret one entry +in the column as a number, then you can interpret all of them +as numbers, so we don’t have to check every time. This consistency is +what people mean when they talk about clean data; in the long +run, strict consistency goes a long way to making our lives easier in +R.

+
+
+
+
+
+
+
+
+

Coercion by combining vectors +

+

You can also make vectors with explicit contents with the combine +function:

+
+

R +

+
+combine_vector <- c(2,6,3)
+combine_vector
+
+
+

OUTPUT +

+
[1] 2 6 3
+
+

Given what we’ve learned so far, what do you think the following will +produce?

+
+

R +

+
+quiz_vector <- c(2,6,'3')
+
+

This is something called type coercion, and it is the source +of many surprises and the reason why we need to be aware of the basic +data types and how R will interpret them. When R encounters a mix of +types (here double and character) to be combined into a single vector, +it will force them all to be the same type. Consider:

+
+

R +

+
+coercion_vector <- c('a', TRUE)
+coercion_vector
+
+
+

OUTPUT +

+
[1] "a"    "TRUE"
+
+
+

R +

+
+another_coercion_vector <- c(0, TRUE)
+another_coercion_vector
+
+
+

OUTPUT +

+
[1] 0 1
+
+
+
+

The type hierarchy +

+

The coercion rules go: logical -> +integer -> double (“numeric”) +-> complex -> character, where -> can +be read as are transformed into. For example, combining +logical and character transforms the result to +character:

+
+

R +

+
+c('a', TRUE)
+
+
+

OUTPUT +

+
[1] "a"    "TRUE"
+
+
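The remaining steps of the hierarchy can be checked in the same way. This short sketch combines neighbouring types with c() and asks for the type of the result.

```r
# logical combined with integer becomes integer
typeof(c(TRUE, 1L))

# integer combined with double becomes double
typeof(c(1L, 2.5))

# double combined with complex becomes complex
typeof(c(2.5, 1+1i))

# complex combined with character becomes character
typeof(c(1+1i, "a"))
```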

A quick way to recognize character vectors is by the +quotes that enclose them when they are printed.

+

You can try to force coercion against this flow using the +as. functions:

+
+

R +

+
+character_vector_example <- c('0','2','4')
+character_vector_example
+
+
+

OUTPUT +

+
[1] "0" "2" "4"
+
+
+

R +

+
+character_coerced_to_double <- as.double(character_vector_example)
+character_coerced_to_double
+
+
+

OUTPUT +

+
[1] 0 2 4
+
+
+

R +

+
+double_coerced_to_logical <- as.logical(character_coerced_to_double)
+double_coerced_to_logical
+
+
+

OUTPUT +

+
[1] FALSE  TRUE  TRUE
+
+

As you can see, some surprising things can happen when R forces one +basic data type into another! Nitty-gritty of type coercion aside, the +point is: if your data doesn’t look like what you thought it was going +to look like, type coercion may well be to blame; make sure everything +is the same type in your vectors and your columns of data.frames, or you +will get nasty surprises!
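A few more conversions of this kind are sketched below, to show how values that do not fit the target type are handled. The inputs are made up for illustration.

```r
# Text that does not look like a number becomes NA (R also issues a warning,
# hidden here with suppressWarnings)
suppressWarnings(as.numeric(c("1.5", "abc")))

# Only certain spellings are recognised as logical values; "yes" gives NA
as.logical(c("TRUE", "true", "T", "yes"))

# Converting doubles to integers drops the fractional part
as.integer(c(3.9, -3.9))
```

Checking for unexpected NA values after a conversion, for example with is.na(), is a simple way to catch these cases early.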

+

But coercion can also be very useful! For example, in our +cats data likes_string is numeric, but we know +that the 1s and 0s actually represent TRUE and +FALSE (a common way of representing them). We should use +the logical datatype here, which has two states: +TRUE or FALSE, which is exactly what our data +represents. We can ‘coerce’ this column to be logical by +using the as.logical function:

+
+

R +

+
+cats$likes_string
+
+
+

OUTPUT +

+
[1] 1 0 1
+
+
+

R +

+
+cats$likes_string <- as.logical(cats$likes_string)
+cats$likes_string
+
+
+

OUTPUT +

+
[1]  TRUE FALSE  TRUE
+
+
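One practical benefit of the logical type is that TRUE counts as 1 and FALSE as 0 in arithmetic, so counting and proportions come for free. The sketch below uses a stand-alone vector with the same values as the converted column.

```r
# Same values as the converted likes_string column
likes_string <- c(TRUE, FALSE, TRUE)

# Number of cats that like string
sum(likes_string)

# Proportion of cats that like string
mean(likes_string)
```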
+
+ +
+
+

Challenge 1 +

+
+

An important part of every data analysis is cleaning the input data. If you know that the input data is all of the same format (e.g. numbers), your analysis is much easier! Clean the cat data set from the chapter about type coercion.

+
+

Copy the code template +

+

Create a new script in RStudio and copy and paste the following code. +Then move on to the tasks below, which help you to fill in the gaps +(______).

+
# Read data
+cats <- read.csv("data/feline-data_v2.csv")
+
+# 1. Print the data
+_____
+
+# 2. Show an overview of the table with all data types
+_____(cats)
+
+# 3. The "weight" column has the incorrect data type __________.
+#    The correct data type is: ____________.
+
+# 4. Correct the 4th weight data point with the mean of the two given values
+cats$weight[4] <- 2.35
+#    print the data again to see the effect
+cats
+
+# 5. Convert the weight to the right data type
+cats$weight <- ______________(cats$weight)
+
+#    Calculate the mean to test yourself
+mean(cats$weight)
+
+# If you see the correct mean value (and not NA), you did the exercise
+# correctly!
+
+
+

Instructions for the tasks +

+
+ +

Execute the first statement (read.csv(...)). Then print the data to the console.

+
+
+
+
+
+
+
+ +
+
+

Show the content of any variable by typing its name.

+
+

Solution to Challenge 1.1 +

+

Two correct solutions:

+
cats
+print(cats)
+
+
+
+
+
+
+
+ +
+
+

2. Overview of the data types +

+
+

The data type of your data is as important as the data itself. Use a +function we saw earlier to print out the data types of all columns of +the cats table.

+
+
+
+
+
+ +
+
+

In the chapter “Data types” we saw two functions that can show data +types. One printed just a single word, the data type name. The other +printed a short form of the data type, and the first few values. We need +the second here.

+
+
+
+
+
+
+ +
+
+

Challenge 1 (continued) +

+
+
+

Solution to Challenge 1.2

+
str(cats)
+
+
+

3. Which data type do we need? +

+

The shown data type is not the right one for this data (weight of a +cat). Which data type do we need?

+
    +
  • Why did the read.csv() function not choose the correct +data type?
  • +
  • Fill in the gap in the comment with the correct data type for cat +weight!
  • +
+
+
+
+
+
+
+ +
+
+

Scroll up to the section about the type +hierarchy to review the available data types

+
+
+
+
+
+
+ +
+
+
    +
  • Weight is expressed on a continuous scale (real numbers). The R data +type for this is “double” (also known as “numeric”).
  • +
  • The fourth row has the value “2.3 or 2.4”. That is not a single number but two numbers and an English word. Therefore, the “character” data type is chosen. The whole column is now text, because all values in the same column have to be the same data type.
  • +
+
+
+
+
+
+
+ +
+
+

4. Correct the problematic value +

+
+

The code to assign a new weight value to the problematic fourth row +is given. Think first and then execute it: What will be the data type +after assigning a number like in this example? You can check the data +type after executing to see if you were right.

+
+
+
+
+
+ +
+
+

Revisit the hierarchy of data types when two different data types are +combined.

+
+
+
+
+
+
+ +
+
+

Challenge 1 (continued) +

+
+
+

Solution to challenge 1.4

+

The data type of the column “weight” is “character”. The assigned +data type is “double”. Combining two data types yields the data type +that is higher in the following hierarchy:

+
logical < integer < double < complex < character
+

Therefore, the column is still of type character! We need to manually +convert it to “double”.
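A minimal sketch of this coercion rule, using a small stand-in vector instead of the real cats table (the name weight_demo and its values are assumptions made for illustration):

```r
# Stand-in for a column that read.csv() stored as text
weight_demo <- c("2.1", "5.0", "3.2", "2.3 or 2.4")

# Assign a number to the problematic fourth element
weight_demo[4] <- 2.35

# The number is silently converted to text, so the vector stays character
typeof(weight_demo)   # "character"
weight_demo[4]        # "2.35"
```

The assignment succeeds without any message, which is why checking the type afterwards is a good habit.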

+
+
+

5. Convert the column “weight” to the correct data type +

+

Cat weights are numbers, but the column does not have this data type +yet. Coerce the column to floating point numbers.

+
+
+
+
+
+
+ +
+
+

The functions to convert data types start with as.. You +can look for the function further up in the manuscript or use the +RStudio auto-complete function: Type “as.” and then press +the TAB key.

+
+
+
+
+
+
+ +
+
+

Challenge 1 (continued) +

+
+
+

Solution to Challenge 1.5

+

There are two functions that are synonymous for historic reasons:

+
cats$weight <- as.double(cats$weight)
+cats$weight <- as.numeric(cats$weight)
+
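The following sketch shows what the conversion does when a value cannot be read as a number. It assumes a stand-in vector (weight_demo is a made-up name) that still contains the uncorrected text from step 3:

```r
# Stand-in character vector that still holds the problematic entry
weight_demo <- c("2.1", "5.0", "3.2", "2.3 or 2.4")

# Text that cannot be parsed as a number becomes NA (R also emits a warning,
# which is suppressed here only to keep the example quiet)
converted <- suppressWarnings(as.double(weight_demo))
converted                # 2.1 5.0 3.2 NA
is.na(converted[4])      # TRUE

# Both spellings of the function give the same result
identical(as.double(weight_demo[1:3]), as.numeric(weight_demo[1:3]))  # TRUE
```

This is one reason to correct the problematic value first: otherwise the information in that row is lost during conversion.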
+
+
+
+
+
+
+

Some basic vector functions +

+

The combine function, c(), will also append things to an +existing vector:

+
+

R +

+
+ab_vector <- c('a', 'b')
+ab_vector
+
+
+

OUTPUT +

+
[1] "a" "b"
+
+
+

R +

+
+combine_example <- c(ab_vector, 'SWC')
+combine_example
+
+
+

OUTPUT +

+
[1] "a"   "b"   "SWC"
+
+

You can also make series of numbers:

+
+

R +

+
+mySeries <- 1:10
+mySeries
+
+
+

OUTPUT +

+
 [1]  1  2  3  4  5  6  7  8  9 10
+
+
+

R +

+
+seq(10)
+
+
+

OUTPUT +

+
 [1]  1  2  3  4  5  6  7  8  9 10
+
+
+

R +

+
+seq(1,10, by=0.1)
+
+
+

OUTPUT +

+
 [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4
+[16]  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9
+[31]  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4
+[46]  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
+[61]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4
+[76]  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9
+[91] 10.0
+
+

We can ask a few questions about vectors:

+
+

R +

+
+sequence_example <- 20:25
+head(sequence_example, n=2)
+
+
+

OUTPUT +

+
[1] 20 21
+
+
+

R +

+
+tail(sequence_example, n=4)
+
+
+

OUTPUT +

+
[1] 22 23 24 25
+
+
+

R +

+
+length(sequence_example)
+
+
+

OUTPUT +

+
[1] 6
+
+
+

R +

+
+typeof(sequence_example)
+
+
+

OUTPUT +

+
[1] "integer"
+
+

We can get individual elements of a vector by using the bracket +notation:

+
+

R +

+
+first_element <- sequence_example[1]
+first_element
+
+
+

OUTPUT +

+
[1] 20
+
+

To change a single element, use the bracket on the other side of the +arrow:

+
+

R +

+
+sequence_example[1] <- 30
+sequence_example
+
+
+

OUTPUT +

+
[1] 30 21 22 23 24 25
+
+
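Changing an element can also change the type of the whole vector, following the type hierarchy seen earlier. A small sketch, using made-up names (int_demo, int_demo2) so that sequence_example is left alone:

```r
int_demo <- 20:25
typeof(int_demo)      # "integer"

# A plain 30 is a double, so the whole vector is promoted to double
int_demo[1] <- 30
typeof(int_demo)      # "double"

# Writing 30L (an integer literal) keeps the vector as integer
int_demo2 <- 20:25
int_demo2[1] <- 30L
typeof(int_demo2)     # "integer"
```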
+
+ +
+
+

Challenge 2 +

+
+

Start by making a vector with the numbers 1 through 26. Then, +multiply the vector by 2.

+
+
+
+
+
+ +
+
+
+

R +

+
+x <- 1:26
+x <- x * 2
+
+
+
+
+
+
+
+

Lists +

+

Another data structure you’ll want in your bag of tricks is the +list. A list is simpler in some ways than the other types, +because you can put anything you want in it. Remember everything in +the vector must be of the same basic data type, but a list can have +different data types:

+
+

R +

+
+list_example <- list(1, "a", TRUE, 1+4i)
+list_example
+
+
+

OUTPUT +

+
[[1]]
+[1] 1
+
+[[2]]
+[1] "a"
+
+[[3]]
+[1] TRUE
+
+[[4]]
+[1] 1+4i
+
+

When printing the object structure with str(), we see +the data types of all elements:

+
+

R +

+
+str(list_example)
+
+
+

OUTPUT +

+
List of 4
+ $ : num 1
+ $ : chr "a"
+ $ : logi TRUE
+ $ : cplx 1+4i
+
+

What is the use of lists? They can organize data of different +types. For example, you can organize different tables that +belong together, similar to spreadsheets in Excel. But there are many +other uses, too.
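As a rough illustration of keeping related tables together, the sketch below puts two small data frames into one list. The table names and their contents are invented for this example:

```r
# Two small, made-up tables that belong to the same study
sites_demo  <- data.frame(site = c("A", "B"), elevation_m = c(120, 450))
counts_demo <- data.frame(site = c("A", "A", "B"), count = c(3, 5, 2))

# One list holds both tables, even though their shapes differ
study_demo <- list(sites_demo, counts_demo)

length(study_demo)   # 2
str(study_demo)      # shows the structure of each table in turn
```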

+

We will see another example, which may surprise you, in the next +chapter.

+

To retrieve one of the elements of a list, use the double +bracket:

+
+

R +

+
+list_example[[2]]
+
+
+

OUTPUT +

+
[1] "a"
+
+
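It can help to compare the double bracket with the single bracket on the same list. This sketch rebuilds the list under a different, made-up name (list_demo):

```r
list_demo <- list(1, "a", TRUE, 1+4i)

# Single bracket: returns a shorter list that still wraps the element
typeof(list_demo[2])      # "list"

# Double bracket: returns the element itself
typeof(list_demo[[2]])    # "character"

# Single brackets can select several elements at once
length(list_demo[2:3])    # 2
```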

The elements of lists can also have names. They are +given by prepending them to the values, separated by an equals +sign:

+
+

R +

+
+another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE )
+another_list
+
+
+

OUTPUT +

+
$title
+[1] "Numbers"
+
+$numbers
+ [1]  1  2  3  4  5  6  7  8  9 10
+
+$data
+[1] TRUE
+
+

This results in a named list. Now our object has a new +capability: we can access single elements in an additional +way!

+
+

R +

+
+another_list$title
+
+
+

OUTPUT +

+
[1] "Numbers"
+
+
+

Names +

+
+

With names, we can give meaning to elements. It is the first time +that we have not only the data, but also explanatory +information. It is metadata that can be stuck to the object +like a label. In R, this is called an attribute. Some +attributes enable us to do more with our object, for example, like here, +accessing an element by a self-defined name.
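To see that names really are stored as an attribute, one option is the base R function attributes(). The object names below (named_demo, unnamed_demo) are made up for this sketch:

```r
named_demo   <- list(title = "Numbers", numbers = 1:10, data = TRUE)
unnamed_demo <- list("Numbers", 1:10, TRUE)

# The named list carries a "names" attribute
attributes(named_demo)

# The unnamed list has no attributes at all
attributes(unnamed_demo)   # NULL
```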

+
+

Accessing vectors and lists by name +

+

We have already seen how to generate a named list. The way to +generate a named vector is very similar. You have seen this function +before:

+
+

R +

+
+pizza_price <- c( pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50 )
+
+

The way to retrieve elements is different, though:

+
+

R +

+
+pizza_price["pizzasubito"]
+
+
+

OUTPUT +

+
pizzasubito 
+       5.64 
+
+

The approach used for the list does not work:

+
+

R +

+
+pizza_price$pizzafresh
+
+
+

ERROR +

+
Error in pizza_price$pizzafresh: $ operator is invalid for atomic vectors
+
+

It will pay off if you remember this error message, because you will meet it +in your own analyses. It means that you have just tried accessing an +element as if it were in a list, but it is actually in a vector.
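If you are unsure whether an object is a vector or a list, the double bracket with a name works for both. A short sketch with made-up objects (price_vec, price_list):

```r
price_vec  <- c(pizzasubito = 5.64, pizzafresh = 6.60)
price_list <- list(pizzasubito = 5.64, pizzafresh = 6.60)

# Double brackets with a name work on the vector and on the list
price_vec[["pizzafresh"]]    # 6.6
price_list[["pizzafresh"]]   # 6.6

# The $ shortcut is available for the list only
price_list$pizzafresh        # 6.6

# These helpers tell you which kind of object you have
is.atomic(price_vec)         # TRUE
is.list(price_list)          # TRUE
```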

+
+
+

Accessing and changing names +

+

If you are only interested in the names, use the names() +function:

+
+

R +

+
+names(pizza_price)
+
+
+

OUTPUT +

+
[1] "pizzasubito" "pizzafresh"  "callapizza" 
+
+

We have seen how to access and change single elements of a vector. +The same is possible for names:

+
+

R +

+
+names(pizza_price)[3]
+
+
+

OUTPUT +

+
[1] "callapizza"
+
+
+

R +

+
+names(pizza_price)[3] <- "call-a-pizza"
+pizza_price
+
+
+

OUTPUT +

+
 pizzasubito   pizzafresh call-a-pizza 
+        5.64         6.60         4.50 
+
+
+
+ +
+
+

Challenge 3 +

+
+
    +
  • What is the data type of the names of pizza_price? You +can find out using the str() or typeof() +functions.
  • +
+
+
+
+
+
+ +
+
+

You get the names of an object by wrapping the object name inside +names(...). Similarly, you get the data type of the names +by again wrapping the whole code in typeof(...):

+
typeof(names(pizza_price))
+

Alternatively, use a new variable if this is easier for you to +read:

+
n <- names(pizza_price)
+typeof(n)
+
+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

Instead of just changing some of the names a vector/list already has, +you can also set all names of an object by writing code like (replace +ALL CAPS text):

+
names( OBJECT ) <-  CHARACTER_VECTOR
+

Create a vector that gives the number for each letter in the +alphabet!

+
    +
  1. Generate a vector called letter_no with the sequence of +numbers from 1 to 26!
  2. +
  3. R has a built-in object called LETTERS. It is a +26-character vector, from A to Z. Set the names of the number sequence +to these 26 letters.
  4. +
  5. Test yourself by calling letter_no["B"], which should +give you the number 2!
  6. +
+
+
+
+
+
+ +
+
+
letter_no <- 1:26   # or seq(1,26)
+names(letter_no) <- LETTERS
+letter_no["B"]
+
+
+
+
+
+

Data frames +

+
+

We have seen data frames at the very beginning of this lesson; they +represent a table of data. We didn’t go into much further detail with +our example cat data frame:

+
+

R +

+
+cats
+
+
+

OUTPUT +

+
    coat weight likes_string
+1 calico    2.1         TRUE
+2  black    5.0        FALSE
+3  tabby    3.2         TRUE
+
+

We can now understand something a bit surprising in our data.frame; +what happens if we run:

+
+

R +

+
+typeof(cats)
+
+
+

OUTPUT +

+
[1] "list"
+
+

We see that data.frames look like lists ‘under the hood’. Think again +what we heard about what lists can be used for:

+
+

Lists organize data of different types

+
+

Columns of a data frame are vectors of different types, that are +organized by belonging to the same table.

+

A data.frame is really a list of vectors. It is a special list in +which all the vectors must have the same length.

+

How is this “special”-ness written into the object, so that R does +not treat it like any other list, but as a table?

+
+

R +

+
+class(cats)
+
+
+

OUTPUT +

+
[1] "data.frame"
+
+

A class, just like names, is an attribute attached +to the object. It tells us what this object means for humans.

+

You might wonder: Why do we need another +what-type-of-object-is-this function when we already have +typeof()? That function tells us how the object is +constructed in the computer. The class is +the meaning of the object for humans. Consequently, +what typeof() returns is fixed in R (mainly the +five data types), whereas the output of class() is +diverse and extendable by R packages.
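The difference can be sketched with two objects. cats_demo is a made-up copy of the cat table, and date_demo is an extra example using a base R date, which is not part of the original lesson data:

```r
cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
                        weight = c(2.1, 5.0, 3.2),
                        likes_string = c(TRUE, FALSE, TRUE))

typeof(cats_demo)              # "list": how it is built
class(cats_demo)               # "data.frame": what it means
names(attributes(cats_demo))   # includes "names", "class" and "row.names"

# A date is stored as a number, but its class says it is a date
date_demo <- as.Date("2023-10-26")
typeof(date_demo)              # "double"
class(date_demo)               # "Date"
```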

+

In our cats example, we have a character, a double and a +logical variable. As we have seen already, each column of a data.frame is +a vector.

+
+

R +

+
+cats$coat
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+
+

R +

+
+cats[,1]
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+
+

R +

+
+typeof(cats[,1])
+
+
+

OUTPUT +

+
[1] "character"
+
+
+

R +

+
+str(cats[,1])
+
+
+

OUTPUT +

+
 chr [1:3] "calico" "black" "tabby"
+
+

Each row is an observation of different variables, itself a +data.frame, and thus can be composed of elements of different types.

+
+

R +

+
+cats[1,]
+
+
+

OUTPUT +

+
    coat weight likes_string
+1 calico    2.1         TRUE
+
+
+

R +

+
+typeof(cats[1,])
+
+
+

OUTPUT +

+
[1] "list"
+
+
+

R +

+
+str(cats[1,])
+
+
+

OUTPUT +

+
'data.frame':	1 obs. of  3 variables:
+ $ coat        : chr "calico"
+ $ weight      : num 2.1
+ $ likes_string: logi TRUE
+
+
+
+ +
+
+

Challenge 5 +

+
+

There are several subtly different ways to call variables, +observations and elements from data.frames:

+
    +
  • cats[1]
  • +
  • cats[[1]]
  • +
  • cats$coat
  • +
  • cats["coat"]
  • +
  • cats[1, 1]
  • +
  • cats[, 1]
  • +
  • cats[1, ]
  • +
+

Try out these examples and explain what is returned by each one.

+

Hint: Use the function typeof() to examine what +is returned in each case.

+
+
+
+
+
+ +
+
+
+

R +

+
+cats[1]
+
+
+

OUTPUT +

+
    coat
+1 calico
+2  black
+3  tabby
+
+

We can think of a data frame as a list of vectors. The single bracket +[1] returns the first slice of the list, as another list. +In this case it is the first column of the data frame.

+
+

R +

+
+cats[[1]]
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+

The double bracket [[1]] returns the contents of the list +item. In this case it is the contents of the first column, a +vector of type character.

+
+

R +

+
+cats$coat
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+

This example uses the $ character to address items by +name. coat is the first column of the data frame, again a +vector of type character.

+
+

R +

+
+cats["coat"]
+
+
+

OUTPUT +

+
    coat
+1 calico
+2  black
+3  tabby
+
+

Here we are using a single bracket ["coat"], replacing the +index number with the column name. Like example 1, the returned object +is a list.

+
+

R +

+
+cats[1, 1]
+
+
+

OUTPUT +

+
[1] "calico"
+
+

This example uses a single bracket, but this time we provide row and +column coordinates. The returned object is the value in row 1, column 1. +The object is a vector of type character.

+
+

R +

+
+cats[, 1]
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+

Like the previous example, we use single brackets and provide row and +column coordinates. The row coordinate is not specified, so R interprets +this missing value as all the elements in this column and +returns them as a vector.

+
+

R +

+
+cats[1, ]
+
+
+

OUTPUT +

+
    coat weight likes_string
+1 calico    2.1         TRUE
+
+

Again we use the single bracket with row and column coordinates. The +column coordinate is not specified. The return value is a list +containing all the values in the first row.

+
+
+
+
+
+
+ +
+
+

Tip: Renaming data frame columns +

+
+

Data frames have column names, which can be accessed with the +names() function.

+
+

R +

+
+names(cats)
+
+
+

OUTPUT +

+
[1] "coat"         "weight"       "likes_string"
+
+

If you want to rename the second column of cats, you can +assign a new name to the second element of names(cats).

+
+

R +

+
+names(cats)[2] <- "weight_kg"
+cats
+
+
+

OUTPUT +

+
    coat weight_kg likes_string
+1 calico       2.1         TRUE
+2  black       5.0        FALSE
+3  tabby       3.2         TRUE
+
+
+
+
+
+

Matrices +

+

Last but not least is the matrix. We can declare a matrix full of +zeros:

+
+

R +

+
+matrix_example <- matrix(0, ncol=6, nrow=3)
+matrix_example
+
+
+

OUTPUT +

+
     [,1] [,2] [,3] [,4] [,5] [,6]
+[1,]    0    0    0    0    0    0
+[2,]    0    0    0    0    0    0
+[3,]    0    0    0    0    0    0
+
+

What makes it special is the dim() attribute:

+
+

R +

+
+dim(matrix_example)
+
+
+

OUTPUT +

+
[1] 3 6
+
+

And similar to other data structures, we can ask things about our +matrix:

+
+

R +

+
+typeof(matrix_example)
+
+
+

OUTPUT +

+
[1] "double"
+
+
+

R +

+
+class(matrix_example)
+
+
+

OUTPUT +

+
[1] "matrix" "array" 
+
+
+

R +

+
+str(matrix_example)
+
+
+

OUTPUT +

+
 num [1:3, 1:6] 0 0 0 0 0 0 0 0 0 0 ...
+
+
+

R +

+
+nrow(matrix_example)
+
+
+

OUTPUT +

+
[1] 3
+
+
+

R +

+
+ncol(matrix_example)
+
+
+

OUTPUT +

+
[1] 6
+
+
+
+ +
+
+

Challenge 6 +

+
+

What do you think will be the result of +length(matrix_example)? Try it. Were you right? Why / why +not?

+
+
+
+
+
+ +
+
+


+
+

R +

+
+matrix_example <- matrix(0, ncol=6, nrow=3)
+length(matrix_example)
+
+
+

OUTPUT +

+
[1] 18
+
+

Because a matrix is a vector with added dimension attributes, +length gives you the total number of elements in the +matrix.
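One way to convince yourself of this is to turn an ordinary vector into a matrix by giving it a dim attribute. vec_demo is a made-up name for this sketch:

```r
vec_demo <- 1:6
length(vec_demo)          # 6

# Adding dimensions turns the vector into a 2 x 3 matrix
dim(vec_demo) <- c(2, 3)
is.matrix(vec_demo)       # TRUE

# The elements themselves are unchanged, so the length is still 6
length(vec_demo)          # 6

# Values fill the matrix column by column
vec_demo[2, 3]            # 6
```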

+
+
+
+
+
+
+ +
+
+

Challenge 7 +

+
+

Make another matrix, this time containing the numbers 1:50, with 5 +columns and 10 rows. Did the matrix function fill your +matrix by column, or by row, as its default behaviour? See if you can +figure out how to change this. (hint: read the documentation for +matrix!)

+
+
+
+
+
+ +
+
+


+
+

R +

+
+x <- matrix(1:50, ncol=5, nrow=10)
+x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row
+
+
+
+
+
+
+
+ +
+
+

Challenge 8 +

+
+

Create a list of length two containing a character vector for each of +the sections in this part of the workshop:

+
    +
  • Data types
  • +
  • Data structures
  • +
+

Populate each character vector with the names of the data types and +data structures we’ve seen so far.

+
+
+
+
+
+ +
+
+
+

R +

+
+dataTypes <- c('double', 'complex', 'integer', 'character', 'logical')
+dataStructures <- c('data.frame', 'vector', 'list', 'matrix')
+answer <- list(dataTypes, dataStructures)
+
+

Note: it’s nice to make a list in big writing on the board or taped +to the wall listing all of these types and structures - leave it up for +the rest of the workshop to remind people of the importance of these +basics.

+
+
+
+
+
+
+ +
+
+

Challenge 9 +

+
+

Consider the R output of the matrix below:

+
+

OUTPUT +

+
     [,1] [,2]
+[1,]    4    1
+[2,]    9    5
+[3,]   10    7
+
+

What was the correct command used to write this matrix? Examine each +command and try to figure out the correct one before typing them. Think +about what matrices the other commands will produce.

+
    +
  1. matrix(c(4, 1, 9, 5, 10, 7), nrow = 3)
  2. +
  3. matrix(c(4, 9, 10, 1, 5, 7), ncol = 2, byrow = TRUE)
  4. +
  5. matrix(c(4, 9, 10, 1, 5, 7), nrow = 2)
  6. +
  7. matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)
  8. +
+
+
+
+
+
+ +
+
+


+
+

R +

+
+matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)
+
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
    +
  • Use read.csv to read tabular data in R.
  • +
  • The basic data types in R are double, integer, complex, logical, and +character.
  • +
  • Data structures such as data frames or matrices are built on top of +lists and vectors, with some added attributes.
  • +
+
+
+
+
+

Content from Exploring Data Frames

+
+

Last updated on 2023-10-26

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I manipulate a data frame?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • Add and remove rows or columns.
  • +
  • Append two data frames.
  • +
  • Display basic properties of data frames including size and class of +the columns, names, and first few rows.
  • +
+
+
+
+
+
+

At this point, you’ve seen it all: in the last lesson, we toured all +the basic data types and data structures in R. Everything you do will be +a manipulation of those tools. But most of the time, the star of the +show is the data frame—the table that we created by loading information +from a csv file. In this lesson, we’ll learn a few more things about +working with data frames.

+

Adding columns and rows in data frames +

+
+

We already learned that the columns of a data frame are vectors, so +that our data are consistent in type throughout the columns. As such, if +we want to add a new column, we can start by making a new vector:

+
+

R +

+
+age <- c(2, 3, 5)
+cats
+
+
+

OUTPUT +

+
    coat weight likes_string
+1 calico    2.1            1
+2  black    5.0            0
+3  tabby    3.2            1
+
+

We can then add this as a column via:

+
+

R +

+
+cbind(cats, age)
+
+
+

OUTPUT +

+
    coat weight likes_string age
+1 calico    2.1            1   2
+2  black    5.0            0   3
+3  tabby    3.2            1   5
+
+

Note that if we tried to add a vector of ages with a different number +of entries than the number of rows in the data frame, it would fail:

+
+

R +

+
+age <- c(2, 3, 5, 12)
+cbind(cats, age)
+
+
+

ERROR +

+
Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 4
+
+
+

R +

+
+age <- c(2, 3)
+cbind(cats, age)
+
+
+

ERROR +

+
Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 2
+
+

Why didn’t this work? Of course, R wants to see one element in our +new column for every row in the table:

+
+

R +

+
+nrow(cats)
+
+
+

OUTPUT +

+
[1] 3
+
+
+

R +

+
+length(age)
+
+
+

OUTPUT +

+
[1] 2
+
+

So for it to work we need to have nrow(cats) = +length(age). Let’s overwrite the content of cats with our +new data frame.

+
+

R +

+
+age <- c(2, 3, 5)
+cats <- cbind(cats, age)
+
+
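A column can also be added by assigning to a new name with $. The sketch below is an alternative to the cbind() call above, not the approach the lesson uses, and it works on a made-up copy of the table (cats_demo):

```r
cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
                        weight = c(2.1, 5.0, 3.2))

# Assigning to a column name that does not exist yet creates the column
cats_demo$age <- c(2, 3, 5)

ncol(cats_demo)    # 3
names(cats_demo)   # "coat" "weight" "age"
```

As with cbind(), the new vector needs one element per row of the table.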

Now how about adding rows? We already know that the rows of a data +frame are lists:

+
+

R +

+
+newRow <- list("tortoiseshell", 3.3, TRUE, 9)
+cats <- rbind(cats, newRow)
+
+

Let’s confirm that our new row was added correctly.

+
+

R +

+
+cats
+
+
+

OUTPUT +

+
           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+
+
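A plain list relies on the values being in exactly the right order. As a hedged alternative, the new row can be written as a one-row data frame, because rbind() matches data frame columns by name. The names cats_demo and new_row_demo are made up for this sketch:

```r
cats_demo <- data.frame(coat = c("calico", "black"),
                        weight = c(2.1, 5.0),
                        likes_string = c(TRUE, FALSE))

# The columns are deliberately given in a different order
new_row_demo <- data.frame(likes_string = TRUE,
                           coat = "tortoiseshell",
                           weight = 3.3)

cats_demo <- rbind(cats_demo, new_row_demo)

nrow(cats_demo)       # 3
names(cats_demo)      # "coat" "weight" "likes_string"
cats_demo$weight[3]   # 3.3
```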

Removing rows +

+
+

We now know how to add rows and columns to our data frame in R. Now +let’s learn to remove rows.

+
+

R +

+
+cats
+
+
+

OUTPUT +

+
           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+
+

We can ask for a data frame minus the last row:

+
+

R +

+
+cats[-4, ]
+
+
+

OUTPUT +

+
    coat weight likes_string age
+1 calico    2.1            1   2
+2  black    5.0            0   3
+3  tabby    3.2            1   5
+
+

Notice the comma with nothing after it to indicate that we want to +drop the entire fourth row.

+

Note: we could also remove several rows at once by putting the row +numbers inside of a vector, for example: +cats[c(-3,-4), ]
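A short sketch of removing several rows at once, on a made-up table (rows_demo). It also shows that the original table is unchanged unless the result is assigned back:

```r
rows_demo <- data.frame(coat = c("calico", "black", "tabby", "tortoiseshell"),
                        age = c(2, 3, 5, 9))

# Two equivalent ways to write "everything except rows 3 and 4"
kept_a <- rows_demo[c(-3, -4), ]
kept_b <- rows_demo[-c(3, 4), ]
identical(kept_a, kept_b)   # TRUE

nrow(kept_a)                # 2
nrow(rows_demo)             # 4: the original still has all its rows
```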

+

Removing columns +

+
+

We can also remove columns in our data frame. What if we want to +remove the column “age”? We can remove it in two ways: by column +number or by column name.

+
+

R +

+
+cats[,-4]
+
+
+

OUTPUT +

+
           coat weight likes_string
+1        calico    2.1            1
+2         black    5.0            0
+3         tabby    3.2            1
+4 tortoiseshell    3.3            1
+
+

Notice the comma with nothing before it, indicating we want to keep +all of the rows.

+

Alternatively, we can drop the column by using its name and the +%in% operator. The %in% operator goes through +each element of its left argument, in this case the names of +cats, and asks, “Does this element occur in the second +argument?”

+
+

R +

+
+drop <- names(cats) %in% c("age")
+cats[,!drop]
+
+
+

OUTPUT +

+
           coat weight likes_string
+1        calico    2.1            1
+2         black    5.0            0
+3         tabby    3.2            1
+4 tortoiseshell    3.3            1
+
+

We will cover subsetting with logical operators like +%in% in more detail in the next episode. See the section Subsetting through other logical +operations
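To see what the %in% step produces, it can help to look at the logical vector on its own. This sketch uses a plain character vector of column names (col_names_demo, a made-up name) instead of a full data frame:

```r
col_names_demo <- c("coat", "weight", "likes_string", "age")

# TRUE marks the names that appear in the right-hand vector
drop_demo <- col_names_demo %in% c("age")
drop_demo                     # FALSE FALSE FALSE  TRUE

# ! flips every value, so TRUE now marks what we want to keep
!drop_demo                    #  TRUE  TRUE  TRUE FALSE

col_names_demo[!drop_demo]    # "coat" "weight" "likes_string"
```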

+

Appending to a data frame +

+
+

The key to remember when adding data to a data frame is that +columns are vectors and rows are lists. We can also glue two +data frames together with rbind:

+
+

R +

+
+cats <- rbind(cats, cats)
+cats
+
+
+

OUTPUT +

+
           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+5        calico    2.1            1   2
+6         black    5.0            0   3
+7         tabby    3.2            1   5
+8 tortoiseshell    3.3            1   9
+
+

But now the row names are unnecessarily complicated. We can remove +the rownames, and R will automatically re-name them sequentially:

+
+

R +

+
+rownames(cats) <- NULL
+cats
+
+
+

OUTPUT +

+
           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+5        calico    2.1            1   2
+6         black    5.0            0   3
+7         tabby    3.2            1   5
+8 tortoiseshell    3.3            1   9
+
+
+
+ +
+
+

Challenge 1 +

+
+

You can create a new data frame right from within R with the +following syntax:

+
+

R +

+
+df <- data.frame(id = c("a", "b", "c"),
+                 x = 1:3,
+                 y = c(TRUE, TRUE, FALSE))
+
+

Make a data frame that holds the following information for +yourself:

+
    +
  • first name
  • +
  • last name
  • +
  • lucky number
  • +
+

Then use rbind to add an entry for the people sitting +beside you. Finally, use cbind to add a column with each +person’s answer to the question, “Is it time for coffee break?”

+
+
+
+
+
+ +
+
+
+

R +

+
+df <- data.frame(first = c("Grace"),
+                 last = c("Hopper"),
+                 lucky_number = c(0))
+df <- rbind(df, list("Marie", "Curie", 238) )
+df <- cbind(df, coffeetime = c(TRUE,TRUE))
+
+
+
+
+
+

Realistic example +

+
+

So far, you have seen the basics of manipulating data frames with our +cat data; now let’s use those skills to digest a more realistic dataset. +Let’s read in the gapminder dataset that we downloaded +previously:

+
+

R +

+
+gapminder <- read.csv("data/gapminder_data.csv")
+
+
+
+ +
+
+

Miscellaneous Tips +

+
+
    +
  • Another type of file you might encounter is the tab-separated value +file (.tsv). To specify a tab as the separator, pass sep = "\t" to read.csv(), or use +read.delim().

  • +
  • Files can also be downloaded directly from the Internet into a +local folder of your choice onto your computer using the +download.file function. The read.csv function +can then be executed to read the downloaded file from the download +location, for example,

  • +
+
+

R +

+
+download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
+gapminder <- read.csv("data/gapminder_data.csv")
+
+
    +
  • Alternatively, you can also read in files directly into R from the +Internet by replacing the file paths with a web address in +read.csv. One should note that in doing this no local copy +of the csv file is first saved onto your computer. For example,
  • +
+
+

R +

+
+gapminder <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv")
+
+
    +
  • You can read directly from excel spreadsheets without converting +them to plain text first by using the readxl +package.

  • +
  • The argument “stringsAsFactors” can be useful to tell R how to +read strings either as factors or as character strings. From R version +4.0 onwards, all strings are read in as characters by default, but in +earlier versions of R, strings are read in as factors by default. For +more information, see the call-out in the +previous episode.

  • +
+
+
+
+
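The tip about tab-separated files can be tried without downloading anything. The sketch below writes a tiny, made-up table to a temporary file and reads it back in both ways; the file contents and variable names are assumptions for illustration:

```r
# Write a small tab-separated file to a temporary location
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("coat\tweight", "calico\t2.1", "black\t5.0"), tsv_path)

# read.delim() expects tabs by default
from_delim <- read.delim(tsv_path)

# read.csv() needs to be told that the separator is a tab
from_csv <- read.csv(tsv_path, sep = "\t")

identical(from_delim, from_csv)   # TRUE
dim(from_delim)                   # 2 2
```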

Let’s investigate gapminder a bit; the first thing we should always +do is check out what the data looks like with str:

+
+

R +

+
+str(gapminder)
+
+
+

OUTPUT +

+
'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...
+
+

An additional method for examining the structure of gapminder is to +use the summary function. This function can be used on +various objects in R. For data frames, summary yields a +numeric, tabular, or descriptive summary of each column. Numeric or +integer columns are described by descriptive statistics (quartiles +and mean), and character columns by their length, class, and mode.

+
+

R +

+
+summary(gapminder)
+
+
+

OUTPUT +

+
   country               year           pop             continent        
+ Length:1704        Min.   :1952   Min.   :6.001e+04   Length:1704       
+ Class :character   1st Qu.:1966   1st Qu.:2.794e+06   Class :character  
+ Mode  :character   Median :1980   Median :7.024e+06   Mode  :character  
+                    Mean   :1980   Mean   :2.960e+07                     
+                    3rd Qu.:1993   3rd Qu.:1.959e+07                     
+                    Max.   :2007   Max.   :1.319e+09                     
+    lifeExp        gdpPercap       
+ Min.   :23.60   Min.   :   241.2  
+ 1st Qu.:48.20   1st Qu.:  1202.1  
+ Median :60.71   Median :  3531.8  
+ Mean   :59.47   Mean   :  7215.3  
+ 3rd Qu.:70.85   3rd Qu.:  9325.5  
+ Max.   :82.60   Max.   :113523.1  
+
+

Along with the str and summary functions, +we can examine individual columns of the data frame with our +typeof function:

+
+

R +

+
+typeof(gapminder$year)
+
+
+

OUTPUT +

+
[1] "integer"
+
+
+

R +

+
+typeof(gapminder$country)
+
+
+

OUTPUT +

+
[1] "character"
+
+
+

R +

+
+str(gapminder$country)
+
+
+

OUTPUT +

+
 chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+
+

We can also interrogate the data frame for information about its +dimensions; remembering that str(gapminder) said there were +1704 observations of 6 variables in gapminder, what do you think the +following will produce, and why?

+
+

R +

+
+length(gapminder)
+
+
+

OUTPUT +

+
[1] 6
+
+

A fair guess would have been to say that the length of a data frame +would be the number of rows it has (1704), but this is not the case; +remember, a data frame is a list of vectors and factors:

+
+

R +

+
+typeof(gapminder)
+
+
+

OUTPUT +

+
[1] "list"
+
+

When length gave us 6, it’s because gapminder is built +out of a list of 6 columns. To get the number of rows and columns in our +dataset, try:

+
+

R +

+
+nrow(gapminder)
+
+
+

OUTPUT +

+
[1] 1704
+
+
+

R +

+
+ncol(gapminder)
+
+
+

OUTPUT +

+
[1] 6
+
+

Or, both at once:

+
+

R +

+
+dim(gapminder)
+
+
+

OUTPUT +

+
[1] 1704    6
+
+

We’ll also likely want to know what the titles of all the columns +are, so we can ask for them later:

+
+

R +

+
+colnames(gapminder)
+
+
+

OUTPUT +

+
[1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"
+
+

At this stage, it’s important to ask ourselves if the structure R is +reporting matches our intuition or expectations; do the basic data types +reported for each column make sense? If not, we need to sort any +problems out now before they turn into bad surprises down the road, +using what we’ve learned about how R interprets data, and the importance +of strict consistency in how we record our data.

+

Once we’re happy that the data types and structures seem reasonable, +it’s time to start digging into our data proper. Check out the first few +lines:

+
+

R +

+
+head(gapminder)
+
+
+

OUTPUT +

+
      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134
+
+
+
+ +
+
+

Challenge 2 +

+
+

It’s good practice to also check the last few lines of your data and +some in the middle. How would you do this?

+

Searching for ones specifically in the middle isn’t too hard, but we +could ask for a few lines at random. How would you code this?

+
+
+
+
+
+ +
+
+

Checking the last few lines is relatively simple, as R already has a +function for this:

+
+

R +

+
+tail(gapminder)
+tail(gapminder, n = 15)
+
+

What about a few arbitrary rows just in case something is odd in the +middle?

+
+

Tip: There are several ways to achieve this. +

+

The solution here presents one form of using nested functions, i.e. a +function passed as an argument to another function. This might sound +like a new concept, but you are already using it! Remember +my_dataframe[rows, cols] will print to screen your data frame with the +number of rows and columns you asked for (although you might have asked +for a range or named columns for example). How would you get the last +row if you don’t know how many rows your data frame has? R has a +function for this. What about getting a (pseudorandom) sample? R also +has a function for this.

+
+

R +

+
+gapminder[sample(nrow(gapminder), 5), ]
+
+
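Because sample() is random, every run shows different rows. If you want the same rows each time, one option is to fix the random seed first with set.seed(). The sketch below uses a made-up table (sample_demo) and an arbitrary seed value:

```r
sample_demo <- data.frame(id = 1:20, value = seq(10, 200, by = 10))

# Fixing the seed makes the "random" draw repeatable
set.seed(42)
first_draw <- sample_demo[sample(nrow(sample_demo), 5), ]

set.seed(42)
second_draw <- sample_demo[sample(nrow(sample_demo), 5), ]

identical(first_draw, second_draw)   # TRUE
nrow(first_draw)                     # 5
```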
+
+
+
+
+

To make sure our analysis is reproducible, we should put the code +into a script file so we can come back to it later.

+
+
+ +
+
+

Challenge 3 +

+
+

Go to File -> New File -> R Script, and write an R script to +load in the gapminder dataset. Put it in the scripts/ +directory and add it to version control.

+

Run the script using the source function, using the file +path as its argument (or by pressing the “source” button in +RStudio).

+
+
+
+
+
+ +
+
+

The source function can be used to use a script within a +script. Assume you would like to load the same type of file over and +over again and therefore you need to specify the arguments to fit the +needs of your file. Instead of writing the necessary argument again and +again you could just write it once and save it as a script. Then, you +can use source("Your_Script_containing_the_load_function") +in a new script to use the function of that script without writing +everything again. Check out ?source to find out more.
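The idea of a script within a script can be sketched with temporary files, so nothing in your project is touched. The helper name load_table_demo and the file contents are invented for this example:

```r
# Write a small helper script that defines a loading function
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "load_table_demo <- function(path) {",
  "  read.csv(path, stringsAsFactors = FALSE)",
  "}"
), script_path)

# Running the script makes the function available in this session
source(script_path)

# Use the function on a tiny made-up csv file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("coat,weight", "calico,2.1"), csv_path)
loaded <- load_table_demo(csv_path)

nrow(loaded)     # 1
names(loaded)    # "coat" "weight"
```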

+
+

R +

+
+download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
+gapminder <- read.csv(file = "data/gapminder_data.csv")
+
+

To run the script and load the data into the gapminder +variable:

+
+

R +

+
+source(file = "scripts/load-gapminder.R")
+
+
+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

Read the output of str(gapminder) again; this time, use +what you’ve learned about lists and vectors, as well as the output of +functions like colnames and dim to explain +what everything that str prints out for gapminder means. If +there are any parts you can’t interpret, discuss with your +neighbors!

+
+
+
+
+
+ +
+
+

The object gapminder is a data frame with columns

+
    +
  • +country and continent are character +strings.
  • +
  • +year is an integer vector.
  • +
  • +pop, lifeExp, and gdpPercap +are numeric vectors.
  • +
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
    +
  • Use cbind() to add a new column to a data frame.
  • +
  • Use rbind() to add a new row to a data frame.
  • +
  • Remove rows from a data frame.
  • +
  • Use str(), summary(), nrow(), +ncol(), dim(), colnames(), +rownames(), head(), and typeof() +to understand the structure of a data frame.
  • +
  • Read in a csv file using read.csv().
  • +
  • Understand what length() of a data frame +represents.
  • +
+
+
+
+

Content from Subsetting Data

+
+

Last updated on 2023-10-26

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I work with subsets of data in R?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • To be able to subset vectors, factors, matrices, lists, and data +frames
  • +
  • To be able to extract individual and multiple elements: by index, by +name, using comparison operations
  • +
  • To be able to skip and remove elements from various data +structures.
  • +
+
+
+
+
+
+

R has many powerful subset operators. Mastering them will allow you +to easily perform complex operations on any kind of dataset.

+

There are six different ways we can subset any kind of object, and +three different subsetting operators for the different data +structures.

+

Let’s start with the workhorse of R: a simple numeric vector.

+
+

R +

+
+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')
+x
+
+
+

OUTPUT +

+
  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 
+
+
+
+ +
+
+

Atomic vectors +

+
+

In R, simple vectors containing character strings, numbers, or +logical values are called atomic vectors because they can’t be +further simplified.

+
+
+
+

So now that we’ve created a dummy vector to play with, how do we get +at its contents?

+

Accessing elements using their indices +

+
+

To extract elements of a vector we can give their corresponding +index, starting from one:

+
+

R +

+
+x[1]
+
+
+

OUTPUT +

+
  a 
+5.4 
+
+
+

R +

+
+x[4]
+
+
+

OUTPUT +

+
  d 
+4.8 
+
+

It may look different, but the square brackets operator is a +function. For vectors (and matrices), it means “get me the nth +element”.
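Because the square brackets operator is an ordinary function, it can also be called like one by quoting its name in backticks. The following is a minimal sketch; the vector x is redefined so that the block runs on its own, and the variable names first_by_bracket and first_by_call are only illustrative.

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# The usual bracket syntax ...
first_by_bracket <- x[1]

# ... and the same operation written as an explicit function call
first_by_call <- `[`(x, 1)

identical(first_by_bracket, first_by_call)
```

The function form is rarely written by hand, but it is handy to know when passing the operator to functions such as sapply().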

+

We can ask for multiple elements at once:

+
+

R +

+
+x[c(1, 3)]
+
+
+

OUTPUT +

+
  a   c 
+5.4 7.1 
+
+

Or slices of the vector:

+
+

R +

+
+x[1:4]
+
+
+

OUTPUT +

+
  a   b   c   d 
+5.4 6.2 7.1 4.8 
+
+

The : operator creates a sequence of numbers from the left element to the right.

+
+

R +

+
+1:4
+
+
+

OUTPUT +

+
[1] 1 2 3 4
+
+
+

R +

+
+c(1, 2, 3, 4)
+
+
+

OUTPUT +

+
[1] 1 2 3 4
+
+

We can ask for the same element multiple times:

+
+

R +

+
+x[c(1,1,3)]
+
+
+

OUTPUT +

+
  a   a   c 
+5.4 5.4 7.1 
+
+

If we ask for an index beyond the length of the vector, R will return +a missing value:

+
+

R +

+
+x[6]
+
+
+

OUTPUT +

+
<NA> 
+  NA 
+
+

This is a vector of length one containing an NA, whose +name is also NA.

+

If we ask for the 0th element, we get an empty vector:

+
+

R +

+
+x[0]
+
+
+

OUTPUT +

+
named numeric(0)
+
+
+
+ +
+
+

Vector numbering in R starts at 1 +

+
+

In many programming languages (C and Python, for example), the first +element of a vector has an index of 0. In R, the first element is 1.

+
+
+
+

Skipping and removing elements +

+
+

If we use a negative number as the index of a vector, R will return +every element except for the one specified:

+
+

R +

+
+x[-2]
+
+
+

OUTPUT +

+
  a   c   d   e 
+5.4 7.1 4.8 7.5 
+
+

We can skip multiple elements:

+
+

R +

+
+x[c(-1, -5)]  # or x[-c(1,5)]
+
+
+

OUTPUT +

+
  b   c   d 
+6.2 7.1 4.8 
+
+
+
+ +
+
+

Tip: Order of operations +

+
+

A common trip up for novices occurs when trying to skip slices of a +vector. It’s natural to try to negate a sequence like so:

+
+

R +

+
+x[-1:3]
+
+

This gives a somewhat cryptic error:

+
+

ERROR +

+
Error in x[-1:3]: only 0's may be mixed with negative subscripts
+
+

But remember the order of operations. : is really a +function. It takes its first argument as -1, and its second as 3, so +generates the sequence of numbers: c(-1, 0, 1, 2, 3).

+

The correct solution is to wrap that function call in parentheses, so that the - operator applies to the result:

+
+

R +

+
+x[-(1:3)]
+
+
+

OUTPUT +

+
  d   e 
+4.8 7.5 
+
+
+
+
+

To remove elements from a vector, we need to assign the result back +into the variable:

+
+

R +

+
+x <- x[-4]
+x
+
+
+

OUTPUT +

+
  a   b   c   e 
+5.4 6.2 7.1 7.5 
+
+
+
+ +
+
+

Challenge 1 +

+
+

Given the following code:

+
+

R +

+
+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')
+print(x)
+
+
+

OUTPUT +

+
  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 
+
+

Come up with at least 2 different commands that will produce the +following output:

+
+

OUTPUT +

+
  b   c   d 
+6.2 7.1 4.8 
+
+

After you find 2 different commands, compare notes with your +neighbour. Did you have different strategies?

+
+
+
+
+
+ +
+
+
+

R +

+
+x[2:4]
+
+
+

OUTPUT +

+
  b   c   d 
+6.2 7.1 4.8 
+
+
+

R +

+
+x[-c(1,5)]
+
+
+

OUTPUT +

+
  b   c   d 
+6.2 7.1 4.8 
+
+
+

R +

+
+x[c(2,3,4)]
+
+
+

OUTPUT +

+
  b   c   d 
+6.2 7.1 4.8 
+
+
+
+
+
+

Subsetting by name +

+
+

We can extract elements by using their name, instead of extracting by +index:

+
+

R +

+
+x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we can name a vector 'on the fly'
+x[c("a", "c")]
+
+
+

OUTPUT +

+
  a   c 
+5.4 7.1 
+
+

This is usually a much more reliable way to subset objects: the +position of various elements can often change when chaining together +subsetting operations, but the names will always remain the same!
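As a small illustration of why names are more robust than positions, the sketch below removes one element and then looks up the "same" item both ways. The names y, by_position, and by_name are made up for this example.

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# Drop the second element; everything after it shifts one position left
y <- x[-2]

by_position <- y[3]    # position 3 held "c" in x, but holds "d" in y
by_name     <- y["c"]  # still the element named "c"

by_position
by_name
```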

+

Subsetting through other logical operations +

+
+

We can also use any logical vector to subset:

+
+

R +

+
+x[c(FALSE, FALSE, TRUE, FALSE, TRUE)]
+
+
+

OUTPUT +

+
  c   e 
+7.1 7.5 
+
+

Since comparison operators (e.g. >, +<, ==) evaluate to logical vectors, we can +also use them to succinctly subset vectors: the following statement +gives the same result as the previous one.

+
+

R +

+
+x[x > 7]
+
+
+

OUTPUT +

+
  c   e 
+7.1 7.5 
+
+

Breaking it down, this statement first evaluates x>7, +generating a logical vector +c(FALSE, FALSE, TRUE, FALSE, TRUE), and then selects the +elements of x corresponding to the TRUE +values.

+

We can use == to mimic the previous method of indexing +by name (remember you have to use == rather than += for comparisons):

+
+

R +

+
+x[names(x) == "a"]
+
+
+

OUTPUT +

+
  a 
+5.4 
+
+
+
+ +
+
+

Tip: Combining logical conditions +

+
+

We often want to combine multiple logical criteria. For example, we +might want to find all the countries that are located in Asia +or Europe and have life expectancies +within a certain range. Several operations for combining logical vectors +exist in R:

+
    +
  • +&, the “logical AND” operator: returns +TRUE if both the left and right are TRUE.
  • +
  • +|, the “logical OR” operator: returns +TRUE, if either the left or right (or both) are +TRUE.
  • +
+

You may sometimes see && and || instead of & and |. These two-character operators work on single values only: in R 4.3.0 and later they raise an error if either operand has a length greater than one, while earlier versions looked only at the first element of each vector and silently ignored the remaining elements. In general you should not use the two-character operators in data analysis; save them for programming, i.e. deciding whether to execute a statement.

+
    +
  • +!, the “logical NOT” operator: converts TRUE to FALSE and FALSE to TRUE. It can negate a single logical condition (e.g. !TRUE becomes FALSE), or a whole vector of conditions (e.g. !c(TRUE, FALSE) becomes c(FALSE, TRUE)).
  • +
+

Additionally, you can compare the elements within a single vector +using the all function (which returns TRUE if +every element of the vector is TRUE) and the +any function (which returns TRUE if one or +more elements of the vector are TRUE).
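The operators in this tip can be tried out on the small named vector used earlier in this episode. This is only a sketch: the thresholds and the variable names in_range, extremes, and not_range are arbitrary choices for illustration.

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

in_range  <- x > 5 & x < 7   # element-wise AND
extremes  <- x < 5 | x > 7   # element-wise OR
not_range <- !in_range       # element-wise NOT

x[in_range]    # elements a and b
x[extremes]    # elements c, d and e

any_above_7 <- any(x > 7)    # TRUE: at least one element exceeds 7
all_above_5 <- all(x > 5)    # FALSE: d is 4.8
```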

+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

Given the following code:

+
+

R +

+
+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')
+print(x)
+
+
+

OUTPUT +

+
  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 
+
+

Write a subsetting command to return the values in x that are greater +than 4 and less than 7.

+
+
+
+
+
+ +
+
+
+

R +

+
+x_subset <- x[x<7 & x>4]
+print(x_subset)
+
+
+

OUTPUT +

+
  a   b   d 
+5.4 6.2 4.8 
+
+
+
+
+
+
+
+ +
+
+

Tip: Non-unique names +

+
+

You should be aware that it is possible for multiple elements in a +vector to have the same name. (For a data frame, columns can have the +same name — although R tries to avoid this — but row names must be +unique.) Consider these examples:

+
+

R +

+
+x <- 1:3
+x
+
+
+

OUTPUT +

+
[1] 1 2 3
+
+
+

R +

+
+names(x) <- c('a', 'a', 'a')
+x
+
+
+

OUTPUT +

+
a a a 
+1 2 3 
+
+
+

R +

+
+x['a']  # only returns first value
+
+
+

OUTPUT +

+
a 
+1 
+
+
+

R +

+
+x[names(x) == 'a']  # returns all three values
+
+
+

OUTPUT +

+
a a a 
+1 2 3 
+
+
+
+
+
+
+ +
+
+

Tip: Getting help for operators +

+
+

Remember you can search for help on operators by wrapping them in +quotes: help("%in%") or ?"%in%".

+
+
+
+

Skipping named elements +

+
+

Skipping or removing named elements is a little harder. If we try to +skip one named element by negating the string, R complains (slightly +obscurely) that it doesn’t know how to take the negative of a +string:

+
+

R +

+
+x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we start again by naming a vector 'on the fly'
+x[-"a"]
+
+
+

ERROR +

+
Error in -"a": invalid argument to unary operator
+
+

However, we can use the != (not-equals) operator to +construct a logical vector that will do what we want:

+
+

R +

+
+x[names(x) != "a"]
+
+
+

OUTPUT +

+
  b   c   d   e 
+6.2 7.1 4.8 7.5 
+
+

Skipping multiple named indices is a little bit harder still. Suppose +we want to drop the "a" and "c" elements, so +we try this:

+
+

R +

+
+x[names(x)!=c("a","c")]
+
+
+

WARNING +

+
Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length
+
+
+

OUTPUT +

+
  b   c   d   e 
+6.2 7.1 4.8 7.5 
+
+

R did something, but it gave us a warning that we ought to +pay attention to - and it apparently gave us the wrong answer +(the "c" element is still included in the vector)!

+

So what does != actually do in this case? That’s an +excellent question.

+
+

Recycling +

+

Let’s take a look at the comparison component of this code:

+
+

R +

+
+names(x) != c("a", "c")
+
+
+

WARNING +

+
Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length
+
+
+

OUTPUT +

+
[1] FALSE  TRUE  TRUE  TRUE  TRUE
+
+

Why does R give TRUE as the third element of this +vector, when names(x)[3] != "c" is obviously false? When +you use !=, R tries to compare each element of the left +argument with the corresponding element of its right argument. What +happens when you compare vectors of different lengths?

+
Inequality testing

When one vector is shorter than the other, it gets +recycled:

+
Inequality testing: results of recycling

In this case R repeats c("a", "c") as many times as necessary to match names(x), i.e. we get c("a","c","a","c","a"). Since the recycled "a" doesn’t match the third element of names(x), the value of != is TRUE. Because in this case the longer vector length (5) isn’t a multiple of the shorter vector length (2), R printed a warning message. If we had been unlucky and names(x) had contained six elements, R would silently have done the wrong thing (i.e., not what we intended it to do). This recycling rule can introduce subtle, hard-to-find bugs!
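To see the silent case for yourself, here is a minimal sketch with a hypothetical six-element vector x6 (not part of the lesson data). Because 6 is a multiple of 2, R recycles without any warning, yet the element named "c" is still kept.

```r
x6 <- c(a = 1, b = 2, c = 3, d = 4, e = 5, f = 6)  # hypothetical six-element vector

# c("a", "c") is recycled to c("a", "c", "a", "c", "a", "c") with no warning
keep_wrong <- names(x6) != c("a", "c")

x6[keep_wrong]   # drops "a" only; "c" is compared against the recycled "a"
```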

+

The way to get R to do what we really want (match each element of the left argument with all of the elements of the right argument) is to use the %in% operator. The %in% operator goes through each element of its left argument, in this case the names of x, and asks, “Does this element occur in the second argument?”. Here, since we want to exclude values, we also need a ! operator to change “in” to “not in”:

+
+

R +

+
+x[! names(x) %in% c("a","c") ]
+
+
+

OUTPUT +

+
  b   d   e 
+6.2 4.8 7.5 
+
+
+
+ +
+
+

Challenge 3 +

+
+

Selecting elements of a vector that match any of a list of components +is a very common data analysis task. For example, the gapminder data set +contains country and continent variables, but +no information between these two scales. Suppose we want to pull out +information from southeast Asia: how do we set up an operation to +produce a logical vector that is TRUE for all of the +countries in southeast Asia and FALSE otherwise?

+

Suppose you have these data:

+
+

R +

+
+seAsia <- c("Myanmar","Thailand","Cambodia","Vietnam","Laos")
+## read in the gapminder data that we downloaded in episode 2
+gapminder <- read.csv("data/gapminder_data.csv", header=TRUE)
+## extract the `country` column from a data frame (we'll see this later);
+## convert from a factor to a character;
+## and get just the non-repeated elements
+countries <- unique(as.character(gapminder$country))
+
+

There’s a wrong way (using only ==), which will give you +a warning; a clunky way (using the logical operators == and +|); and an elegant way (using %in%). See +whether you can come up with all three and explain how they (don’t) +work.

+
+
+
+
+
+ +
+
+
    +
  • The wrong way to do this problem is countries==seAsia. This gives a warning ("In countries == seAsia : longer object length is not a multiple of shorter object length") and the wrong answer (a vector of all FALSE values), because none of the recycled values of seAsia happen to line up correctly with matching values in countries.
  • +
  • The clunky (but technically correct) way to do this +problem is
  • +
+
+

R +

+
+ (countries=="Myanmar" | countries=="Thailand" |
+ countries=="Cambodia" | countries == "Vietnam" | countries=="Laos")
+
+

(or countries==seAsia[1] | countries==seAsia[2] | ...). +This gives the correct values, but hopefully you can see how awkward it +is (what if we wanted to select countries from a much longer list?).

+
    +
  • The best way to do this problem is +countries %in% seAsia, which is both correct and easy to +type (and read).
  • +
+
+
+
+
+
+

Handling special values +

+
+

At some point you will encounter functions in R that cannot handle +missing, infinite, or undefined data.

+

There are a number of special functions you can use to filter out +this data:

+
    +
  • +is.na will return all positions in a vector, matrix, or +data.frame containing NA (or NaN)
  • +
  • likewise, is.nan and is.infinite will do the same for NaN and for Inf (or -Inf), respectively.
  • +
  • +is.finite will return all positions in a vector, +matrix, or data.frame that do not contain NA, +NaN or Inf.
  • +
  • +na.omit will filter out all missing values from a +vector
  • +
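The functions listed above can be compared side by side on a small vector. This is a sketch with made-up values; the names v, finite_only, and no_missing are chosen only for illustration.

```r
v <- c(1.5, NA, NaN, Inf, -Inf, 3)

is.na(v)        # TRUE for NA and NaN
is.nan(v)       # TRUE only for NaN
is.infinite(v)  # TRUE for Inf and -Inf
is.finite(v)    # TRUE only for ordinary numbers

finite_only <- v[is.finite(v)]         # keep only finite values
no_missing  <- as.vector(na.omit(v))   # drop NA and NaN, keep Inf and -Inf
```

na.omit() attaches bookkeeping attributes to its result; as.vector() is used here to strip them so that a plain vector remains.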

Factor subsetting +

+
+

Now that we’ve explored the different ways to subset vectors, how do +we subset the other data structures?

+

Factor subsetting works the same way as vector subsetting.

+
+

R +

+
+f <- factor(c("a", "a", "b", "c", "c", "d"))
+f[f == "a"]
+
+
+

OUTPUT +

+
[1] a a
+Levels: a b c d
+
+
+

R +

+
+f[f %in% c("b", "c")]
+
+
+

OUTPUT +

+
[1] b c c
+Levels: a b c d
+
+
+

R +

+
+f[1:3]
+
+
+

OUTPUT +

+
[1] a a b
+Levels: a b c d
+
+

Skipping elements will not remove the level even if no more of that +category exists in the factor:

+
+

R +

+
+f[-3]
+
+
+

OUTPUT +

+
[1] a a c c d
+Levels: a b c d
+
+
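If the unused level should disappear as well, one option is the base R function droplevels(). The sketch below repeats the example above; f_sub and f_clean are illustrative names.

```r
f <- factor(c("a", "a", "b", "c", "c", "d"))

f_sub <- f[-3]                 # the only "b" is gone, but the level remains
levels(f_sub)

f_clean <- droplevels(f_sub)   # remove levels that no longer occur
levels(f_clean)
```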

Matrix subsetting +

+
+

Matrices are also subsetted using the [ function. In +this case it takes two arguments: the first applying to the rows, the +second to its columns:

+
+

R +

+
+set.seed(1)
+m <- matrix(rnorm(6*4), ncol=4, nrow=6)
+m[3:4, c(3,1)]
+
+
+

OUTPUT +

+
            [,1]       [,2]
+[1,]  1.12493092 -0.8356286
+[2,] -0.04493361  1.5952808
+
+

You can leave the first or second arguments blank to retrieve all the +rows or columns respectively:

+
+

R +

+
+m[, c(3,4)]
+
+
+

OUTPUT +

+
            [,1]        [,2]
+[1,] -0.62124058  0.82122120
+[2,] -2.21469989  0.59390132
+[3,]  1.12493092  0.91897737
+[4,] -0.04493361  0.78213630
+[5,] -0.01619026  0.07456498
+[6,]  0.94383621 -1.98935170
+
+

If we only access one row or column, R will automatically convert the +result to a vector:

+
+

R +

+
+m[3,]
+
+
+

OUTPUT +

+
[1] -0.8356286  0.5757814  1.1249309  0.9189774
+
+

If you want to keep the output as a matrix, you need to specify a third argument, drop = FALSE:

+
+

R +

+
+m[3, , drop=FALSE]
+
+
+

OUTPUT +

+
           [,1]      [,2]     [,3]      [,4]
+[1,] -0.8356286 0.5757814 1.124931 0.9189774
+
+

Unlike vectors, if we try to access a row or column outside of the +matrix, R will throw an error:

+
+

R +

+
+m[, c(3,6)]
+
+
+

ERROR +

+
Error in m[, c(3, 6)]: subscript out of bounds
+
+
+
+ +
+
+

Tip: Higher dimensional arrays +

+
+

When dealing with multi-dimensional arrays, each argument to [ corresponds to a dimension. For example, for a 3D array, the first three arguments correspond to the rows, columns, and depth dimension.
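A short sketch of subsetting a three-dimensional array follows. The array arr and its dimensions are invented for this example.

```r
arr <- array(1:24, dim = c(2, 3, 4))   # 2 rows, 3 columns, 4 layers

single <- arr[2, 3, 1]    # one element: row 2, column 3, layer 1
layer2 <- arr[, , 2]      # the whole second layer, returned as a 2 x 3 matrix

# drop = FALSE keeps all three dimensions even when subsetting
kept_dims <- dim(arr[, 1:2, , drop = FALSE])

single
layer2
kept_dims
```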

+
+
+
+

Because matrices are vectors, we can also subset using only one +argument:

+
+

R +

+
+m[5]
+
+
+

OUTPUT +

+
[1] 0.3295078
+
+

This usually isn’t useful, and is often confusing to read. However, it is useful to note that matrices are laid out in column-major format by default. That is, the elements of the vector are arranged column-wise:

+
+

R +

+
+matrix(1:6, nrow=2, ncol=3)
+
+
+

OUTPUT +

+
     [,1] [,2] [,3]
+[1,]    1    3    5
+[2,]    2    4    6
+
+

If you wish to populate the matrix by row, use +byrow=TRUE:

+
+

R +

+
+matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
+
+
+

OUTPUT +

+
     [,1] [,2] [,3]
+[1,]    1    2    3
+[2,]    4    5    6
+
+

Matrices can also be subsetted using their rownames and column names +instead of their row and column indices.
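For instance, with a small matrix whose row and column names are set through dimnames (the matrix m_named and its labels are hypothetical), name-based subsetting could look like this:

```r
m_named <- matrix(1:6, nrow = 2, ncol = 3,
                  dimnames = list(c("r1", "r2"), c("A", "B", "C")))

one_cell <- m_named["r2", "B"]        # single element by row and column name
two_cols <- m_named[, c("A", "C")]    # two columns by name, all rows

one_cell
two_cols
```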

+
+
+ +
+
+

Challenge 4 +

+
+

Given the following code:

+
+

R +

+
+m <- matrix(1:18, nrow=3, ncol=6)
+print(m)
+
+
+

OUTPUT +

+
     [,1] [,2] [,3] [,4] [,5] [,6]
+[1,]    1    4    7   10   13   16
+[2,]    2    5    8   11   14   17
+[3,]    3    6    9   12   15   18
+
+
    +
  1. Which of the following commands will extract the values 11 and +14?
  2. +
+

A. m[2,4,2,5]

+

B. m[2:5]

+

C. m[4:5,2]

+

D. m[2,c(4,5)]

+
+
+
+
+
+ +
+
+

D

+
+
+
+
+

List subsetting +

+
+

Now we’ll introduce some new subsetting operators. There are three functions used to subset lists: [, which we have already used with atomic vectors and matrices, as well as [[ and $.

+

Using [ will always return a list. If you want to +subset a list, but not extract an element, then you +will likely use [.

+
+

R +

+
+xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))
+xlist[1]
+
+
+

OUTPUT +

+
$a
+[1] "Software Carpentry"
+
+

This returns a list with one element.

+

We can subset elements of a list exactly the same way as atomic vectors using [. Comparison operations, however, won’t work because they are not recursive: they will try to condition on the data structures in each element of the list, not on the individual elements within those data structures.

+
+

R +

+
+xlist[1:2]
+
+
+

OUTPUT +

+
$a
+[1] "Software Carpentry"
+
+$b
+ [1]  1  2  3  4  5  6  7  8  9 10
+
+
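Since comparison operators do not look inside list elements, one possible way to select list elements by a condition is to compute a single TRUE or FALSE per element first, for example with sapply(), and then subset with [. This is a sketch rather than the only approach; is_num and numeric_parts are illustrative names.

```r
xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))

# One logical value per list element: is that element a numeric vector?
is_num <- sapply(xlist, is.numeric)

# Subsetting with [ returns a list containing only the matching elements
numeric_parts <- xlist[is_num]
names(numeric_parts)
```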

To extract individual elements of a list, you need to use the +double-square bracket function: [[.

+
+

R +

+
+xlist[[1]]
+
+
+

OUTPUT +

+
[1] "Software Carpentry"
+
+

Notice that now the result is a vector, not a list.

+

You can’t extract more than one element at once:

+
+

R +

+
+xlist[[1:2]]
+
+
+

ERROR +

+
Error in xlist[[1:2]]: subscript out of bounds
+
+

Nor use it to skip elements:

+
+

R +

+
+xlist[[-1]]
+
+
+

ERROR +

+
Error in xlist[[-1]]: invalid negative subscript in get1index <real>
+
+

But you can use names to both subset and extract elements:

+
+

R +

+
+xlist[["a"]]
+
+
+

OUTPUT +

+
[1] "Software Carpentry"
+
+

The $ function is a shorthand way for extracting +elements by name:

+
+

R +

+
+xlist$data
+
+
+

OUTPUT +

+
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
+Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
+Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
+Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
+Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
+Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
+Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
+
+
+
+ +
+
+

Challenge 5 +

+
+

Given the following list:

+
+

R +

+
+xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))
+
+

Using your knowledge of both list and vector subsetting, extract the +number 2 from xlist. Hint: the number 2 is contained within the “b” item +in the list.

+
+
+
+
+
+ +
+
+
+

R +

+
+xlist$b[2]
+
+
+

OUTPUT +

+
[1] 2
+
+
+

R +

+
+xlist[[2]][2]
+
+
+

OUTPUT +

+
[1] 2
+
+
+

R +

+
+xlist[["b"]][2]
+
+
+

OUTPUT +

+
[1] 2
+
+
+
+
+
+
+
+ +
+
+

Challenge 6 +

+
+

Given a linear model:

+
+

R +

+
+mod <- aov(pop ~ lifeExp, data=gapminder)
+
+

Extract the residual degrees of freedom (hint: +attributes() will help you)

+
+
+
+
+
+ +
+
+
+

R +

+
+attributes(mod) ## `df.residual` is one of the names of `mod`
+
+
+

R +

+
+mod$df.residual
+
+
+
+
+
+

Data frames +

+
+

Remember that data frames are lists under the hood, so similar rules apply. However, they are also two-dimensional objects:

+

[ with one argument will act the same way as for lists, +where each list element corresponds to a column. The resulting object +will be a data frame:

+
+

R +

+
+head(gapminder[3])
+
+
+

OUTPUT +

+
       pop
+1  8425333
+2  9240934
+3 10267083
+4 11537966
+5 13079460
+6 14880372
+
+

Similarly, [[ will act to extract a single +column:

+
+

R +

+
+head(gapminder[["lifeExp"]])
+
+
+

OUTPUT +

+
[1] 28.801 30.332 31.997 34.020 36.088 38.438
+
+

And $ provides a convenient shorthand to extract columns +by name:

+
+

R +

+
+head(gapminder$year)
+
+
+

OUTPUT +

+
[1] 1952 1957 1962 1967 1972 1977
+
+

With two arguments, [ behaves the same way as for +matrices:

+
+

R +

+
+gapminder[1:3,]
+
+
+

OUTPUT +

+
      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+
+

If we subset a single row, the result will be a data frame (because +the elements are mixed types):

+
+

R +

+
+gapminder[3,]
+
+
+

OUTPUT +

+
      country year      pop continent lifeExp gdpPercap
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+
+

But for a single column the result will be a vector (this can be +changed with the third argument, drop = FALSE).
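The difference is easy to check on a small data frame. The toy data frame df below is hypothetical and merely stands in for gapminder so that the block runs on its own.

```r
df <- data.frame(country = c("A", "B", "C"),   # hypothetical toy data
                 year    = c(1952, 1957, 1962),
                 pop     = c(10, 20, 30))

col_vec <- df[, "pop"]                 # a plain numeric vector
col_df  <- df[, "pop", drop = FALSE]   # still a data frame with one column

class(col_vec)
class(col_df)
```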

+
+
+ +
+
+

Challenge 7 +

+
+

Fix each of the following common data frame subsetting errors:

+
    +
  1. Extract observations collected for the year 1957
  2. +
+
+

R +

+
gapminder[gapminder$year = 1957,]
+
+
    +
  1. Extract all columns except 1 through to 4
  2. +
+
+

R +

+
+gapminder[,-1:4]
+
+
    +
  1. Extract the rows where the life expectancy is longer than 80 years
  2. +
+
+

R +

+
+gapminder[gapminder$lifeExp > 80]
+
+
    +
  1. Extract the first row, and the fourth and fifth columns +(continent and lifeExp).
  2. +
+
+

R +

+
+gapminder[1, 4, 5]
+
+
    +
  1. Advanced: extract rows that contain information for the years 2002 +and 2007
  2. +
+
+

R +

+
+gapminder[gapminder$year == 2002 | 2007,]
+
+
+
+
+
+
+ +
+
+

Fix each of the following common data frame subsetting errors:

+
    +
  1. Extract observations collected for the year 1957
  2. +
+
+

R +

+
+# gapminder[gapminder$year = 1957,]
+gapminder[gapminder$year == 1957,]
+
+
    +
  1. Extract all columns except 1 through to 4
  2. +
+
+

R +

+
+# gapminder[,-1:4]
+gapminder[,-c(1:4)]
+
+
    +
  1. Extract the rows where the life expectancy is longer than 80 +years
  2. +
+
+

R +

+
+# gapminder[gapminder$lifeExp > 80]
+gapminder[gapminder$lifeExp > 80,]
+
+
    +
  1. Extract the first row, and the fourth and fifth columns +(continent and lifeExp).
  2. +
+
+

R +

+
+# gapminder[1, 4, 5]
+gapminder[1, c(4, 5)]
+
+
    +
  1. Advanced: extract rows that contain information for the years 2002 +and 2007
  2. +
+
+

R +

+
+# gapminder[gapminder$year == 2002 | 2007,]
+gapminder[gapminder$year == 2002 | gapminder$year == 2007,]
+gapminder[gapminder$year %in% c(2002, 2007),]
+
+
+
+
+
+
+
+ +
+
+

Challenge 8 +

+
+
    +
  1. Why does gapminder[1:20] return an error? How does +it differ from gapminder[1:20, ]?

  2. +
  3. Create a new data.frame called +gapminder_small that only contains rows 1 through 9 and 19 +through 23. You can do this in one or two steps.

  4. +
+
+
+
+
+
+ +
+
+
    +
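  1. gapminder is a data.frame, so [ with a single argument selects columns, just as it selects elements of a list. gapminder[1:20] therefore asks for columns 1 through 20, but gapminder has only six columns, so R returns an error. gapminder[1:20, ] subsets on two dimensions and gives the first 20 rows and all columns.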

  2. +
  3. +
  4. +
+
+

R +

+
+gapminder_small <- gapminder[c(1:9, 19:23),]
+
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
    +
  • Indexing in R starts at 1, not 0.
  • +
  • Access individual values by location using [].
  • +
  • Access slices of data using [low:high].
  • +
  • Access arbitrary sets of data using [c(...)].
  • +
  • Use logical operations and logical vectors to access subsets of +data.
  • +
+
+
+
+

Content from Control Flow

+
+

Last updated on 2023-10-26

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I make data-dependent choices in R?
  • +
  • How can I repeat operations in R?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • Write conditional statements with if...else statements +and ifelse().
  • +
  • Write and understand for() loops.
  • +
+
+
+
+
+
+

Often when we’re coding we want to control the flow of our actions. +This can be done by setting actions to occur only if a condition or a +set of conditions are met. Alternatively, we can also set an action to +occur a particular number of times.

+

There are several ways you can control flow in R. For conditional +statements, the most commonly used approaches are the constructs:

+
+

R +

+
# if
+if (condition is true) {
+  perform action
+}
+
+# if ... else
+if (condition is true) {
+  perform action
+} else {  # that is, if the condition is false,
+  perform alternative action
+}
+
+

Say, for example, that we want R to print a message if a variable +x has a particular value:

+
+

R +

+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+}
+
+x
+
+
+

OUTPUT +

+
[1] 8
+
+

The print statement does not appear in the console because x is not greater than or equal to 10. To print a different message for numbers less than 10, we can add an else statement.

+
+

R +

+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else {
+  print("x is less than 10")
+}
+
+
+

OUTPUT +

+
[1] "x is less than 10"
+
+

You can also test multiple conditions by using +else if.

+
+

R +

+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else if (x > 5) {
+  print("x is greater than 5, but less than 10")
+} else {
+  print("x is less than or equal to 5")
+}
+
+
+

OUTPUT +

+
[1] "x is greater than 5, but less than 10"
+
+

Important: when R evaluates the condition inside +if() statements, it is looking for a logical element, i.e., +TRUE or FALSE. This can cause some headaches +for beginners. For example:

+
+

R +

+
+x  <-  4 == 3
+if (x) {
+  "4 equals 3"
+} else {
+  "4 does not equal 3"
+}
+
+
+

OUTPUT +

+
[1] "4 does not equal 3"
+
+

As we can see, the “not equal” message was printed because the vector x is FALSE:

+
+

R +

+
+x <- 4 == 3
+x
+
+
+

OUTPUT +

+
[1] FALSE
+
+
+
+ +
+
+

Challenge 1 +

+
+

Use an if() statement to print a suitable message +reporting whether there are any records from 2002 in the +gapminder dataset. Now do the same for 2012.

+
+
+
+
+
+ +
+
+

We will first see a solution to Challenge 1 which does not use the any() function. We first subset the rows of gapminder for which gapminder$year is equal to 2002, using the logical vector gapminder$year == 2002 as the row index:

+
+

R +

+
+gapminder[(gapminder$year == 2002),]
+
+

Then, we count the number of rows of the data.frame gapminder that correspond to the year 2002:

+
+

R +

+
+rows2002_number <- nrow(gapminder[(gapminder$year == 2002),])
+
+

The presence of any record for the year 2002 is equivalent to the condition that rows2002_number is one or more:

+
+

R +

+
+rows2002_number >= 1
+
+

Putting it all together, we obtain:

+
+

R +

+
+if(nrow(gapminder[(gapminder$year == 2002),]) >= 1){
+   print("Record(s) for the year 2002 found.")
+}
+
+

All this can be done more quickly with any(). The +logical condition can be expressed as:

+
+

R +

+
+if(any(gapminder$year == 2002)){
+   print("Record(s) for the year 2002 found.")
+}
+
+
+
+
+
+

Did anyone get an error message like this?

+
+

ERROR +

+
Error in if (gapminder$year == 2012) {: the condition has length > 1
+
+

The if() function only accepts singular (of length 1) inputs, and therefore returns an error when you use it with a vector. (Versions of R before 4.2.0 instead issued a warning and evaluated only the condition in the first element of the vector.) Therefore, to use the if() function, you need to make sure your input is singular (of length 1), for example by wrapping the comparison in any() or all().
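One way to handle the 2012 part of the challenge without triggering this error is to collapse the comparison to a single value with any() and add an else branch. In the sketch below, the vector years is a hypothetical stand-in for gapminder$year so that the block runs without the data file.

```r
years <- c(1952, 1957, 2002, 2007)   # hypothetical stand-in for gapminder$year

# any() collapses the element-wise comparison to a single TRUE or FALSE
if (any(years == 2012)) {
  msg <- "Record(s) for the year 2012 found."
} else {
  msg <- "No records for the year 2012."
}
print(msg)
```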

+
+
+ +
+
+

Tip: Built in ifelse() +function +

+
+

R accepts both if() and +else if() statements structured as outlined above, but also +statements using R’s built-in ifelse() +function. This function accepts both singular and vector inputs and is +structured as follows:

+
+

R +

+
# ifelse function
+ifelse(condition is true, perform action, perform alternative action)
+
+

where the first argument is the condition or a set of conditions to be met, the second argument is the value returned where the condition is TRUE, and the third argument is the value returned where the condition is FALSE.

+
+

R +

+
+y <- -3
+ifelse(y < 0, "y is a negative number", "y is either positive or zero")
+
+
+

OUTPUT +

+
[1] "y is a negative number"
+
+
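Because ifelse() is vectorized, the same call also works on a whole vector and returns one result per element. A small sketch with made-up values (y_values and signs are illustrative names):

```r
y_values <- c(-3, 0, 4, -1)

# One answer per element of y_values
signs <- ifelse(y_values < 0, "negative", "positive or zero")
signs
```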
+
+
+
+
+ +
+
+

Tip: any() and +all() +

+
+

The any() function will return TRUE if at +least one TRUE value is found within a vector, otherwise it +will return FALSE. This can be used in a similar way to the +%in% operator. The function all(), as the name +suggests, will only return TRUE if all values in the vector +are TRUE.
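The relationship between any(), all(), and %in% can be sketched as follows; the vector years holds hypothetical values and is not taken from the gapminder data.

```r
years <- c(1952, 1957, 2002, 2007)   # hypothetical values

has_2002_any <- any(years == 2002)   # is at least one element equal to 2002?
has_2002_in  <- 2002 %in% years      # the same question, phrased with %in%
all_recent   <- all(years > 1950)    # does every element satisfy the condition?
all_after_60 <- all(years > 1960)    # FALSE, because 1952 and 1957 fail

c(has_2002_any, has_2002_in, all_recent, all_after_60)
```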

+
+
+
+

Repeating operations +

+
+

If you want to iterate over a set of values, when the order of +iteration is important, and perform the same operation on each, a +for() loop will do the job. We saw for() loops +in the shell +lessons earlier. This is the most flexible of looping operations, +but therefore also the hardest to use correctly. In general, the advice +of many R users would be to learn about for() +loops, but to avoid using for() loops unless the order of +iteration is important: i.e. the calculation at each iteration depends +on the results of previous iterations. If the order of iteration is not +important, then you should learn about vectorized alternatives, such as +the purrr package, as they pay off in computational +efficiency.

+

The basic structure of a for() loop is:

+
+

R +

+
for (iterator in set of values) {
+  do a thing
+}
+
+

For example:

+
+

R +

+
+for (i in 1:10) {
+  print(i)
+}
+
+
+

OUTPUT +

+
[1] 1
+[1] 2
+[1] 3
+[1] 4
+[1] 5
+[1] 6
+[1] 7
+[1] 8
+[1] 9
+[1] 10
+
+

The 1:10 bit creates a vector on the fly; you can +iterate over any other vector as well.
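To illustrate iterating over another kind of vector, here is a short sketch with a made-up character vector (continents and name_lengths are illustrative names). It shows looping over the values directly and looping over positions with seq_along():

```r
# Hypothetical character vector to iterate over
continents <- c("Africa", "Asia", "Europe")

# Iterate over the values directly
for (continent in continents) {
  print(continent)
}

# Iterate over positions; seq_along() also behaves sensibly for an empty vector
name_lengths <- integer(length(continents))
for (i in seq_along(continents)) {
  name_lengths[i] <- nchar(continents[i])
}
name_lengths
```

Looping over positions is useful when you need the index to store a result, as in the second loop.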

+

We can use a for() loop nested within another +for() loop to iterate over two things at once.

+
+

R +

+
+for (i in 1:5) {
+  for (j in c('a', 'b', 'c', 'd', 'e')) {
+    print(paste(i,j))
+  }
+}
+
+
+

OUTPUT +

+
[1] "1 a"
+[1] "1 b"
+[1] "1 c"
+[1] "1 d"
+[1] "1 e"
+[1] "2 a"
+[1] "2 b"
+[1] "2 c"
+[1] "2 d"
+[1] "2 e"
+[1] "3 a"
+[1] "3 b"
+[1] "3 c"
+[1] "3 d"
+[1] "3 e"
+[1] "4 a"
+[1] "4 b"
+[1] "4 c"
+[1] "4 d"
+[1] "4 e"
+[1] "5 a"
+[1] "5 b"
+[1] "5 c"
+[1] "5 d"
+[1] "5 e"
+
+

We notice in the output that when the first index (i) is +set to 1, the second index (j) iterates through its full +set of indices. Once the indices of j have been iterated +through, then i is incremented. This process continues +until the last index has been used for each for() loop.

+

Rather than printing the results, we could write the loop output to a +new object.

+
+

R +

+
+output_vector <- c()
+for (i in 1:5) {
+  for (j in c('a', 'b', 'c', 'd', 'e')) {
+    temp_output <- paste(i, j)
+    output_vector <- c(output_vector, temp_output)
+  }
+}
+output_vector
+
+
+

OUTPUT +

+
 [1] "1 a" "1 b" "1 c" "1 d" "1 e" "2 a" "2 b" "2 c" "2 d" "2 e" "3 a" "3 b"
+[13] "3 c" "3 d" "3 e" "4 a" "4 b" "4 c" "4 d" "4 e" "5 a" "5 b" "5 c" "5 d"
+[25] "5 e"
+
+

This approach can be useful, but ‘growing your results’ (building the +result object incrementally) is computationally inefficient, so avoid it +when you are iterating through a lot of values.

+
+
+ +
+
+

Tip: don’t grow your results +

+
+

One of the biggest things that trips up novices and experienced R users alike is building a results object (vector, list, matrix, data frame) as your for loop progresses. Each time the object grows, R has to copy it into a new, larger object, so your calculations can very quickly slow to a crawl. It’s much better to define an empty results object of the appropriate dimensions beforehand, rather than initializing an empty object without dimensions. So if you know the end result will be stored in a matrix, as in the next example, create an empty matrix with 5 rows and 5 columns, then at each iteration store the results in the appropriate location.
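The same idea applies to plain vectors. A minimal sketch, assuming we want the squares of the first n integers (n and squares are illustrative names):

```r
n <- 10

# Preallocate a numeric vector of the final length (filled with zeros)
squares <- numeric(n)

for (i in seq_len(n)) {
  # Fill position i in place instead of appending with c()
  squares[i] <- i^2
}
squares
```

For a calculation this simple the vectorized expression (1:n)^2 would be preferable; the loop is only here to show the preallocation pattern.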

+
+
+
+

A better way is to define your (empty) output object before filling +in the values. For this example, it looks more involved, but is still +more efficient.

+
+

R +

+
+output_matrix <- matrix(nrow = 5, ncol = 5)
+j_vector <- c('a', 'b', 'c', 'd', 'e')
+for (i in 1:5) {
+  for (j in 1:5) {
+    temp_j_value <- j_vector[j]
+    temp_output <- paste(i, temp_j_value)
+    output_matrix[i, j] <- temp_output
+  }
+}
+output_vector2 <- as.vector(output_matrix)
+output_vector2
+
+
+

OUTPUT +

+
 [1] "1 a" "2 a" "3 a" "4 a" "5 a" "1 b" "2 b" "3 b" "4 b" "5 b" "1 c" "2 c"
+[13] "3 c" "4 c" "5 c" "1 d" "2 d" "3 d" "4 d" "5 d" "1 e" "2 e" "3 e" "4 e"
+[25] "5 e"
+
+
+
+ +
+
+

Tip: While loops +

+
+

Sometimes you will find yourself needing to repeat an operation as +long as a certain condition is met. You can do this with a +while() loop.

+
+

R +

+
while(this condition is true){
+  do a thing
+}
+
+

R will interpret a condition being met as “TRUE”.

+

As an example, here’s a while loop that generates random numbers from +a uniform distribution (the runif() function) between 0 and +1 until it gets one that’s less than 0.1.

+
+

R +

+
+z <- 1
+while(z > 0.1){
+  z <- runif(1)
+  cat(z, "\n")
+}
+
+

while() loops will not always be appropriate. You have +to be particularly careful that you don’t end up stuck in an infinite +loop because your condition is always met and hence the while statement +never terminates.
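One common safeguard, sketched here as an assumption rather than part of the original example, is to cap the number of iterations so the loop must eventually stop (max_iterations and n_iterations are illustrative names):

```r
set.seed(42)            # arbitrary seed, only to make the run repeatable
max_iterations <- 1000  # upper bound on the number of passes
n_iterations <- 0
z <- 1

# Stop when z is small enough OR when the cap is reached
while (z > 0.1 && n_iterations < max_iterations) {
  z <- runif(1)
  n_iterations <- n_iterations + 1
}

cat("Stopped after", n_iterations, "iterations with z =", z, "\n")
```

After the loop you can check n_iterations to see whether it ended because the condition was satisfied or because the cap was hit.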

+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

Compare the objects output_vector and +output_vector2. Are they the same? If not, why not? How +would you change the last block of code to make +output_vector2 the same as output_vector?

+
+
+
+
+
+ +
+
+

We can check whether the two vectors are identical using the all() function, which here returns FALSE:

+
+

R +

+
+all(output_vector == output_vector2)
+
+

However, all the elements of output_vector can be found +in output_vector2:

+
+

R +

+
+all(output_vector %in% output_vector2)
+
+

and vice versa:

+
+

R +

+
+all(output_vector2 %in% output_vector)
+
+

Therefore, the elements of output_vector and output_vector2 are just sorted in a different order. This is because as.vector() outputs the elements of an input matrix column by column. Taking a look at output_matrix, we can see that we want its elements row by row. The solution is to transpose output_matrix. We can do it either by calling the transpose function t() or by filling in the elements in the right order. The first solution requires changing the original

+
+

R +

+
+output_vector2 <- as.vector(output_matrix)
+
+

into

+
+

R +

+
+output_vector2 <- as.vector(t(output_matrix))
+
+

The second solution requires changing

+
+

R +

+
+output_matrix[i, j] <- temp_output
+
+

into

+
+

R +

+
+output_matrix[j, i] <- temp_output
+
+
+
+
+
+
+
+ +
+
+

Challenge 3 +

+
+

Write a script that loops through the gapminder data by +continent and prints out whether the mean life expectancy is smaller or +larger than 50 years.

+
+
+
+
+
+ +
+
+

Step 1: We want to make sure we can extract all the +unique values of the continent vector

+
+

R +

+
+gapminder <- read.csv("data/gapminder_data.csv")
+unique(gapminder$continent)
+
+

Step 2: We also need to loop over each of these +continents and calculate the average life expectancy for each +subset of data. We can do that as follows:

+
    +
  1. Loop over each of the unique values of ‘continent’
  2. For each value of continent, create a temporary variable storing that subset
  3. Return the calculated life expectancy to the user by printing the output:
+
+

R +

+
+for (iContinent in unique(gapminder$continent)) {
+  tmp <- gapminder[gapminder$continent == iContinent, ]
+  cat(iContinent, mean(tmp$lifeExp, na.rm = TRUE), "\n")
+  rm(tmp)
+}
+
+

Step 3: The exercise asks us to report whether the average life expectancy is below or above 50 years. So we need to add an if() condition before printing, which evaluates whether the calculated average life expectancy is above or below a threshold, and prints an output conditional on the result. We need to amend (3) from above:

+

3a. If the calculated life expectancy is less than some threshold (50 +years), return the continent and a statement that life expectancy is +less than threshold, otherwise return the continent and a statement that +life expectancy is greater than threshold:

+
+

R +

+
+thresholdValue <- 50
+
+for (iContinent in unique(gapminder$continent)) {
+   tmp <- mean(gapminder[gapminder$continent == iContinent, "lifeExp"])
+
+   if (tmp < thresholdValue){
+       cat("Average Life Expectancy in", iContinent, "is less than", thresholdValue, "\n")
+   } else {
+       cat("Average Life Expectancy in", iContinent, "is greater than", thresholdValue, "\n")
+   } # end if else condition
+   rm(tmp)
+} # end for loop
+
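As an aside, the per-continent means can also be computed without subsetting inside the loop. The sketch below uses base R's tapply() on a small made-up data frame (toy is hypothetical and stands in for gapminder so the code runs without the CSV file):

```r
# Made-up values for illustration, not the real gapminder data
toy <- data.frame(
  continent = c("Africa", "Africa", "Europe", "Europe"),
  lifeExp   = c(40, 50, 70, 80)
)
thresholdValue <- 50

# tapply() computes the mean of lifeExp within each continent
continent_means <- tapply(toy$lifeExp, toy$continent, mean)

for (iContinent in names(continent_means)) {
  comparison <- if (continent_means[[iContinent]] < thresholdValue) {
    "less than"
  } else {
    "greater than"
  }
  cat("Average Life Expectancy in", iContinent, "is", comparison, thresholdValue, "\n")
}
```

The loop still does the printing, but the grouping and averaging happen in a single call.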
+
+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

Modify the script from Challenge 3 to loop over each country. This +time print out whether the life expectancy is smaller than 50, between +50 and 70, or greater than 70.

+
+
+
+
+
+ +
+
+

We modify our solution to Challenge 3 by now adding two thresholds, +lowerThreshold and upperThreshold and +extending our if-else statements:

+
+

R +

+
+ lowerThreshold <- 50
+ upperThreshold <- 70
+
+for (iCountry in unique(gapminder$country)) {
+    tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"])
+
+    if(tmp < lowerThreshold) {
+        cat("Average Life Expectancy in", iCountry, "is less than", lowerThreshold, "\n")
+    } else if(tmp < upperThreshold) { # tmp >= lowerThreshold is already guaranteed here
+        cat("Average Life Expectancy in", iCountry, "is between", lowerThreshold, "and", upperThreshold, "\n")
+    } else {
+        cat("Average Life Expectancy in", iCountry, "is greater than", upperThreshold, "\n")
+    }
+    rm(tmp)
+}
+
+
+
+
+
+
+
+ +
+
+

Challenge 5 - Advanced +

+
+

Write a script that loops over each country in the +gapminder dataset, tests whether the country starts with a +‘B’, and graphs life expectancy against time as a line graph if the mean +life expectancy is under 50 years.

+
+
+
+
+
+ +
+
+

We will use the grep() command that was introduced in the Unix Shell lesson to find countries that start with “B”. Let’s understand how to do this first. Following from the Unix shell section we may be tempted to try the following:

+
+

R +

+
+grep("^B", unique(gapminder$country))
+
+

But when we evaluate this command it returns the indices of the elements of the country vector that start with “B”. To get the values, we must add the value = TRUE option to the grep() command:

+
+

R +

+
+grep("^B", unique(gapminder$country), value = TRUE)
+
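To see the difference between the two calls without loading the data, here is a sketch on a short made-up vector (countries is a stand-in for unique(gapminder$country)); it also shows grepl(), which returns one logical value per element:

```r
# A few country names as a stand-in for unique(gapminder$country)
countries <- c("Bahrain", "Albania", "Belgium", "Cuba", "Brazil")

b_positions <- grep("^B", countries)                # positions of the matches
b_names     <- grep("^B", countries, value = TRUE)  # the matching strings
b_flags     <- grepl("^B", countries)               # one TRUE/FALSE per element

print(b_positions)
print(b_names)
print(countries[b_flags])
```

The logical vector from grepl() can be used directly for subsetting, which gives the same result as value = TRUE.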
+

We will now store these countries in a variable called candidateCountries, and then loop over each entry in the variable. Inside the loop, we evaluate the average life expectancy for each country, and if the average life expectancy is less than 50 we use base R graphics to plot life expectancy over time using with() and subset():

+
+

R +

+
+thresholdValue <- 50
+candidateCountries <- grep("^B", unique(gapminder$country), value = TRUE)
+
+for (iCountry in candidateCountries) {
+    tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"])
+
+    if (tmp < thresholdValue) {
+        cat("Average Life Expectancy in", iCountry, "is less than", thresholdValue, "plotting life expectancy graph... \n")
+
+        with(subset(gapminder, country == iCountry),
+                plot(year, lifeExp,
+                     type = "o",
+                     main = paste("Life Expectancy in", iCountry, "over time"),
+                     ylab = "Life Expectancy",
+                     xlab = "Year"
+                     ) # end plot
+             ) # end with
+    } # end if
+    rm(tmp)
+} # end for loop
+
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
    +
  • Use if and else to make choices.
  • Use for to repeat operations.
+
+
+
+

Content from Creating Publication-Quality Graphics with ggplot2

+
+

Last updated on 2023-10-26

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I create publication-quality graphics in R?
+
+
+
+
+
+
+

Objectives

+
    +
  • To be able to use ggplot2 to generate publication-quality graphics.
  • To apply geometry, aesthetic, and statistics layers to a ggplot plot.
  • To manipulate the aesthetics of a plot using different colors, shapes, and lines.
  • To improve data visualization through transforming scales and paneling by group.
  • To save a plot created with ggplot to disk.
+
+
+
+
+
+

Plotting our data is one of the best ways to quickly explore it and +the various relationships between variables.

+

There are three main plotting systems in R: the base plotting system, the lattice package, and the ggplot2 package.

+

Today we’ll be learning about the ggplot2 package, because it is the +most effective for creating publication-quality graphics.

+

ggplot2 is built on the grammar of graphics, the idea that any plot +can be built from the same set of components: a data +set, mapping aesthetics, and graphical +layers:

+
    +
  • Data sets are the data that you, the user, provide.

  • Mapping aesthetics are what connect the data to the graphics. They tell ggplot2 how to use your data to affect how the graph looks, such as changing what is plotted on the X or Y axis, or the size or color of different data points.

  • Layers are the actual graphical output from ggplot2. Layers determine what kinds of plot are shown (scatterplot, histogram, etc.), the coordinate system used (rectangular, polar, others), and other important aspects of the plot. The idea of layers of graphics may be familiar to you if you have used image editing programs like Photoshop, Illustrator, or Inkscape.
+

Let’s start off building an example using the gapminder data from +earlier. The most basic function is ggplot, which lets R +know that we’re creating a new plot. Any of the arguments we give the +ggplot function are the global options for the +plot: they apply to all layers on the plot.

+
+

R +

+
+library("ggplot2")
+ggplot(data = gapminder)
+
+
Blank plot, before adding any mapping aesthetics to ggplot().

Here we called ggplot and told it what data we want to +show on our figure. This is not enough information for +ggplot to actually draw anything. It only creates a blank +slate for other elements to be added to.

+

Now we’re going to add in the mapping aesthetics +using the aes function. aes tells +ggplot how variables in the data map to +aesthetic properties of the figure, such as which columns of +the data should be used for the x and +y locations.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
+
+
Plotting area with axes for a scatter plot of life expectancy vs GDP, with no data points visible.

Here we told ggplot we want to plot the “gdpPercap” column of the gapminder data frame on the x-axis, and the “lifeExp” column on the y-axis. Notice that we didn’t need to explicitly pass aes these columns (e.g. x = gapminder[, "gdpPercap"]). This is because ggplot looks for those columns in the data it was given.

+

The final part of making our plot is to tell ggplot how +we want to visually represent the data. We do this by adding a new +layer to the plot using one of the +geom functions.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point()
+
+
Scatter plot of life expectancy vs GDP per capita, now showing the data points.

Here we used geom_point, which tells ggplot +we want to visually represent the relationship between +x and y as a scatterplot of +points.

+
+
+ +
+
+

Challenge 1 +

+
+

Modify the example so that the figure shows how life expectancy has +changed over time:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point()
+
+

Hint: the gapminder dataset has a column called “year”, which should +appear on the x-axis.

+
+
+
+
+
+ +
+
+

Here is one possible solution:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) + geom_point()
+
+
Binned scatterplot of life expectancy versus year showing how life expectancy has increased over time
+
+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

In the previous examples and challenge we’ve used the +aes function to tell the scatterplot geom +about the x and y locations of each +point. Another aesthetic property we can modify is the point +color. Modify the code from the previous challenge to +color the points by the “continent” column. What trends +do you see in the data? Are they what you expected?

+
+
+
+
+
+ +
+
+

The solution presented below adds color=continent to the +call of the aes function. The general trend seems to +indicate an increased life expectancy over the years. On continents with +stronger economies we find a longer life expectancy.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_point()
+
+
Binned scatterplot of life expectancy vs year with color-coded continents showing value of ‘aes’ function
+
+
+
+
+

Layers +

+
+

Using a scatterplot probably isn’t the best for visualizing change +over time. Instead, let’s tell ggplot to visualize the data +as a line plot:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, color=continent)) +
+  geom_line()
+
+

Instead of adding a geom_point layer, we’ve added a +geom_line layer.

+

However, the result doesn’t look quite as we might have expected: it +seems to be jumping around a lot in each continent. Let’s try to +separate the data by country, plotting one line for each country:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
+  geom_line()
+
+

We’ve added the group aesthetic, which +tells ggplot to draw a line for each country.

+

But what if we want to visualize both lines and points on the plot? +We can add another layer to the plot:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
+  geom_line() + geom_point()
+
+

It’s important to note that each layer is drawn on top of the +previous layer. In this example, the points have been drawn on top +of the lines. Here’s a demonstration:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
+  geom_line(mapping = aes(color=continent)) + geom_point()
+
+

In this example, the aesthetic mapping of +color has been moved from the global plot options in +ggplot to the geom_line layer so it no longer +applies to the points. Now we can clearly see that the points are drawn +on top of the lines.

+
+
+ +
+
+

Tip: Setting an aesthetic to a value instead +of a mapping +

+
+

So far, we’ve seen how to use an aesthetic (such as +color) as a mapping to a variable in the data. +For example, when we use +geom_line(mapping = aes(color=continent)), ggplot will give +a different color to each continent. But what if we want to change the +color of all lines to blue? You may think that +geom_line(mapping = aes(color="blue")) should work, but it +doesn’t. Since we don’t want to create a mapping to a specific variable, +we can move the color specification outside of the aes() +function, like this: geom_line(color="blue").
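The difference between mapping and setting can be sketched with a tiny made-up data frame (toy, p_mapped, p_constant and p_set are illustrative names; the values are not from gapminder). This assumes the ggplot2 package is installed:

```r
library(ggplot2)

# Made-up data: two hypothetical countries observed in three years
toy <- data.frame(
  year    = rep(c(1952, 1957, 1962), times = 2),
  lifeExp = c(40, 42, 45, 60, 63, 65),
  country = rep(c("A", "B"), each = 3)
)

base_plot <- ggplot(data = toy, mapping = aes(x = year, y = lifeExp, group = country))

# Mapping: color depends on a column, so each country gets its own color
p_mapped <- base_plot + geom_line(mapping = aes(color = country))

# Mapping to a constant: "blue" is treated as a data value, not as a color name
p_constant <- base_plot + geom_line(mapping = aes(color = "blue"))

# Setting: color is given outside aes(), so every line is drawn in blue
p_set <- base_plot + geom_line(color = "blue")
```

Printing p_constant and p_set side by side shows why only the last form gives blue lines.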

+
+
+
+
+
+ +
+
+

Challenge 3 +

+
+

Switch the order of the point and line layers from the previous +example. What happened?

+
+
+
+
+
+ +
+
+

The lines now get drawn over the points!

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
+ geom_point() + geom_line(mapping = aes(color=continent))
+
+
Plot of life expectancy vs year with one line per country, colored by continent, drawn on top of the black data points.
+
+
+
+
+

Transformations and statistics +

+
+

ggplot2 also makes it easy to overlay statistical models over the +data. To demonstrate we’ll go back to our first example:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point()
+
+

Currently it’s hard to see the relationship between the points due to some strong outliers in GDP per capita. We can change the scale of units on the x axis using the scale functions. These control the mapping between the data values and visual values of an aesthetic. We can also modify the transparency of the points, using the alpha argument, which is especially helpful when you have a large amount of data which is very clustered.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10()
+
+
Scatterplot of GDP vs life expectancy showing logarithmic x-axis data spread

The scale_x_log10 function applied a log10 transformation to the x-axis scale of the plot, so that each multiple of 10 is evenly spaced from left to right. For example, a GDP per capita of 1,000 is the same horizontal distance away from a value of 10,000 as the 10,000 value is from 100,000. This helps to visualize the spread of the data along the x-axis.
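The even spacing can be checked with a little arithmetic in base R. This sketch uses the three values from the paragraph above (gdp_values and axis_positions are illustrative names):

```r
gdp_values <- c(1000, 10000, 100000)

# On a log10 scale, each value is positioned at its base-10 logarithm
axis_positions <- log10(gdp_values)
print(axis_positions)

# The gaps between neighbouring positions are equal
print(diff(axis_positions))
```

Each tenfold increase in GDP per capita adds the same amount to the logarithm, which is why the steps appear equally wide on the plot.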

+
+
+ +
+
+

Tip Reminder: Setting an aesthetic to a value +instead of a mapping +

+
+

Notice that we used geom_point(alpha = 0.5). As the +previous tip mentioned, using a setting outside of the +aes() function will cause this value to be used for all +points, which is what we want in this case. But just like any other +aesthetic setting, alpha can also be mapped to a variable in +the data. For example, we can give a different transparency to each +continent with +geom_point(mapping = aes(alpha = continent)).

+
+
+
+

We can fit a simple relationship to the data by adding another layer, +geom_smooth:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm")
+
+
+

OUTPUT +

+
`geom_smooth()` using formula = 'y ~ x'
+
+
Scatter plot of life expectancy vs GDP per capita with a blue trend line summarising the relationship between variables, and gray shaded area indicating 95% confidence intervals for that trend line.

We can make the line thicker by setting the +size aesthetic in the geom_smooth +layer:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm", size=1.5)
+
+
+

WARNING +

+
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
+ℹ Please use `linewidth` instead.
+This warning is displayed once every 8 hours.
+Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
+generated.
+
+
+

OUTPUT +

+
`geom_smooth()` using formula = 'y ~ x'
+
+
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The blue trend line is slightly thicker than in the previous figure.

There are two ways an aesthetic can be specified. Here we set the size aesthetic by passing it as an argument to geom_smooth. Previously in the lesson we’ve used the aes function to define a mapping between data variables and their visual representation. As the warning above indicates, ggplot2 3.4.0 and later prefer the linewidth argument over size for setting the thickness of lines.

+
+
+ +
+
+

Challenge 4a +

+
+

Modify the color and size of the points on the point layer in the +previous example.

+

Hint: do not use the aes function.

+
+
+
+
+
+ +
+
+

Here is a possible solution. Notice that the color and size arguments are supplied outside of the aes() function. This means that they apply to all data points on the graph and are not related to a specific variable.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+ geom_point(size=3, color="orange") + scale_x_log10() +
+ geom_smooth(method="lm", size=1.5)
+
+
+

OUTPUT +

+
`geom_smooth()` using formula = 'y ~ x'
+
+
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency.
+
+
+
+
+
+
+ +
+
+

Challenge 4b +

+
+

Modify your solution to Challenge 4a so that the points are now a +different shape and are colored by continent with new trendlines. Hint: +The color argument can be used inside the aesthetic.

+
+
+
+
+
+ +
+
+

Here is a possible solution. Notice that supplying the color argument inside the aes() function enables you to connect it to a certain variable. The shape argument, as you can see, modifies all data points the same way (it is outside the aes() call), while the color argument, which is placed inside the aes() call, modifies a point’s color based on its continent value.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
+ geom_point(size=3, shape=17) + scale_x_log10() +
+ geom_smooth(method="lm", size=1.5)
+
+
+

OUTPUT +

+
`geom_smooth()` using formula = 'y ~ x'
+
+
+
+
+
+
+

Multi-panel figures +

+
+

Earlier we visualized the change in life expectancy over time across +all countries in one plot. Alternatively, we can split this out over +multiple panels by adding a layer of facet panels.

+
+
+ +
+
+

Tip +

+
+

We start by making a subset of data including only countries located +in the Americas. This includes 25 countries, which will begin to clutter +the figure. Note that we apply a “theme” definition to rotate the x-axis +labels to maintain readability. Nearly everything in ggplot2 is +customizable.

+
+
+
+
+

R +

+
+americas <- gapminder[gapminder$continent == "Americas",]
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))
+
+

The facet_wrap layer took a “formula” as its argument, denoted by the tilde (~). This tells R to draw a panel for each unique value in the country column of the americas data frame.

+

Modifying text +

+
+

To clean this figure up for a publication we need to change some of +the text elements. The x-axis is too cluttered, and the y axis should +read “Life expectancy”, rather than the column name in the data +frame.

+

We can do this by adding a couple of different layers. The +theme layer controls the axis text, and overall text +size. Labels for the axes, plot title and any legend can be set using +the labs function. Legend titles are set using the same +names we used in the aes specification. Thus below the +color legend title is set using color = "Continent", while +the title of a fill legend would be set using +fill = "MyTitle".

+
+

R +

+
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_line() + facet_wrap( ~ country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+
+

Exporting the plot +

+
+

The ggsave() function allows you to export a plot +created with ggplot. You can specify the dimension and resolution of +your plot by adjusting the appropriate arguments (width, +height and dpi) to create high quality +graphics for publication. In order to save the plot from above, we first +assign it to a variable lifeExp_plot, then tell +ggsave to save that plot in png format to a +directory called results. (Make sure you have a +results/ folder in your working directory.)

+
+

R +

+
+lifeExp_plot <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_line() + facet_wrap( ~ country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+
+ggsave(filename = "results/lifeExp.png", plot = lifeExp_plot, width = 12, height = 10, dpi = 300, units = "cm")
+
+

There are two nice things about ggsave. First, it +defaults to the last plot, so if you omit the plot argument +it will automatically save the last plot you created with +ggplot. Secondly, it tries to determine the format you want +to save your plot in from the file extension you provide for the +filename (for example .png or .pdf). If you +need to, you can specify the format explicitly in the +device argument.

+

This is a taste of what you can do with ggplot2. RStudio provides a +really useful cheat +sheet of the different layers available, and more extensive +documentation is available on the ggplot2 website. All +RStudio cheat sheets can be found here. Finally, +if you have no idea how to change something, a quick Google search will +usually send you to a relevant question and answer on Stack Overflow +with reusable code to modify!

+
+
+ +
+
+

Challenge 5 +

+
+

Generate boxplots to compare life expectancy between the different +continents during the available years.

+

Advanced:

+
    +
  • Rename y axis as Life Expectancy.
  • Remove x axis labels.
+
+
+
+
+
+ +
+
+

Here is a possible solution. xlab() and ylab() set labels for the x and y axes, respectively. The axis title, text and ticks are attributes of the theme and must be modified within a theme() call.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = continent, y = lifeExp, fill = continent)) +
+ geom_boxplot() + facet_wrap(~year) +
+ ylab("Life Expectancy") +
+ theme(axis.title.x=element_blank(),
+       axis.text.x = element_blank(),
+       axis.ticks.x = element_blank())
+
+
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
    +
  • Use ggplot2 to create plots.
  • Think about graphics in layers: aesthetics, geometry, statistics, scale transformation, and grouping.
+
+
+
+

Content from Vectorization

+
+

Last updated on 2023-10-26

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I operate on all the elements of a vector at once?
+
+
+
+
+
+
+

Objectives

+
    +
  • To understand vectorized operations in R.
+
+
+
+
+
+

Most of R’s functions are vectorized, meaning that the function will +operate on all elements of a vector without needing to loop through and +act on each element one at a time. This makes writing code more concise, +easy to read, and less error prone.

+
+

R +

+
+x <- 1:4
+x * 2
+
+
+

OUTPUT +

+
[1] 2 4 6 8
+
+

The multiplication happened to each element of the vector.

+

We can also add two vectors together:

+
+

R +

+
+y <- 6:9
+x + y
+
+
+

OUTPUT +

+
[1]  7  9 11 13
+
+

Each element of x was added to its corresponding element +of y:

+
+

R +

+
x:  1  2  3  4
+    +  +  +  +
+y:  6  7  8  9
+---------------
+    7  9 11 13
+
+

Here is how we would add two vectors together using a for loop:

+
+

R +

+
+output_vector <- c()
+for (i in 1:4) {
+  output_vector[i] <- x[i] + y[i]
+}
+output_vector
+
+
+

OUTPUT +

+
[1]  7  9 11 13
+
+

Compare this to the output using vectorised operations.

+
+

R +

+
+sum_xy <- x + y
+sum_xy
+
+
+

OUTPUT +

+
[1]  7  9 11 13
+
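To confirm that the two approaches agree, a short self-contained sketch (it redefines x and y, and uses a preallocated result vector named loop_sum, which is an illustrative choice rather than the lesson's code):

```r
x <- 1:4
y <- 6:9

# Loop version, with the result vector preallocated to its final length
loop_sum <- integer(length(x))
for (i in seq_along(x)) {
  loop_sum[i] <- x[i] + y[i]
}

# Vectorized version
vectorized_sum <- x + y

# identical() checks that both results match exactly
same_result <- identical(loop_sum, vectorized_sum)
print(same_result)
```

The vectorized form gives the same answer in one line and leaves no index bookkeeping to get wrong.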
+
+
+ +
+
+

Challenge 1 +

+
+

Let’s try this on the pop column of the +gapminder dataset.

+

Make a new column in the gapminder data frame that +contains population in units of millions of people. Check the head or +tail of the data frame to make sure it worked.

+
+
+
+
+
+ +
+
+

Dividing the pop column by 1e6 is a vectorized operation, so a single line creates the new column for every row:

+
+

R +

+
+gapminder$pop_millions <- gapminder$pop / 1e6
+head(gapminder)
+
+
+

OUTPUT +

+
      country year      pop continent lifeExp gdpPercap pop_millions
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453     8.425333
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530     9.240934
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007    10.267083
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971    11.537966
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811    13.079460
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134    14.880372
+
+
+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

On a single graph, plot population, in millions, against year, for +all countries. Do not worry about identifying which country is +which.

+

Repeat the exercise, graphing only for China, India, and Indonesia. +Again, do not worry about which is which.

+
+
+
+
+
+ +
+
+

Refresh your plotting skills by plotting population in millions +against year.

+
+

R +

+
+ggplot(gapminder, aes(x = year, y = pop_millions)) +
+ geom_point()
+
+
Scatter plot showing populations in the millions against the year for all countries, countries are not labeled.
+

R +

+
+countryset <- c("China","India","Indonesia")
+ggplot(gapminder[gapminder$country %in% countryset,],
+       aes(x = year, y = pop_millions)) +
+  geom_point()
+
+
Scatter plot showing populations in the millions against the year for China, India, and Indonesia, countries are not labeled.
+
+
+
+
+

Comparison operators, logical operators, and many functions are also +vectorized:

+

Comparison operators

+
+

R +

+
+x > 2
+
+
+

OUTPUT +

+
[1] FALSE FALSE  TRUE  TRUE
+
+

Logical operators

+
+

R +

+
+a <- x > 3  # or, for clarity, a <- (x > 3)
+a
+
+
+

OUTPUT +

+
[1] FALSE FALSE FALSE  TRUE
+
+
+
+ +
+
+

Tip: some useful functions for logical +vectors +

+
+

any() will return TRUE if any +element of a vector is TRUE.
all() will return TRUE if all +elements of a vector are TRUE.
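As a small illustration of these two functions, the sketch below applies them to comparisons on a short vector. The vector `x` is redefined here so the example stands on its own.

```r
x <- 1:4

any(x > 3)   # TRUE: the last element is greater than 3
all(x > 3)   # FALSE: the first three elements are not
all(x > 0)   # TRUE: every element is positive
```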

+
+
+
+

Most functions also operate element-wise on vectors:

+

Functions

+
+

R +

+
+x <- 1:4
+log(x)
+
+
+

OUTPUT +

+
[1] 0.0000000 0.6931472 1.0986123 1.3862944
+
+

Vectorized operations work element-wise on matrices:

+
+

R +

+
+m <- matrix(1:12, nrow=3, ncol=4)
+m * -1
+
+
+

OUTPUT +

+
     [,1] [,2] [,3] [,4]
+[1,]   -1   -4   -7  -10
+[2,]   -2   -5   -8  -11
+[3,]   -3   -6   -9  -12
+
+
+
+ +
+
+

Tip: element-wise vs. matrix +multiplication +

+
+

Very important: the operator * gives you element-wise +multiplication! To do matrix multiplication, we need to use the +%*% operator:

+
+

R +

+
+m %*% matrix(1, nrow=4, ncol=1)
+
+
+

OUTPUT +

+
     [,1]
+[1,]   22
+[2,]   26
+[3,]   30
+
+
+

R +

+
+matrix(1:4, nrow=1) %*% matrix(1:4, ncol=1)
+
+
+

OUTPUT +

+
     [,1]
+[1,]   30
+
+

For more on matrix algebra, see the Quick-R +reference guide

+
+
+
+
+
+ +
+
+

Challenge 3 +

+
+

Given the following matrix:

+
+

R +

+
+m <- matrix(1:12, nrow=3, ncol=4)
+m
+
+
+

OUTPUT +

+
     [,1] [,2] [,3] [,4]
+[1,]    1    4    7   10
+[2,]    2    5    8   11
+[3,]    3    6    9   12
+
+

Write down what you think will happen when you run:

+
    +
  1. m ^ -1
  2. +
  3. m * c(1, 0, -1)
  4. +
  5. m > c(0, 20)
  6. +
  7. m * c(1, 0, -1, 2)
  8. +
+

Did you get the output you expected? If not, ask a helper!

+
+
+
+
+
+ +
+
+

Given the following matrix:

+
+

R +

+
+m <- matrix(1:12, nrow=3, ncol=4)
+m
+
+
+

OUTPUT +

+
     [,1] [,2] [,3] [,4]
+[1,]    1    4    7   10
+[2,]    2    5    8   11
+[3,]    3    6    9   12
+
+

Write down what you think will happen when you run:

+
    +
  1. m ^ -1
  2. +
+
+

OUTPUT +

+
          [,1]      [,2]      [,3]       [,4]
+[1,] 1.0000000 0.2500000 0.1428571 0.10000000
+[2,] 0.5000000 0.2000000 0.1250000 0.09090909
+[3,] 0.3333333 0.1666667 0.1111111 0.08333333
+
+
    +
  1. m * c(1, 0, -1)
  2. +
+
+

OUTPUT +

+
     [,1] [,2] [,3] [,4]
+[1,]    1    4    7   10
+[2,]    0    0    0    0
+[3,]   -3   -6   -9  -12
+
+
    +
  1. m > c(0, 20)
  2. +
+
+

OUTPUT +

+
      [,1]  [,2]  [,3]  [,4]
+[1,]  TRUE FALSE  TRUE FALSE
+[2,] FALSE  TRUE FALSE  TRUE
+[3,]  TRUE FALSE  TRUE FALSE
+
+
+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

We’re interested in looking at the sum of the following sequence of +fractions:

+
+

R +

+
+ x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2)
+
+

This would be tedious to type out, and impossible for high values of +n. Use vectorisation to compute x when n=100. What is the sum when +n=10,000?

+
+
+
+
+
+ +
+
+

We’re interested in looking at the sum of the following sequence of +fractions:

+
+

R +

+
+ x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2)
+
+

This would be tedious to type out, and impossible for high values of +n. Can you use vectorisation to compute x, when n=100? How about when +n=10,000?

+
+

R +

+
+sum(1/(1:100)^2)
+
+
+

OUTPUT +

+
[1] 1.634984
+
+
+

R +

+
+sum(1/(1:1e04)^2)
+
+
+

OUTPUT +

+
[1] 1.644834
+
+
+

R +

+
+n <- 10000
+sum(1/(1:n)^2)
+
+
+

OUTPUT +

+
[1] 1.644834
+
+

We can also obtain the same results using a function:

+
+

R +

+
+inverse_sum_of_squares <- function(n) {
+  sum(1/(1:n)^2)
+}
+inverse_sum_of_squares(100)
+
+
+

OUTPUT +

+
[1] 1.634984
+
+
+

R +

+
+inverse_sum_of_squares(10000)
+
+
+

OUTPUT +

+
[1] 1.644834
+
+
+

R +

+
+n <- 10000
+inverse_sum_of_squares(n)
+
+
+

OUTPUT +

+
[1] 1.644834
+
+
+
+
+
+
+
+ +
+
+

Tip: Operations on vectors of unequal +length +

+
+

Operations can also be performed on vectors of unequal length, through a process known as recycling. This process automatically repeats the shorter vector until it matches the length of the longer vector. R will issue a warning if the length of the longer vector is not a multiple of the length of the shorter vector.

+
+

R +

+
+x <- c(1, 2, 3)
+y <- c(1, 2, 3, 4, 5, 6, 7)
+x + y
+
+
+

WARNING +

+
Warning in x + y: longer object length is not a multiple of shorter object
+length
+
+
+

OUTPUT +

+
[1] 2 4 6 5 7 9 8
+
+

Vector x was recycled to match the length of vector +y

+
+

R +

+
x:  1  2  3  1  2  3  1
+    +  +  +  +  +  +  +
+y:  1  2  3  4  5  6  7
+-----------------------
+    2  4  6  5  7  9  8
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
    +
  • Use vectorized operations instead of loops.
  • +
+
+
+

Content from Functions Explained

+
+

Last updated on 2023-10-26

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I write a new function in R?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • Define a function that takes arguments.
  • +
  • Return a value from a function.
  • +
  • Check argument conditions with stopifnot() in +functions.
  • +
  • Test a function.
  • +
  • Set default values for function arguments.
  • +
  • Explain why we should divide programs into small, single-purpose +functions.
  • +
+
+
+
+
+
+

If we only had one data set to analyze, it would probably be faster +to load the file into a spreadsheet and use that to plot simple +statistics. However, the gapminder data is updated periodically, and we +may want to pull in that new information later and re-run our analysis +again. We may also obtain similar data from a different source in the +future.

+

In this lesson, we’ll learn how to write a function so that we can +repeat several operations with a single command.

+
+
+ +
+
+

What is a function? +

+
+

Functions gather a sequence of operations into a whole, preserving it +for ongoing use. Functions provide:

+
    +
  • a name we can remember and invoke it by
  • +
  • relief from the need to remember the individual operations
  • +
  • a defined set of inputs and expected outputs
  • +
  • rich connections to the larger programming environment
  • +
+

As the basic building block of most programming languages, +user-defined functions constitute “programming” as much as any single +abstraction can. If you have written a function, you are a computer +programmer.

+
+
+
+

Defining a function +

+
+

Let’s open a new R script file in the functions/ +directory and call it functions-lesson.R.

+

The general structure of a function is:

+
+

R +

+
+my_function <- function(parameters) {
+  # perform action
+  # return value
+}
+
+

Let’s define a function fahr_to_kelvin() that converts +temperatures from Fahrenheit to Kelvin:

+
+

R +

+
+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+

We define fahr_to_kelvin() by assigning it to the output of function. The list of argument names is contained within parentheses. Next, the body of the function (the statements that are executed when it runs) is contained within curly braces ({}). The statements in the body are indented by two spaces. This makes the code easier to read but does not affect how the code operates.

+

It is useful to think of creating functions like writing a cookbook. +First you define the “ingredients” that your function needs. In this +case, we only need one ingredient to use our function: “temp”. After we +list our ingredients, we then say what we will do with them, in this +case, we are taking our ingredient and applying a set of mathematical +operators to it.

+

When we call the function, the values we pass to it as arguments are +assigned to those variables so that we can use them inside the function. +Inside the function, we use a return statement to send a +result back to whoever asked for it.

+
+
+ +
+
+

Tip +

+
+

In R, the return statement is not required. R automatically returns the value of the last expression evaluated in the body of the function. But for clarity, we will explicitly define the return statement.
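To make this concrete, here is a minimal sketch of the same conversion written without return(). The name `fahr_to_kelvin_implicit` is hypothetical and is used only so it does not overwrite the lesson's function.

```r
# Hypothetical variant of fahr_to_kelvin() without an explicit return()
fahr_to_kelvin_implicit <- function(temp) {
  # The value of this last expression becomes the function's result
  ((temp - 32) * (5 / 9)) + 273.15
}

# freezing point of water
fahr_to_kelvin_implicit(32)
```

Both styles give the same result; the explicit form simply makes the returned value easier to spot when a function grows longer.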

+
+
+
+

Let’s try running our function. Calling our own function is no +different from calling any other function:

+
+

R +

+
+# freezing point of water
+fahr_to_kelvin(32)
+
+
+

OUTPUT +

+
[1] 273.15
+
+
+

R +

+
+# boiling point of water
+fahr_to_kelvin(212)
+
+
+

OUTPUT +

+
[1] 373.15
+
+
+
+ +
+
+

Challenge 1 +

+
+

Write a function called kelvin_to_celsius() that takes a +temperature in Kelvin and returns that temperature in Celsius.

+

Hint: To convert from Kelvin to Celsius you subtract 273.15

+
+
+
+
+
+ +
+
+

Write a function called kelvin_to_celsius that takes a +temperature in Kelvin and returns that temperature in Celsius

+
+

R +

+
+kelvin_to_celsius <- function(temp) {
+ celsius <- temp - 273.15
+ return(celsius)
+}
+
+
+
+
+
+

Combining functions +

+
+

The real power of functions comes from mixing, matching and combining +them into ever-larger chunks to get the effect we want.

+

Let’s define two functions that will convert temperature from +Fahrenheit to Kelvin, and Kelvin to Celsius:

+
+

R +

+
+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+kelvin_to_celsius <- function(temp) {
+  celsius <- temp - 273.15
+  return(celsius)
+}
+
+
+
+ +
+
+

Challenge 2 +

+
+

Define the function to convert directly from Fahrenheit to Celsius, +by reusing the two functions above (or using your own functions if you +prefer).

+
+
+
+
+
+ +
+
+

Define the function to convert directly from Fahrenheit to Celsius, +by reusing these two functions above

+
+

R +

+
+fahr_to_celsius <- function(temp) {
+  temp_k <- fahr_to_kelvin(temp)
+  result <- kelvin_to_celsius(temp_k)
+  return(result)
+}
+
+
+
+
+
+

Interlude: Defensive Programming +

+
+

Now that we’ve begun to appreciate how writing functions provides an +efficient way to make R code re-usable and modular, we should note that +it is important to ensure that functions only work in their intended +use-cases. Checking function parameters is related to the concept of +defensive programming. Defensive programming encourages us to +frequently check conditions and throw an error if something is wrong. +These checks are referred to as assertion statements because we want to +assert some condition is TRUE before proceeding. They make +it easier to debug because they give us a better idea of where the +errors originate.

+
+

Checking conditions with stopifnot() + +

+

Let’s start by re-examining fahr_to_kelvin(), our +function for converting temperatures from Fahrenheit to Kelvin. It was +defined like so:

+
+

R +

+
+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+

For this function to work as intended, the argument temp +must be a numeric value; otherwise, the mathematical +procedure for converting between the two temperature scales will not +work. To create an error, we can use the function stop(). +For example, since the argument temp must be a +numeric vector, we could check for this condition with an +if statement and throw an error if the condition was +violated. We could augment our function above like so:

+
+

R +

+
+fahr_to_kelvin <- function(temp) {
+  if (!is.numeric(temp)) {
+    stop("temp must be a numeric vector.")
+  }
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+

If we had multiple conditions or arguments to check, it would take many lines of code to check all of them. Luckily R provides the convenience function stopifnot(). We can list as many requirements as we like, each of which should evaluate to TRUE; stopifnot() throws an error if it finds one that is FALSE. Listing these conditions also serves a secondary purpose as extra documentation for the function.
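As a rough sketch of listing several requirements at once, the function below checks three conditions in a single stopifnot() call. The function name `fahr_to_kelvin_strict` and the two extra checks (non-empty input, no missing values) are assumptions made for illustration and are not part of the lesson's function.

```r
# Hypothetical variant with several assertions in one stopifnot() call
fahr_to_kelvin_strict <- function(temp) {
  stopifnot(
    is.numeric(temp),   # must be a numeric vector
    length(temp) > 0,   # must contain at least one value
    !anyNA(temp)        # must not contain missing values
  )
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

# Works for valid input
fahr_to_kelvin_strict(c(32, 212))

# try() captures the error so the script can continue past the failure
result <- try(fahr_to_kelvin_strict(c(32, NA)), silent = TRUE)
inherits(result, "try-error")
```

stopifnot() stops at the first condition that is not TRUE, so the error message points at the specific requirement that was violated.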

+

Let’s try out defensive programming with stopifnot() by +adding assertions to check the input to our function +fahr_to_kelvin().

+

We want to assert the following: temp is a numeric +vector. We may do that like so:

+
+

R +

+
+fahr_to_kelvin <- function(temp) {
+  stopifnot(is.numeric(temp))
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+

It still works when given proper input.

+
+

R +

+
+# freezing point of water
+fahr_to_kelvin(temp = 32)
+
+
+

OUTPUT +

+
[1] 273.15
+
+

But fails instantly if given improper input.

+
+

R +

+
+# temp is a factor instead of numeric
+fahr_to_kelvin(temp = as.factor(32))
+
+
+

ERROR +

+
Error in fahr_to_kelvin(temp = as.factor(32)): is.numeric(temp) is not TRUE
+
+
+
+ +
+
+

Challenge 3 +

+
+

Use defensive programming to ensure that our +fahr_to_celsius() function throws an error immediately if +the argument temp is specified inappropriately.

+
+
+
+
+
+ +
+
+

Extend our previous definition of the function by adding in an +explicit call to stopifnot(). Since +fahr_to_celsius() is a composition of two other functions, +checking inside here makes adding checks to the two component functions +redundant.

+
+

R +

+
+fahr_to_celsius <- function(temp) {
+  stopifnot(is.numeric(temp))
+  temp_k <- fahr_to_kelvin(temp)
+  result <- kelvin_to_celsius(temp_k)
+  return(result)
+}
+
+
+
+
+
+
+

More on combining functions +

+
+

Now, we’re going to define a function that calculates the Gross +Domestic Product of a nation from the data available in our dataset:

+
+

R +

+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat) {
+  gdp <- dat$pop * dat$gdpPercap
+  return(gdp)
+}
+
+

We define calcGDP() by assigning it to the output of function. The list of argument names is contained within parentheses. Next, the body of the function (the statements executed when you call the function) is contained within curly braces ({}).

+

We’ve indented the statements in the body by two spaces. This makes +the code easier to read but does not affect how it operates.

+

When we call the function, the values we pass to it are assigned to +the arguments, which become variables inside the body of the +function.

+

Inside the function, we use the return() function to +send back the result. This return() function is optional: R +will automatically return the results of whatever command is executed on +the last line of the function.

+
+

R +

+
+calcGDP(head(gapminder))
+
+
+

OUTPUT +

+
[1]  6567086330  7585448670  8758855797  9648014150  9678553274 11697659231
+
+

That’s not very informative. Let’s add some more arguments so we can +extract that per year and country.

+
+

R +

+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year=NULL, country=NULL) {
+  if(!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country,]
+  }
+  gdp <- dat$pop * dat$gdpPercap
+
+  new <- cbind(dat, gdp=gdp)
+  return(new)
+}
+
+

If you’ve been writing these functions down into a separate R script (a good idea!), you can load the functions into your R session by using the source() function:

+
+

R +

+
+source("functions/functions-lesson.R")
+
+

Ok, so there’s a lot going on in this function now. In plain English, +the function now subsets the provided data by year if the year argument +isn’t empty, then subsets the result by country if the country argument +isn’t empty. Then it calculates the GDP for whatever subset emerges from +the previous two steps. The function then adds the GDP as a new column +to the subsetted data and returns this as the final result. You can see +that the output is much more informative than a vector of numbers.

+

Let’s take a look at what happens when we specify the year:

+
+

R +

+
+head(calcGDP(gapminder, year=2007))
+
+
+

OUTPUT +

+
       country year      pop continent lifeExp  gdpPercap          gdp
+12 Afghanistan 2007 31889923      Asia  43.828   974.5803  31079291949
+24     Albania 2007  3600523    Europe  76.423  5937.0295  21376411360
+36     Algeria 2007 33333216    Africa  72.301  6223.3675 207444851958
+48      Angola 2007 12420476    Africa  42.731  4797.2313  59583895818
+60   Argentina 2007 40301927  Americas  75.320 12779.3796 515033625357
+72   Australia 2007 20434176   Oceania  81.235 34435.3674 703658358894
+
+

Or for a specific country:

+
+

R +

+
+calcGDP(gapminder, country="Australia")
+
+
+

OUTPUT +

+
     country year      pop continent lifeExp gdpPercap          gdp
+61 Australia 1952  8691212   Oceania  69.120  10039.60  87256254102
+62 Australia 1957  9712569   Oceania  70.330  10949.65 106349227169
+63 Australia 1962 10794968   Oceania  70.930  12217.23 131884573002
+64 Australia 1967 11872264   Oceania  71.100  14526.12 172457986742
+65 Australia 1972 13177000   Oceania  71.930  16788.63 221223770658
+66 Australia 1977 14074100   Oceania  73.490  18334.20 258037329175
+67 Australia 1982 15184200   Oceania  74.740  19477.01 295742804309
+68 Australia 1987 16257249   Oceania  76.320  21888.89 355853119294
+69 Australia 1992 17481977   Oceania  77.560  23424.77 409511234952
+70 Australia 1997 18565243   Oceania  78.830  26997.94 501223252921
+71 Australia 2002 19546792   Oceania  80.370  30687.75 599847158654
+72 Australia 2007 20434176   Oceania  81.235  34435.37 703658358894
+
+

Or both:

+
+

R +

+
+calcGDP(gapminder, year=2007, country="Australia")
+
+
+

OUTPUT +

+
     country year      pop continent lifeExp gdpPercap          gdp
+72 Australia 2007 20434176   Oceania  81.235  34435.37 703658358894
+
+

Let’s walk through the body of the function:

+
+

R +

+
calcGDP <- function(dat, year=NULL, country=NULL) {
+
+

Here we’ve added two arguments, year, and +country. We’ve set default arguments for both as +NULL using the = operator in the function +definition. This means that those arguments will take on those values +unless the user specifies otherwise.
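A stripped-down sketch of the same idea: the hypothetical helper `describe_filter()` below has one argument with a NULL default, and behaves differently depending on whether the caller supplies a value.

```r
# Hypothetical helper showing an argument with a NULL default
describe_filter <- function(year = NULL) {
  if (is.null(year)) {
    # No value supplied, so the default NULL is still in place
    return("no year filter")
  }
  return(paste("filtering to", paste(year, collapse = ", ")))
}

describe_filter()             # uses the default
describe_filter(year = 2007)  # overrides the default
```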

+
+

R +

+
+  if(!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country,]
+  }
+
+

Here, we check whether each additional argument is NULL, and whenever one is not NULL, we overwrite the dataset stored in dat with a subset given by the non-NULL argument.

+

Building these conditionals into the function makes it more flexible +for later. Now, we can use it to calculate the GDP for:

+
    +
  • The whole dataset;
  • +
  • A single year;
  • +
  • A single country;
  • +
  • A single combination of year and country.
  • +
+

By using %in% rather than ==, we can also give multiple years or countries to those arguments.
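The sketch below illustrates why %in% suits this job, using a small made-up data frame (`toy`, with invented values) in place of gapminder. With ==, a two-element vector would be recycled down the column and compared position by position, which is not a membership test.

```r
# Made-up data standing in for gapminder; values are illustrative only
toy <- data.frame(
  country = c("A", "A", "A", "B", "B", "B"),
  year    = c(1997, 2002, 2007, 1997, 2002, 2007),
  pop     = c(10, 11, 12, 20, 21, 22)
)
wanted_years <- c(2002, 2007)

# %in% tests each row's year for membership in wanted_years
subset_rows <- toy[toy$year %in% wanted_years, ]
subset_rows

# == recycles wanted_years along the column, so only rows that happen
# to line up with the recycled values match
sum(toy$year %in% wanted_years)  # 4 rows match
sum(toy$year == wanted_years)    # 2 rows match
```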

+
+
+ +
+
+

Tip: Pass by value +

+
+

Functions in R almost always make copies of the data to operate on +inside of a function body. When we modify dat inside the +function we are modifying the copy of the gapminder dataset stored in +dat, not the original variable we gave as the first +argument.

+

This is called “pass-by-value” and it makes writing code much safer: +you can always be sure that whatever changes you make within the body of +the function, stay inside the body of the function.
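A minimal sketch of this behaviour, using a hypothetical function `add_one_to_pop()` and a tiny made-up data frame:

```r
# Hypothetical function that modifies its argument
add_one_to_pop <- function(dat) {
  dat$pop <- dat$pop + 1   # changes only the function's own copy
  return(dat)
}

original <- data.frame(pop = c(10, 20))
modified <- add_one_to_pop(original)

original$pop   # unchanged: 10 20
modified$pop   # the returned copy: 11 21
```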

+
+
+
+
+
+ +
+
+

Tip: Function scope +

+
+

Another important concept is scoping: any variables (or functions!) +you create or modify inside the body of a function only exist for the +lifetime of the function’s execution. When we call +calcGDP(), the variables dat, gdp +and new only exist inside the body of the function. Even if +we have variables of the same name in our interactive R session, they +are not modified in any way when executing a function.
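The following sketch shows scoping in action. The function `compute_gdp()` is hypothetical; it creates a local variable named gdp while a different gdp already exists in the session.

```r
# A variable in the interactive session
gdp <- "defined in the session"

# Hypothetical function that creates its own variable called gdp
compute_gdp <- function(pop, gdpPercap) {
  gdp <- pop * gdpPercap   # local to this call
  return(gdp)
}

compute_gdp(10, 2)   # 20
gdp                  # still "defined in the session"
```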

+
+
+
+
+

R +

+
  gdp <- dat$pop * dat$gdpPercap
+  new <- cbind(dat, gdp=gdp)
+  return(new)
+}
+
+

Finally, we calculated the GDP on our new subset, and created a new +data frame with that column added. This means when we call the function +later we can see the context for the returned GDP values, which is much +better than in our first attempt where we got a vector of numbers.

+
+
+ +
+
+

Challenge 4 +

+
+

Test out your GDP function by calculating the GDP for New Zealand in +1987. How does this differ from New Zealand’s GDP in 1952?

+
+
+
+
+
+ +
+
+
+

R +

+
+  calcGDP(gapminder, year = c(1952, 1987), country = "New Zealand")
+
+

GDP for New Zealand in 1987: 65050008703

+

GDP for New Zealand in 1952: 21058193787

+
+
+
+
+
+
+ +
+
+

Challenge 5 +

+
+

The paste() function can be used to combine text +together, e.g:

+
+

R +

+
+best_practice <- c("Write", "programs", "for", "people", "not", "computers")
+paste(best_practice, collapse=" ")
+
+
+

OUTPUT +

+
[1] "Write programs for people not computers"
+
+

Write a function called fence() that takes two vectors +as arguments, called text and wrapper, and +prints out the text wrapped with the wrapper:

+
+

R +

+
+fence(text=best_practice, wrapper="***")
+
+

Note: the paste() function has an argument called sep, which specifies the separator between text. The default is a space: " ". The default for paste0() is no separator: "".
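A quick sketch of how the separator behaves (the strings here are arbitrary):

```r
paste("a", "b")              # "a b": sep defaults to a single space
paste("a", "b", sep = "-")   # "a-b": custom separator
paste0("a", "b")             # "ab": no separator
```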

+
+
+
+
+
+ +
+
+

Write a function called fence() that takes two vectors +as arguments, called text and wrapper, and +prints out the text wrapped with the wrapper:

+
+

R +

+
+fence <- function(text, wrapper){
+  text <- c(wrapper, text, wrapper)
+  result <- paste(text, collapse = " ")
+  return(result)
+}
+best_practice <- c("Write", "programs", "for", "people", "not", "computers")
+fence(text=best_practice, wrapper="***")
+
+
+

OUTPUT +

+
[1] "*** Write programs for people not computers ***"
+
+
+
+
+
+
+
+ +
+
+

Tip +

+
+

R has some unique aspects that can be exploited when performing more +complicated operations. We will not be writing anything that requires +knowledge of these more advanced concepts. In the future when you are +comfortable writing functions in R, you can learn more by reading the R +Language Manual or this chapter from Advanced R Programming by Hadley +Wickham.

+
+
+
+
+
+ +
+
+

Tip: Testing and documenting +

+
+

It’s important to both test functions and document them: documentation helps you, and others, understand what the purpose of your function is and how to use it, and testing makes sure that your function actually does what you think it does.

+

When you first start out, your workflow will probably look a lot like +this:

+
    +
  1. Write a function
  2. +
  3. Comment parts of the function to document its behaviour
  4. +
  5. Load in the source file
  6. +
  7. Experiment with it in the console to make sure it behaves as you +expect
  8. +
  9. Make any necessary bug fixes
  10. +
  11. Rinse and repeat.
  12. +
+

Formal documentation for functions, written in separate .Rd files, gets turned into the documentation you see in help files. The roxygen2 package allows R coders to write documentation alongside the function code and then process it into the appropriate .Rd files. You will want to switch to this more formal method of writing documentation when you start writing more complicated R projects. In fact, packages are, in essence, bundles of functions with this formal documentation. Loading your own functions through source("functions.R") is similar to loading someone else’s functions (or your own one day!) through library("package").
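As a rough sketch of what roxygen2-style documentation looks like, the comment block below uses the `#'` prefix and a few common tags (`@param`, `@return`, `@examples`) above the lesson's fahr_to_kelvin() function. The wording of the descriptions is an assumption; outside a package these lines are ordinary comments and the function runs as usual.

```r
#' Convert Fahrenheit to Kelvin
#'
#' @param temp A numeric vector of temperatures in degrees Fahrenheit.
#' @return A numeric vector of temperatures in Kelvin.
#' @examples
#' fahr_to_kelvin(32)
fahr_to_kelvin <- function(temp) {
  stopifnot(is.numeric(temp))
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}
```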

+

Formal automated tests can be written using the testthat package.

+
+
+
+
+
+ +
+
+

Keypoints +

+
+
    +
  • Use function to define a new function in R.
  • +
  • Use parameters to pass values into functions.
  • +
  • Use stopifnot() to flexibly check function arguments in +R.
  • +
  • Load functions into programs using source().
  • +
+
+
+
+

Content from Writing Data

+
+

Last updated on 2023-10-26

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I save plots and data created in R?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • To be able to write out plots and data from R.
  • +
+
+
+
+
+
+

Saving plots +

+
+

You have already seen how to save the most recent plot you create in +ggplot2, using the command ggsave. As a +refresher:

+
+

R +

+
+ggsave("My_most_recent_plot.pdf")
+
+

You can save a plot from within RStudio using the ‘Export’ button in +the ‘Plot’ window. This will give you the option of saving as a .pdf or +as .png, .jpg or other image formats.

+

Sometimes you will want to save plots without creating them in the +‘Plot’ window first. Perhaps you want to make a pdf document with +multiple pages: each one a different plot, for example. Or perhaps +you’re looping through multiple subsets of a file, plotting data from +each subset, and you want to save each plot, but obviously can’t stop +the loop to click ‘Export’ for each one.

+

In this case you can use a more flexible approach. The function +pdf creates a new pdf device. You can control the size and +resolution using the arguments to this function.

+
+

R +

+
+pdf("Life_Exp_vs_time.pdf", width=12, height=4)
+ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=country)) +
+  geom_line() +
+  theme(legend.position = "none")
+
+# You then have to make sure to turn off the pdf device!
+
+dev.off()
+
+

Open up this document and have a look.

+
+
+ +
+
+

Challenge 1 +

+
+

Rewrite your ‘pdf’ command to print a second page in the pdf, showing +a facet plot (hint: use facet_grid) of the same data with +one panel per continent.

+
+
+
+
+
+ +
+
+
+

R +

+
+pdf("Life_Exp_vs_time.pdf", width = 12, height = 4)
+p <- ggplot(data = gapminder, aes(x = year, y = lifeExp, colour = country)) +
+  geom_line() +
+  theme(legend.position = "none")
+p
+p + facet_grid(~continent)
+dev.off()
+
+
+
+
+
+

The commands jpeg, png etc. are used +similarly to produce documents in different formats.

+

Writing data +

+
+

At some point, you’ll also want to write out data from R.

+

We can use the write.table function for this, which is +very similar to read.table from before.

+

Let’s create a data-cleaning script. For this analysis, we only want to focus on the gapminder data for Australia:

+
+

R +

+
+aust_subset <- gapminder[gapminder$country == "Australia",]
+
+write.table(aust_subset,
+  file="cleaned-data/gapminder-aus.csv",
+  sep=","
+)
+
+

Let’s switch back to the shell to take a look at the data to make +sure it looks OK:

+
+

BASH +

+
head cleaned-data/gapminder-aus.csv
+
+
+

OUTPUT +

+
"country","year","pop","continent","lifeExp","gdpPercap"
+"61","Australia",1952,8691212,"Oceania",69.12,10039.59564
+"62","Australia",1957,9712569,"Oceania",70.33,10949.64959
+"63","Australia",1962,10794968,"Oceania",70.93,12217.22686
+"64","Australia",1967,11872264,"Oceania",71.1,14526.12465
+"65","Australia",1972,13177000,"Oceania",71.93,16788.62948
+"66","Australia",1977,14074100,"Oceania",73.49,18334.19751
+"67","Australia",1982,15184200,"Oceania",74.74,19477.00928
+"68","Australia",1987,16257249,"Oceania",76.32,21888.88903
+"69","Australia",1992,17481977,"Oceania",77.56,23424.76683
+
+

Hmm, that’s not quite what we wanted. Where did all these quotation +marks come from? Also the row numbers are meaningless.

+

Let’s look at the help file to work out how to change this +behaviour.

+
+

R +

+
+?write.table
+
+

By default R will wrap character vectors with quotation marks when +writing out to file. It will also write out the row and column +names.

+

Let’s fix this:

+
+

R +

+
+write.table(
+  gapminder[gapminder$country == "Australia",],
+  file="cleaned-data/gapminder-aus.csv",
+  sep=",", quote=FALSE, row.names=FALSE
+)
+
+

Now let’s look at the data again using our shell skills:

+
+

BASH +

+
head cleaned-data/gapminder-aus.csv
+
+
+

OUTPUT +

+
country,year,pop,continent,lifeExp,gdpPercap
+Australia,1952,8691212,Oceania,69.12,10039.59564
+Australia,1957,9712569,Oceania,70.33,10949.64959
+Australia,1962,10794968,Oceania,70.93,12217.22686
+Australia,1967,11872264,Oceania,71.1,14526.12465
+Australia,1972,13177000,Oceania,71.93,16788.62948
+Australia,1977,14074100,Oceania,73.49,18334.19751
+Australia,1982,15184200,Oceania,74.74,19477.00928
+Australia,1987,16257249,Oceania,76.32,21888.88903
+Australia,1992,17481977,Oceania,77.56,23424.76683
+
+

That looks better!
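The file can also be checked from within R by reading it back in. The sketch below is one way to do that under stated assumptions: it uses a small made-up data frame (`aust_toy`) and a temporary file path rather than the lesson's cleaned-data/ directory, so it can be run anywhere.

```r
# Made-up data frame standing in for the Australian subset
aust_toy <- data.frame(
  country = c("Australia", "Australia"),
  year    = c(1952, 1957),
  pop     = c(8691212, 9712569)
)

# Temporary path, used here only so the sketch is self-contained
out_file <- tempfile(fileext = ".csv")

write.table(
  aust_toy,
  file = out_file,
  sep = ",", quote = FALSE, row.names = FALSE
)

# Read the file back and inspect its structure
check <- read.csv(out_file)
str(check)
```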

+
+
+ +
+
+

Challenge 2 +

+
+

Write a data-cleaning script file that subsets the gapminder data to +include only data points collected since 1990.

+

Use this script to write out the new subset to a file in the +cleaned-data/ directory.

+
+
+
+
+
+ +
+
+
+

R +

+
+write.table(
+  gapminder[gapminder$year > 1990, ],
+  file = "cleaned-data/gapminder-after1990.csv",
+  sep = ",", quote = FALSE, row.names = FALSE
+)
+
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
    +
  • Save plots from RStudio using the ‘Export’ button.
  • +
  • Use write.table to save tabular data.
  • +
+
+
+
+

Content from Splitting and Combining Data Frames with plyr

+
+

Last updated on 2023-10-26

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I do different calculations on different sets of data?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • To be able to use the split-apply-combine strategy for data +analysis.
  • +
+
+
+
+
+
+

Previously we looked at how you can use functions to simplify your +code. We defined the calcGDP function, which takes the +gapminder dataset, and multiplies the population and GDP per capita +column. We also defined additional arguments so we could filter by +year and country:

+
+

R +

+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year=NULL, country=NULL) {
+  if(!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country,]
+  }
+  gdp <- dat$pop * dat$gdpPercap
+
+  new <- cbind(dat, gdp=gdp)
+  return(new)
+}
+
+

A common task you’ll encounter when working with data is running calculations on different groups within the data. In the above, we were calculating the GDP by multiplying two columns together. But what if we wanted to calculate the mean GDP per continent?

+

We could run calcGDP and then take the mean of each +continent:

+
+

R +

+
+withGDP <- calcGDP(gapminder)
+mean(withGDP[withGDP$continent == "Africa", "gdp"])
+
+
+

OUTPUT +

+
[1] 20904782844
+
+
+

R +

+
+mean(withGDP[withGDP$continent == "Americas", "gdp"])
+
+
+

OUTPUT +

+
[1] 379262350210
+
+
+

R +

+
+mean(withGDP[withGDP$continent == "Asia", "gdp"])
+
+
+

OUTPUT +

+
[1] 227233738153
+
+

But this isn’t very nice. Yes, by using a function, you have +reduced a substantial amount of repetition. That is +nice. But there is still repetition. Repeating yourself will cost you +time, both now and later, and potentially introduce some nasty bugs.

+

We could write a new function that is flexible like +calcGDP, but this also takes a substantial amount of effort +and testing to get right.

+

The abstract problem we’re encountering here is known as “split-apply-combine”:

+
Split apply combine

We want to split our data into groups, in this case +continents, apply some calculations on that group, then +optionally combine the results together afterwards.

+

The plyr package +

+
+

For those of you who have used R before, you might be familiar with +the apply family of functions. While R’s built in functions +do work, we’re going to introduce you to another method for solving the +“split-apply-combine” problem. The plyr package provides a set of +functions that we find more user friendly for solving this problem.
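For comparison, here is a minimal base R sketch of the three steps, using split() and sapply() on a small made-up data frame (`toy`, with invented values). It is shown only to make the split, apply and combine stages visible, not as the approach this lesson recommends.

```r
# Made-up data; values are illustrative only
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  gdp       = c(10, 20, 100, 300)
)

# Split: one numeric vector of gdp values per continent
pieces <- split(toy$gdp, toy$continent)

# Apply mean() to each piece; sapply() combines the results
# into a single named vector
means <- sapply(pieces, mean)
means
```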

+

We installed this package in an earlier challenge. Let us load it +now:

+
+

R +

+
+library("plyr")
+
+

Plyr has functions for operating on lists, +data.frames and arrays (matrices, or +n-dimensional vectors). Each function performs:

+
    +
  1. A splitting operation.
  2. Apply a function on each split in turn.
  3. Recombine output data as a single data object.
+

The functions are named based on the data structure they expect as +input, and the data structure you want returned as output: [a]rray, +[l]ist, or [d]ata.frame. The first letter corresponds to the input data +structure, the second letter to the output data structure, and then the +rest of the function is named “ply”.

+

This gives us 9 core functions **ply. There are an additional three +functions which will only perform the split and apply steps, and not any +combine step. They’re named by their input data type and represent null +output by a _ (see table).
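The naming scheme can be spelled out with a few lines of base R. This is only a sketch that builds the names as strings; the variable names `types`, `core` and `no_output` are chosen here for illustration.

```r
# The three data structure letters: [a]rray, [l]ist, [d]ata.frame
types <- c("a", "l", "d")

# Every input/output combination followed by "ply" gives the 9 core names
core <- as.vector(outer(types, types, function(input, output) {
  paste0(input, output, "ply")
}))
print(core)

# The three split-and-apply-only variants use "_" for the missing output
no_output <- paste0(types, "_ply")
print(no_output)
```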

+

Note here that plyr’s use of “array” is different to R’s: an array in +plyr can include a vector or matrix.

+
Full apply suite

Each of the xxply functions (daply, ddply, +llply, laply, …) has the same structure, +with 4 key features:

+
+

R +

+
+xxply(.data, .variables, .fun)
+
+
    +
  • The first letter of the function name gives the input type and the +second gives the output type.
  • +
  • .data - gives the data object to be processed
  • +
  • .variables - identifies the splitting variables
  • +
  • .fun - gives the function to be called on each piece
  • +
+

Now we can quickly calculate the mean GDP per continent:

+
+

R +

+
+ddply(
+ .data = calcGDP(gapminder),
+ .variables = "continent",
+ .fun = function(x) mean(x$gdp)
+)
+
+
+

OUTPUT +

+
  continent           V1
+1    Africa  20904782844
+2  Americas 379262350210
+3      Asia 227233738153
+4    Europe 269442085301
+5   Oceania 188187105354
+
+

Let us walk through the previous code:

+
    +
  • The ddply function feeds in a data.frame +(function starts with d) and returns another +data.frame (2nd letter is a d)
  • +
  • the first argument we gave was the data.frame we wanted to operate +on: in this case the gapminder data. We called calcGDP on +it first so that it would have the additional gdp column +added to it.
  • +
  • The second argument indicated our split criteria: in this case the +“continent” column. Note that we gave the name of the column, not the +values of the column like we had done previously with subsetting. Plyr +takes care of these implementation details for you.
  • +
  • The third argument is the function we want to apply to each grouping +of the data. We had to define our own short function here: each subset +of the data gets stored in x, the first argument of our +function. This is an anonymous function: we haven’t defined it +elsewhere, and it has no name. It only exists in the scope of our call +to ddply.
  • +
+
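The anonymous function is a convenience, not a requirement. As a small sketch on made-up data (the `toy` data frame and the helper name `mean_gdp` are hypothetical), a named function gives the same result:

```r
library(plyr)

# Hypothetical toy data with made-up values
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia"),
  gdp = c(10, 20, 40)
)

# A named function that receives one group (a data.frame) at a time
mean_gdp <- function(x) mean(x$gdp)

by_named <- ddply(.data = toy, .variables = "continent", .fun = mean_gdp)
by_anon <- ddply(.data = toy, .variables = "continent",
                 .fun = function(x) mean(x$gdp))

# Both calls return a data.frame with the columns continent and V1
print(by_named)
```

Naming the function is worth it once the body grows beyond a line or two, or when the same calculation is needed in several calls.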
+
+ +
+
+

Challenge 1 +

+
+

Calculate the average life expectancy per continent. Which has the +longest? Which has the shortest?

+
+
+
+
+
+ +
+
+
+

R +

+
+ddply(
+ .data = gapminder,
+ .variables = "continent",
+ .fun = function(x) mean(x$lifeExp)
+)
+
+

Oceania has the longest and Africa the shortest.

+
+
+
+
+

What if we want a different type of output data structure?

+
+

R +

+
+dlply(
+ .data = calcGDP(gapminder),
+ .variables = "continent",
+ .fun = function(x) mean(x$gdp)
+)
+
+
+

OUTPUT +

+
$Africa
+[1] 20904782844
+
+$Americas
+[1] 379262350210
+
+$Asia
+[1] 227233738153
+
+$Europe
+[1] 269442085301
+
+$Oceania
+[1] 188187105354
+
+attr(,"split_type")
+[1] "data.frame"
+attr(,"split_labels")
+  continent
+1    Africa
+2  Americas
+3      Asia
+4    Europe
+5   Oceania
+
+

We called the same function again, but changed the second letter to +an l, so the output was returned as a list.
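A list result is handy when you want to pull out one group by name. The following sketch uses a made-up `toy` data frame rather than gapminder:

```r
library(plyr)

# Hypothetical toy data with made-up values
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia"),
  gdp = c(10, 20, 40)
)

means <- dlply(.data = toy, .variables = "continent",
               .fun = function(x) mean(x$gdp))

# The list elements are named after the groups
print(names(means))

# Extract the value for a single group
asia_mean <- means[["Asia"]]
print(asia_mean)
```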

+

We can specify multiple columns to group by:

+
+

R +

+
+ddply(
+ .data = calcGDP(gapminder),
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$gdp)
+)
+
+
+

OUTPUT +

+
   continent year           V1
+1     Africa 1952   5992294608
+2     Africa 1957   7359188796
+3     Africa 1962   8784876958
+4     Africa 1967  11443994101
+5     Africa 1972  15072241974
+6     Africa 1977  18694898732
+7     Africa 1982  22040401045
+8     Africa 1987  24107264108
+9     Africa 1992  26256977719
+10    Africa 1997  30023173824
+11    Africa 2002  35303511424
+12    Africa 2007  45778570846
+13  Americas 1952 117738997171
+14  Americas 1957 140817061264
+15  Americas 1962 169153069442
+16  Americas 1967 217867530844
+17  Americas 1972 268159178814
+18  Americas 1977 324085389022
+19  Americas 1982 363314008350
+20  Americas 1987 439447790357
+21  Americas 1992 489899820623
+22  Americas 1997 582693307146
+23  Americas 2002 661248623419
+24  Americas 2007 776723426068
+25      Asia 1952  34095762661
+26      Asia 1957  47267432088
+27      Asia 1962  60136869012
+28      Asia 1967  84648519224
+29      Asia 1972 124385747313
+30      Asia 1977 159802590186
+31      Asia 1982 194429049919
+32      Asia 1987 241784763369
+33      Asia 1992 307100497486
+34      Asia 1997 387597655323
+35      Asia 2002 458042336179
+36      Asia 2007 627513635079
+37    Europe 1952  84971341466
+38    Europe 1957 109989505140
+39    Europe 1962 138984693095
+40    Europe 1967 173366641137
+41    Europe 1972 218691462733
+42    Europe 1977 255367522034
+43    Europe 1982 279484077072
+44    Europe 1987 316507473546
+45    Europe 1992 342703247405
+46    Europe 1997 383606933833
+47    Europe 2002 436448815097
+48    Europe 2007 493183311052
+49   Oceania 1952  54157223944
+50   Oceania 1957  66826828013
+51   Oceania 1962  82336453245
+52   Oceania 1967 105958863585
+53   Oceania 1972 134112109227
+54   Oceania 1977 154707711162
+55   Oceania 1982 176177151380
+56   Oceania 1987 209451563998
+57   Oceania 1992 236319179826
+58   Oceania 1997 289304255183
+59   Oceania 2002 345236880176
+60   Oceania 2007 403657044512
+
+
+

R +

+
+daply(
+ .data = calcGDP(gapminder),
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$gdp)
+)
+
+
+

OUTPUT +

+
          year
+continent          1952         1957         1962         1967         1972
+  Africa     5992294608   7359188796   8784876958  11443994101  15072241974
+  Americas 117738997171 140817061264 169153069442 217867530844 268159178814
+  Asia      34095762661  47267432088  60136869012  84648519224 124385747313
+  Europe    84971341466 109989505140 138984693095 173366641137 218691462733
+  Oceania   54157223944  66826828013  82336453245 105958863585 134112109227
+          year
+continent          1977         1982         1987         1992         1997
+  Africa    18694898732  22040401045  24107264108  26256977719  30023173824
+  Americas 324085389022 363314008350 439447790357 489899820623 582693307146
+  Asia     159802590186 194429049919 241784763369 307100497486 387597655323
+  Europe   255367522034 279484077072 316507473546 342703247405 383606933833
+  Oceania  154707711162 176177151380 209451563998 236319179826 289304255183
+          year
+continent          2002         2007
+  Africa    35303511424  45778570846
+  Americas 661248623419 776723426068
+  Asia     458042336179 627513635079
+  Europe   436448815097 493183311052
+  Oceania  345236880176 403657044512
+
+

You can use these functions in place of for loops (and +it is usually faster to do so). To replace a for loop, put the code that +was in the body of the for loop inside an anonymous +function.

+
+

R +

+
+d_ply(
+  .data=gapminder,
+  .variables = "continent",
+  .fun = function(x) {
+    meanGDPperCap <- mean(x$gdpPercap)
+    print(paste(
+      "The mean GDP per capita for", unique(x$continent),
+      "is", format(meanGDPperCap, big.mark=",")
+   ))
+  }
+)
+
+
+

OUTPUT +

+
[1] "The mean GDP per capita for Africa is 2,193.755"
+[1] "The mean GDP per capita for Americas is 7,136.11"
+[1] "The mean GDP per capita for Asia is 7,902.15"
+[1] "The mean GDP per capita for Europe is 14,469.48"
+[1] "The mean GDP per capita for Oceania is 18,621.61"
+
+
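For comparison, here is a sketch of the kind of `for` loop that the `d_ply` call above replaces. It runs on a made-up `toy` data frame, and the `messages` vector is only there so the results can be inspected afterwards.

```r
# Hypothetical toy data with made-up values
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia"),
  gdpPercap = c(1000, 2000, 3000)
)

messages <- character(0)
for (cont in unique(toy$continent)) {
  # Manual split: keep only the rows for the current continent
  x <- toy[toy$continent == cont, ]
  meanGDPperCap <- mean(x$gdpPercap)
  msg <- paste(
    "The mean GDP per capita for", cont,
    "is", format(meanGDPperCap, big.mark = ",")
  )
  messages <- c(messages, msg)
  print(msg)
}
```

The body of the loop is the same code as the anonymous function; what plyr takes over is the subsetting and the iteration over groups.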
+
+ +
+
+

Tip: printing numbers +

+
+

The format function can be used to make numeric values +“pretty” for printing out in messages.
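Two small examples of what `format` can do, using arbitrary numbers chosen for illustration:

```r
# Insert a thousands separator
with_commas <- format(1234567, big.mark = ",")
print(with_commas)

# Limit the number of significant digits
rounded <- format(0.123456789, digits = 3)
print(rounded)
```

Note that `format` returns a character string, so the result is meant for display and not for further arithmetic.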

+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

Calculate the average life expectancy per continent and year. Which +had the longest and shortest in 2007? Which had the greatest change +between 1952 and 2007?

+
+
+
+
+
+ +
+
+
+

R +

+
+solution <- ddply(
+ .data = gapminder,
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$lifeExp)
+)
+solution_2007 <- solution[solution$year == 2007, ]
+solution_2007
+
+

Oceania had the longest average life expectancy in 2007 and Africa +the lowest.

+
+

R +

+
+solution_1952_2007 <- cbind(solution[solution$year == 1952, ], solution_2007)
+difference_1952_2007 <- data.frame(continent = solution_1952_2007$continent,
+                                   year_1952 = solution_1952_2007[[3]],
+                                   year_2007 = solution_1952_2007[[6]],
+                                   difference = solution_1952_2007[[6]] - solution_1952_2007[[3]])
+difference_1952_2007
+
+

Asia had the greatest difference, and Oceania the least.

+
+
+
+
+
+
+ +
+
+

Alternate Challenge +

+
+

Without running them, which of the following will calculate the +average life expectancy per continent:

+
  1. +
+
+

R +

+
+ddply(
+  .data = gapminder,
+  .variables = gapminder$continent,
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)
+
+
  2. +
+
+

R +

+
+ddply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = mean(dataGroup$lifeExp)
+)
+
+
  3. +
+
+

R +

+
+ddply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)
+
+
  4. +
+
+

R +

+
+adply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)
+
+
+
+
+
+
+ +
+
+

Answer 3 will calculate the average life expectancy per +continent.

+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
    +
  • Use the plyr package to split data, apply functions to +subsets, and combine the results.
  • +
+
+
+
+

Content from Data Frame Manipulation with dplyr

+
+

Last updated on 2023-10-26

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I manipulate data frames without repeating myself?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • To be able to use the six main data frame manipulation ‘verbs’ with +pipes in dplyr.
  • +
  • To understand how group_by() and +summarize() can be combined to summarize datasets.
  • +
  • Be able to analyze a subset of data using logical filtering.
  • +
+
+
+
+
+
+

Manipulation of data frames means many things to many researchers: we +often select certain observations (rows) or variables (columns), we +often group the data by a certain variable(s), or we even calculate +summary statistics. We can do these operations using the normal base R +operations:

+
+

R +

+
+mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
+
+
+

OUTPUT +

+
[1] 2193.755
+
+
+

R +

+
+mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])
+
+
+

OUTPUT +

+
[1] 7136.11
+
+
+

R +

+
+mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])
+
+
+

OUTPUT +

+
[1] 7902.15
+
+

But this isn’t very nice because there is a fair bit of +repetition. Repeating yourself will cost you time, both now and later, +and potentially introduce some nasty bugs.

+

The dplyr package +

+
+

Luckily, the dplyr +package provides a number of very useful functions for manipulating data +frames in a way that will reduce the above repetition, reduce the +probability of making errors, and probably even save you some typing. As +an added bonus, you might even find the dplyr grammar +easier to read.

+
+
+ +
+
+

Tip: Tidyverse +

+
+

dplyr package belongs to a broader family of opinionated +R packages designed for data science called the “Tidyverse”. These +packages are specifically designed to work harmoniously together. Some +of these packages will be covered along this course, but you can find +more complete information here: https://www.tidyverse.org/.

+
+
+
+

Here we’re going to cover 5 of the most commonly used functions as +well as using pipes (%>%) to combine them.

+
    +
  1. select()
  2. filter()
  3. group_by()
  4. summarize()
  5. mutate()
+

If you have not installed this package earlier, please do +so:

+
+

R +

+
+install.packages('dplyr')
+
+

Now let’s load the package:

+
+

R +

+
+library("dplyr")
+
+

Using select() +

+
+

If, for example, we wanted to move forward with only a few of the +variables in our data frame we could use the select() +function. This will keep only the variables you select.

+
+

R +

+
+year_country_gdp <- select(gapminder, year, country, gdpPercap)
+
+

Diagram illustrating use of select function to select two columns of a data frame +To remove just one column from the gapminder +data, for example the continent column, put a minus sign in front of its name:

+
+

R +

+
+smaller_gapminder_data <- select(gapminder, -continent)
+
+

If we open up year_country_gdp we’ll see that it only +contains the year, country and gdpPercap. Above we used ‘normal’ +grammar, but the strengths of dplyr lie in combining +several functions using pipes. Since the pipes grammar is unlike +anything we’ve seen in R before, let’s repeat what we’ve done above +using pipes.

+
+

R +

+
+year_country_gdp <- gapminder %>% select(year, country, gdpPercap)
+
+

To help you understand why we wrote that in that way, let’s walk +through it step by step. First we summon the gapminder data frame and +pass it on, using the pipe symbol %>%, to the next step, +which is the select() function. In this case we don’t +specify which data object we use in the select() function +since it gets that from the previous pipe. Fun Fact: +There is a good chance you have encountered pipes before in the shell. +In R, a pipe symbol is %>% while in the shell it is +| but the concept is the same!
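To make the role of the pipe concrete, the sketch below compares a piped call with the equivalent ordinary call on a made-up `toy` data frame (not the gapminder data).

```r
library(dplyr)

# Hypothetical toy data with made-up values
toy <- data.frame(
  year = c(1952, 1957),
  country = c("A", "B"),
  pop = c(1, 2)
)

# The pipe passes `toy` in as the first argument of select()
piped <- toy %>% select(year, country)

# The same call written without a pipe
nested <- select(toy, year, country)

print(identical(piped, nested))
```

With a single step the two forms are equally readable; the pipe pays off when several steps are chained, because the code then reads from top to bottom in the order the operations happen.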

+
+
+ +
+
+

Tip: Renaming data frame columns in dplyr +

+
+

In Chapter 4 we covered how you can rename columns with base R by +assigning a value to the output of the names() function. +Like selecting columns in base R, this is a bit cumbersome, but thankfully dplyr has a +rename() function.

+

Within a pipeline, the syntax is +rename(new_name = old_name). For example, we may want to +rename the gdpPercap column name from our select() +statement above.

+
+

R +

+
+tidy_gdp <- year_country_gdp %>% rename(gdp_per_capita = gdpPercap)
+
+head(tidy_gdp)
+
+
+

OUTPUT +

+
  year     country gdp_per_capita
+1 1952 Afghanistan       779.4453
+2 1957 Afghanistan       820.8530
+3 1962 Afghanistan       853.1007
+4 1967 Afghanistan       836.1971
+5 1972 Afghanistan       739.9811
+6 1977 Afghanistan       786.1134
+
+
+
+
+

Using filter() +

+
+

If we now want to move forward with the above, but only with European +countries, we can combine select and +filter:

+
+

R +

+
+year_country_gdp_euro <- gapminder %>%
+    filter(continent == "Europe") %>%
+    select(year, country, gdpPercap)
+
+

If we now want to show life expectancy of European countries but only +for a specific year (e.g., 2007), we can do as below.

+
+

R +

+
+europe_lifeExp_2007 <- gapminder %>%
+  filter(continent == "Europe", year == 2007) %>%
+  select(country, lifeExp)
+
+
+
+ +
+
+

Challenge 1 +

+
+

Write a single command (which can span multiple lines and includes +pipes) that will produce a data frame that has the African values for +lifeExp, country and year, but +not for other continents. How many rows does your data frame have and +why?

+
+
+
+
+
+ +
+
+
+

R +

+
+year_country_lifeExp_Africa <- gapminder %>%
+                           filter(continent == "Africa") %>%
+                           select(year, country, lifeExp)
+
+
+
+
+
+

As with last time, first we pass the gapminder data frame to the +filter() function, then we pass the filtered version of the +gapminder data frame to the select() function. +Note: The order of operations is very important in this +case. If we used ‘select’ first, filter would not be able to find the +variable continent since we would have removed it in the previous +step.
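The effect of the ordering can be checked on a small made-up data frame. In this sketch, `toy` and its values are hypothetical, and `tryCatch()` is used only to record that the second pipeline raises an error instead of stopping the script.

```r
library(dplyr)

# Hypothetical toy data with made-up values
toy <- data.frame(
  continent = c("Africa", "Europe"),
  country = c("Kenya", "Spain"),
  lifeExp = c(54, 80)
)

# filter() first: continent is still available, so this works
filter_first <- toy %>%
  filter(continent == "Africa") %>%
  select(country, lifeExp)

# select() first: continent has been dropped, so filter() fails
select_first_failed <- tryCatch(
  {
    toy %>%
      select(country, lifeExp) %>%
      filter(continent == "Africa")
    FALSE
  },
  error = function(e) TRUE
)

print(filter_first)
print(select_first_failed)
```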

+

Using group_by() +

+
+

Now, we were supposed to be reducing the error-prone repetitiveness +of what can be done with base R, but up to now we haven’t done that +since we would have to repeat the above for each continent. Instead of +filter(), which will only pass observations that meet your +criteria (in the above: continent=="Europe"), we can use +group_by(), which will essentially use every unique +value that you could have used as a criterion in filter.

+
+

R +

+
+str(gapminder)
+
+
+

OUTPUT +

+
'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...
+
+
+

R +

+
+str(gapminder %>% group_by(continent))
+
+
+

OUTPUT +

+
gropd_df [1,704 × 6] (S3: grouped_df/tbl_df/tbl/data.frame)
+ $ country  : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num [1:1704] 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr [1:1704] "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
+ - attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
+  ..$ continent: chr [1:5] "Africa" "Americas" "Asia" "Europe" ...
+  ..$ .rows    : list<int> [1:5] 
+  .. ..$ : int [1:624] 25 26 27 28 29 30 31 32 33 34 ...
+  .. ..$ : int [1:300] 49 50 51 52 53 54 55 56 57 58 ...
+  .. ..$ : int [1:396] 1 2 3 4 5 6 7 8 9 10 ...
+  .. ..$ : int [1:360] 13 14 15 16 17 18 19 20 21 22 ...
+  .. ..$ : int [1:24] 61 62 63 64 65 66 67 68 69 70 ...
+  .. ..@ ptype: int(0) 
+  ..- attr(*, ".drop")= logi TRUE
+
+

You will notice that the structure of the data frame where we used +group_by() (grouped_df) is not the same as the +original gapminder (data.frame). A +grouped_df can be thought of as a list where +each item in the list is a data.frame which +contains only the rows that correspond to a particular value of +continent (at least in the example above).

+
Diagram illustrating how the group by function organizes a data frame into groups
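The grouping information can also be inspected with a few dplyr helper functions instead of reading the full `str()` output. The sketch below uses a made-up `toy` data frame.

```r
library(dplyr)

# Hypothetical toy data with made-up values
toy <- data.frame(
  continent = c("Asia", "Africa", "Asia"),
  pop = c(1, 2, 3)
)

grouped <- toy %>% group_by(continent)

# How many groups are there?
number_of_groups <- n_groups(grouped)
print(number_of_groups)

# Which values define the groups?
keys <- group_keys(grouped)
print(keys)

# The rows themselves are unchanged; only grouping metadata is added
print(nrow(grouped))
```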

Using summarize() +

+
+

The above was a bit on the uneventful side but +group_by() is much more exciting in conjunction with +summarize(). This will allow us to create new variable(s) +by using functions that repeat for each of the continent-specific data +frames. That is to say, using the group_by() function, we +split our original data frame into multiple pieces, then we can run +functions (e.g. mean() or sd()) within +summarize().

+
+

R +

+
+gdp_bycontinents <- gapminder %>%
+    group_by(continent) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap))
+
+
Diagram illustrating the use of group by and summarize together to create a new variable
+

OUTPUT +

+
continent mean_gdpPercap
+     <fctr>          <dbl>
+1    Africa       2193.755
+2  Americas       7136.110
+3      Asia       7902.150
+4    Europe      14469.476
+5   Oceania      18621.609
+
+

That allowed us to calculate the mean gdpPercap for each continent, +but it gets even better.

+
+
+ +
+
+

Challenge 2 +

+
+

Calculate the average life expectancy per country. Which has the +longest average life expectancy and which has the shortest average life +expectancy?

+
+
+
+
+
+ +
+
+
+

R +

+
+lifeExp_bycountry <- gapminder %>%
+   group_by(country) %>%
+   summarize(mean_lifeExp = mean(lifeExp))
+lifeExp_bycountry %>%
+   filter(mean_lifeExp == min(mean_lifeExp) | mean_lifeExp == max(mean_lifeExp))
+
+
+

OUTPUT +

+
# A tibble: 2 × 2
+  country      mean_lifeExp
+  <chr>               <dbl>
+1 Iceland              76.5
+2 Sierra Leone         36.8
+
+

Another way to do this is to use the dplyr function +arrange(), which arranges the rows in a data frame +according to the order of one or more variables from the data frame. It +has similar syntax to other functions from the dplyr +package. You can use desc() inside arrange() +to sort in descending order.

+
+

R +

+
+lifeExp_bycountry %>%
+   arrange(mean_lifeExp) %>%
+   head(1)
+
+
+

OUTPUT +

+
# A tibble: 1 × 2
+  country      mean_lifeExp
+  <chr>               <dbl>
+1 Sierra Leone         36.8
+
+
+

R +

+
+lifeExp_bycountry %>%
+   arrange(desc(mean_lifeExp)) %>%
+   head(1)
+
+
+

OUTPUT +

+
# A tibble: 1 × 2
+  country mean_lifeExp
+  <chr>          <dbl>
+1 Iceland         76.5
+
+

Alphabetical order works too:

+
+

R +

+
+lifeExp_bycountry %>%
+   arrange(desc(country)) %>%
+   head(1)
+
+
+

OUTPUT +

+
# A tibble: 1 × 2
+  country  mean_lifeExp
+  <chr>           <dbl>
+1 Zimbabwe         52.7
+
+
+
+
+
+

The function group_by() allows us to group by multiple +variables. Let’s group by year and +continent.

+
+

R +

+
+gdp_bycontinents_byyear <- gapminder %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
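The message above is informational: after summarizing by two variables, the result is still grouped by the first one. As a sketch on made-up data (the `toy` data frame is hypothetical), the `.groups` argument mentioned in the message can be used to return an ungrouped result:

```r
library(dplyr)

# Hypothetical toy data with made-up values
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  year = c(1952, 1952, 1952, 2007),
  gdpPercap = c(1, 3, 5, 7)
)

# Default behaviour: the result stays grouped by continent
still_grouped <- toy %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap))

# .groups = "drop" removes all grouping from the result
ungrouped <- toy %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap), .groups = "drop")

print(group_vars(still_grouped))
print(is_grouped_df(ungrouped))
```

Setting `.groups` explicitly also silences the message, which keeps the output of longer scripts tidy.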

That is already quite powerful, but it gets even better! You’re not +limited to defining 1 new variable in summarize().

+
+

R +

+
+gdp_pop_bycontinents_byyear <- gapminder %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+

count() and n() +

+
+

A very common operation is to count the number of observations for +each group. The dplyr package comes with two related +functions that help with this.

+

For instance, if we wanted to check the number of countries included +in the dataset for the year 2002, we can use the count() +function. It takes the name of one or more columns that contain the +groups we are interested in, and we can optionally sort the results in +descending order by adding sort=TRUE:

+
+

R +

+
+gapminder %>%
+    filter(year == 2002) %>%
+    count(continent, sort = TRUE)
+
+
+

OUTPUT +

+
  continent  n
+1    Africa 52
+2      Asia 33
+3    Europe 30
+4  Americas 25
+5   Oceania  2
+
+
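The relationship between the two functions can be seen by counting groups both ways. This sketch uses a made-up `toy` data frame; the variable names are chosen for illustration.

```r
library(dplyr)

# Hypothetical toy data with made-up values
toy <- data.frame(
  continent = c("Africa", "Asia", "Africa", "Asia", "Asia"),
  lifeExp = c(50, 60, 55, 65, 70)
)

# count() does the grouping and counting in one step
via_count <- toy %>% count(continent)

# The same counts written with group_by() and n()
via_n <- toy %>%
  group_by(continent) %>%
  summarize(n = n())

print(via_count)
print(via_n)
```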

If we need to use the number of observations in calculations, the +n() function is useful. It will return the total number of +observations in the current group rather than counting the number of +observations in each group within a specific column. For instance, if we +wanted to get the standard error of the life expectancy per +continent:

+
+

R +

+
+gapminder %>%
+    group_by(continent) %>%
+    summarize(se_le = sd(lifeExp)/sqrt(n()))
+
+
+

OUTPUT +

+
# A tibble: 5 × 2
+  continent se_le
+  <chr>     <dbl>
+1 Africa    0.366
+2 Americas  0.540
+3 Asia      0.596
+4 Europe    0.286
+5 Oceania   0.775
+
+

You can also chain together several summary operations; in this case +calculating the minimum, maximum, +mean and se of each continent’s per-country +life-expectancy:

+
+

R +

+
+gapminder %>%
+    group_by(continent) %>%
+    summarize(
+      mean_le = mean(lifeExp),
+      min_le = min(lifeExp),
+      max_le = max(lifeExp),
+      se_le = sd(lifeExp)/sqrt(n()))
+
+
+

OUTPUT +

+
# A tibble: 5 × 5
+  continent mean_le min_le max_le se_le
+  <chr>       <dbl>  <dbl>  <dbl> <dbl>
+1 Africa       48.9   23.6   76.4 0.366
+2 Americas     64.7   37.6   80.7 0.540
+3 Asia         60.1   28.8   82.6 0.596
+4 Europe       71.9   43.6   81.8 0.286
+5 Oceania      74.3   69.1   81.2 0.775
+
+

Using mutate() +

+
+

We can also create new variables prior to (or even after) summarizing +information using mutate().

+
+

R +

+
+gdp_pop_bycontinents_byyear <- gapminder %>%
+    mutate(gdp_billion = gdpPercap*pop/10^9) %>%
+    group_by(continent,year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop),
+              mean_gdp_billion = mean(gdp_billion),
+              sd_gdp_billion = sd(gdp_billion))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+

Connect mutate with logical filtering: ifelse +

+
+

When creating new variables, we can combine this with a logical +condition. A simple combination of mutate() and +ifelse() applies the filtering right where it is needed: at +the moment of creating something new. This easy-to-read statement is a +fast and powerful way of discarding certain data (even though the +overall dimension of the data frame will not change) or of updating +values depending on the given condition.

+
+

R +

+
+## keeping all data but "filtering" on a certain condition
+# calculate GDP only for countries with a life expectancy above 25
+gdp_pop_bycontinents_byyear_above25 <- gapminder %>%
+    mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop),
+              mean_gdp_billion = mean(gdp_billion),
+              sd_gdp_billion = sd(gdp_billion))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
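One consequence of replacing values with `NA` is worth keeping in mind: summary functions such as `mean()` return `NA` when any input is missing, unless told otherwise. The sketch below shows this in base R with made-up numbers.

```r
# Hypothetical values for three observations
lifeExp <- c(24, 30, 50)
gdp <- c(1, 2, 3)

# Same pattern as in the mutate() call: keep gdp only where lifeExp > 25
gdp_filtered <- ifelse(lifeExp > 25, gdp, NA)
print(gdp_filtered)

# A single NA makes the mean NA
mean_with_na <- mean(gdp_filtered)
print(mean_with_na)

# na.rm = TRUE drops the missing values before averaging
mean_without_na <- mean(gdp_filtered, na.rm = TRUE)
print(mean_without_na)
```

Whether to pass `na.rm = TRUE` inside `summarize()` depends on the analysis; the point here is only that the choice has to be made deliberately.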
+

R +

+
+## updating only if a certain condition is fulfilled
+# for life expectancies above 40 years, the GDP to be expected in the future is scaled
+gdp_future_bycontinents_byyear_high_lifeExp <- gapminder %>%
+    mutate(gdp_futureExpectation = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              mean_gdpPercap_expected = mean(gdp_futureExpectation))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+

Combining dplyr and ggplot2 +

+
+

First install and load ggplot2:

+
+

R +

+
+install.packages('ggplot2')
+
+
+

R +

+
+library("ggplot2")
+
+

In the plotting lesson we looked at how to make a multi-panel figure +by adding a layer of facet panels using ggplot2. Here is +the code we used (with some extra comments):

+
+

R +

+
+# Filter countries located in the Americas
+americas <- gapminder[gapminder$continent == "Americas", ]
+# Make the plot
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))
+
+

This code makes the right plot but it also creates an intermediate +variable (americas) that we might not have any other uses +for. Just as we used %>% to pipe data along a chain of +dplyr functions we can use it to pass data to +ggplot(). Because %>% replaces the first +argument in a function we don’t need to specify the data = +argument in the ggplot() function. By combining +dplyr and ggplot2 functions we can make the +same figure without creating any new variables or modifying the +data.

+
+

R +

+
+gapminder %>%
+  # Filter countries located in the Americas
+  filter(continent == "Americas") %>%
+  # Make the plot
+  ggplot(mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))
+
+

More examples of using the function mutate() and the +ggplot2 package.

+
+

R +

+
+gapminder %>%
+  # extract first letter of country name into new column
+  mutate(startsWith = substr(country, 1, 1)) %>%
+  # only keep countries starting with A or Z
+  filter(startsWith %in% c("A", "Z")) %>%
+  # plot lifeExp into facets
+  ggplot(aes(x = year, y = lifeExp, colour = continent)) +
+  geom_line() +
+  facet_wrap(vars(country)) +
+  theme_minimal()
+
+
+
+ +
+
+

Advanced Challenge +

+
+

Calculate the average life expectancy in 2002 of 2 randomly selected +countries for each continent. Then arrange the continent names in +reverse order. Hint: Use the dplyr +functions arrange() and sample_n(), they have +similar syntax to other dplyr functions.

+
+
+
+
+
+ +
+
+
+

R +

+
+lifeExp_2countries_bycontinents <- gapminder %>%
+   filter(year==2002) %>%
+   group_by(continent) %>%
+   sample_n(2) %>%
+   summarize(mean_lifeExp=mean(lifeExp)) %>%
+   arrange(desc(continent))
+
+
+
+
+
+

Other great resources +

+
+ +
+
+ +
+
+

Keypoints +

+
+
    +
  • Use the dplyr package to manipulate data frames.
  • +
  • Use select() to choose variables from a data +frame.
  • +
  • Use filter() to choose data based on values.
  • +
  • Use group_by() and summarize() to work +with subsets of data.
  • +
  • Use mutate() to create new variables.
  • +
+
+
+
+

Content from Data Frame Manipulation with tidyr

+
+

Last updated on 2023-10-26

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I change the layout of a data frame?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • To understand the concepts of ‘longer’ and ‘wider’ data frame +formats and be able to convert between them with +tidyr.
  • +
+
+
+
+
+
+

Researchers often want to reshape their data frames from ‘wide’ to +‘longer’ layouts, or vice-versa. The ‘long’ layout or format is +where:

+
    +
  • each column is a variable
  • +
  • each row is an observation
  • +
+

In the purely ‘long’ (or ‘longest’) format, you usually have 1 column +for the observed variable and the other columns are ID variables.

+

For the ‘wide’ format each row is often a site/subject/patient and +you have multiple observation variables containing the same type of +data. These can be either repeated observations over time, or +observation of multiple variables (or a mix of both). You may find that data +input is simpler in the ‘wide’ format, or that some other applications prefer +it. However, many of R’s functions have been designed +assuming you have ‘longer’ formatted data. This tutorial will help you +efficiently transform your data shape regardless of original format.

+
Diagram illustrating the difference between a wide versus long layout of a data frame

Long and wide data frame layouts mainly affect readability. For +humans, the wide format is often more intuitive since we can often see +more of the data on the screen due to its shape. However, the long +format is more machine readable and is closer to the formatting of +databases. The ID variables in our data frames are similar to the fields +in a database and observed variables are like the database values.
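To make the two layouts concrete, here is the same made-up information written by hand in both shapes using base R. The data frames `wide` and `long` and their values are hypothetical.

```r
# Wide layout: one row per country, one column per year
wide <- data.frame(
  country = c("A", "B"),
  pop_1952 = c(1, 2),
  pop_2007 = c(3, 4)
)

# Long layout: one row per country and year, a single pop column
long <- data.frame(
  country = rep(c("A", "B"), each = 2),
  year = rep(c(1952, 2007), times = 2),
  pop = c(1, 3, 2, 4)
)

print(wide)
print(long)
```

Both data frames hold the same four population values; in the long layout, `country` and `year` act as ID variables and `pop` is the single observed variable.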

+

Getting started +

+
+

First install the packages if you haven’t already done so (you +probably installed dplyr in the previous lesson):

+
+

R +

+
+#install.packages("tidyr")
+#install.packages("dplyr")
+
+

Load the packages

+
+

R +

+
+library("tidyr")
+library("dplyr")
+
+

First, let’s look at the structure of our original gapminder data +frame:

+
+

R +

+
+str(gapminder)
+
+
+

OUTPUT +

+
'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...
+
+
+
+ +
+
+

Challenge 1 +

+
+

Is gapminder a purely long, purely wide, or some intermediate +format?

+
+
+
+
+
+ +
+
+

The original gapminder data.frame is in an intermediate format. It is +not purely long since it has multiple observation variables +(pop,lifeExp,gdpPercap).

+
+
+
+
+

Sometimes, as with the gapminder dataset, we have multiple types of +observed data. It is somewhere in between the purely ‘long’ and ‘wide’ +data formats. We have 3 “ID variables” (continent, +country, year) and 3 “Observation variables” +(pop,lifeExp,gdpPercap). This +intermediate format can be preferred despite not having ALL observations +in 1 column given that all 3 observation variables have different units. +There are few operations that would need us to make this data frame any +longer (i.e. 4 ID variables and 1 Observation variable).

+

Many functions in R are vector based, and you usually do not want to do mathematical operations on values with different units. For example, in the purely long format, a single mean over all of the values of population, life expectancy, and GDP would not be meaningful, since it would combine values with 3 incompatible units. The solution is to first manipulate the data, either by grouping (see the lesson on dplyr) or by changing the structure of the data frame. Note: Some plotting functions in R actually work better with wide-format data.
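To make the units problem concrete, here is a minimal sketch on a made-up long-format data frame (the `toy_long` object and its values are invented for illustration and are not part of the gapminder data). It uses base R's `tapply()` as a stand-in for the `group_by()` and `summarize()` approach from the dplyr lesson.

```r
# Hypothetical toy data in purely long format: two observation types
# with very different units share a single value column.
toy_long <- data.frame(
  obs_type   = c("pop", "pop", "lifeExp", "lifeExp"),
  obs_values = c(1000000, 3000000, 60, 80)
)

# A single mean mixes people and years, so the number has no useful meaning.
overall_mean <- mean(toy_long$obs_values)

# Grouping by observation type first gives one meaningful mean per unit.
grouped_means <- tapply(toy_long$obs_values, toy_long$obs_type, mean)

print(overall_mean)
print(grouped_means)
```

The grouped result keeps population and life expectancy apart, which is the same idea the grouping step in dplyr expresses.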

+

From wide to long format with pivot_longer() +

+
+

Until now, we’ve been using the nicely formatted original gapminder dataset, but ‘real’ data (i.e. our own research data) is rarely so well organized. Here let’s start with the wide formatted version of the gapminder dataset.

+
+

Download the wide version of the gapminder data from here and save it in your data +folder.

+
+

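We’ll load the data file and look at it. Note: we don’t want our continent and country columns to be factors, so we use the stringsAsFactors argument for read.csv() to disable that. (Since R 4.0.0, stringsAsFactors = FALSE is already the default, so the argument only matters on older versions.)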

+
+

R +

+
+gap_wide <- read.csv("data/gapminder_wide.csv", stringsAsFactors = FALSE)
+str(gap_wide)
+
+
+

OUTPUT +

+
'data.frame':	142 obs. of  38 variables:
+ $ continent     : chr  "Africa" "Africa" "Africa" "Africa" ...
+ $ country       : chr  "Algeria" "Angola" "Benin" "Botswana" ...
+ $ gdpPercap_1952: num  2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num  3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num  2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num  3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num  4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num  4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num  5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num  5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num  5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num  4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num  5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num  6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num  43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num  45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num  48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num  51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num  54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num  58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num  61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num  65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num  67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num  69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num  71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num  72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num  9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num  10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num  11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num  12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num  14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num  17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num  20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num  23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num  26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num  29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : int  31287142 10866106 7026113 1630347 12251209 7021078 15929988 4048013 8835739 614382 ...
+ $ pop_2007      : int  33333216 12420476 8078314 1639131 14326203 8390505 17696293 4369038 10238807 710960 ...
+
+
Diagram illustrating the wide format of the gapminder data frame

To change this very wide data frame layout back to our nice, intermediate (or longer) layout, we will use one of the two available pivot functions from the tidyr package. To convert from wide to a longer format, we will use the pivot_longer() function. pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns, or ‘lengthening’ your observation variables into a single variable.

+
Diagram illustrating how pivot longer reorganizes a data frame from a wide to long format
+

R +

+
+gap_long <- gap_wide %>%
+  pivot_longer(
+    cols = c(starts_with('pop'), starts_with('lifeExp'), starts_with('gdpPercap')),
+    names_to = "obstype_year", values_to = "obs_values"
+  )
+str(gap_long)
+
+
+

OUTPUT +

+
tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ continent   : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
+ $ country     : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
+ $ obstype_year: chr [1:5112] "pop_1952" "pop_1957" "pop_1962" "pop_1967" ...
+ $ obs_values  : num [1:5112] 9279525 10270856 11000948 12760499 14760787 ...
+
+

Here we have used piping syntax which is similar to what we were +doing in the previous lesson with dplyr. In fact, these are compatible +and you can use a mix of tidyr and dplyr functions by piping them +together.

+

We first provide to pivot_longer() a vector of column names that will be pivoted into longer format. We could type out all the observation variables, but as in the select() function (see dplyr lesson), we can use the starts_with() helper to select all variables that start with the desired character string. pivot_longer() also allows the alternative syntax of using the - symbol to identify which variables are not to be pivoted (i.e. ID variables).

+

The next arguments to pivot_longer() are names_to for naming the column that will contain the new ID variable (obstype_year) and values_to for naming the new amalgamated observation variable (obs_values). We supply these new column names as strings.
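As a small sketch of these arguments, the example below pivots a made-up two-country data frame (`toy_wide` and its values are assumptions for illustration only). It selects the columns once with `starts_with()` and once with the `-` syntax, and checks that both give the same result.

```r
library(tidyr)

# Hypothetical wide data: one ID column and two observation columns.
toy_wide <- data.frame(
  country  = c("A", "B"),
  pop_2002 = c(10, 20),
  pop_2007 = c(11, 22)
)

# Select the columns to pivot by their shared prefix ...
toy_long <- pivot_longer(
  toy_wide,
  cols = starts_with("pop"),
  names_to = "obstype_year", values_to = "obs_values"
)

# ... or by excluding the ID column instead.
toy_long_alt <- pivot_longer(
  toy_wide,
  cols = -country,
  names_to = "obstype_year", values_to = "obs_values"
)

print(toy_long)
print(isTRUE(all.equal(toy_long, toy_long_alt)))
```

Each of the 2 rows contributes one row per pivoted column, so the long version has 4 rows and 3 columns.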

+
Diagram illustrating the long format of the gapminder data
+

R +

+
+gap_long <- gap_wide %>%
+  pivot_longer(
+    cols = c(-continent, -country),
+    names_to = "obstype_year", values_to = "obs_values"
+  )
+str(gap_long)
+
+
+

OUTPUT +

+
tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ continent   : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
+ $ country     : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
+ $ obstype_year: chr [1:5112] "gdpPercap_1952" "gdpPercap_1957" "gdpPercap_1962" "gdpPercap_1967" ...
+ $ obs_values  : num [1:5112] 2449 3014 2551 3247 4183 ...
+
+

That may seem trivial with this particular data frame, but sometimes +you have 1 ID variable and 40 observation variables with irregular +variable names. The flexibility is a huge time saver!

+

Now obstype_year actually contains 2 pieces of information: the observation type (pop, lifeExp, or gdpPercap) and the year. We can use the separate() function to split the character strings into multiple variables.

+
+

R +

+
+gap_long <- gap_long %>% separate(obstype_year, into = c('obs_type', 'year'), sep = "_")
+gap_long$year <- as.integer(gap_long$year)
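As an aside, pivot_longer() can also split the column names while it pivots, through its `names_sep` argument, and can convert the year to an integer with `names_transform`. The sketch below shows this on a made-up data frame (`toy_wide` is an assumption for illustration); it is an alternative to the separate() step used in this lesson, not a replacement for it.

```r
library(tidyr)

# Hypothetical wide data with "<type>_<year>" column names.
toy_wide <- data.frame(
  country      = c("A", "B"),
  pop_2002     = c(10, 20),
  lifeExp_2002 = c(50, 60)
)

# Pivot and split the names at "_" in one step; year becomes an integer.
toy_long <- pivot_longer(
  toy_wide,
  cols = -country,
  names_to = c("obs_type", "year"),
  names_sep = "_",
  names_transform = list(year = as.integer),
  values_to = "obs_values"
)

str(toy_long)
```

This only works cleanly when every pivoted column name contains the separator exactly once, as the gapminder column names do.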
+
+
+
+ +
+
+

Challenge 2 +

+
+

Using gap_long, calculate the mean life expectancy, population, and gdpPercap for each continent. Hint: use the group_by() and summarize() functions we learned in the dplyr lesson.

+
+
+
+
+
+ +
+
+
+

R +

+
+gap_long %>% group_by(continent, obs_type) %>%
+   summarize(means=mean(obs_values))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
+

OUTPUT +

+
# A tibble: 15 × 3
+# Groups:   continent [5]
+   continent obs_type       means
+   <chr>     <chr>          <dbl>
+ 1 Africa    gdpPercap     2194. 
+ 2 Africa    lifeExp         48.9
+ 3 Africa    pop        9916003. 
+ 4 Americas  gdpPercap     7136. 
+ 5 Americas  lifeExp         64.7
+ 6 Americas  pop       24504795. 
+ 7 Asia      gdpPercap     7902. 
+ 8 Asia      lifeExp         60.1
+ 9 Asia      pop       77038722. 
+10 Europe    gdpPercap    14469. 
+11 Europe    lifeExp         71.9
+12 Europe    pop       17169765. 
+13 Oceania   gdpPercap    18622. 
+14 Oceania   lifeExp         74.3
+15 Oceania   pop        8874672. 
+
+
+
+
+
+

From long to intermediate format with pivot_wider() +

+
+

It is always good to check our work. So, let’s use the second pivot function, pivot_wider(), to ‘widen’ our observation variables back out. pivot_wider() is the opposite of pivot_longer(), making a dataset wider by increasing the number of columns and decreasing the number of rows. We can use pivot_wider() to pivot or reshape our gap_long to the original intermediate format or the widest format. Let’s start with the intermediate format.

+

The pivot_wider() function takes names_from +and values_from arguments.

+

To names_from we supply the column name whose contents will be pivoted into new output columns in the widened data frame. The corresponding values will be added from the column named in the values_from argument.
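As a small illustration of these two arguments, here is a sketch on a made-up long data frame (`toy_long` and its values are assumptions for illustration only). The distinct values of `obs_type` become the new column names, and the cells are filled from `obs_values`.

```r
library(tidyr)

# Hypothetical long data: one row per country and observation type.
toy_long <- data.frame(
  country    = c("A", "A", "B", "B"),
  obs_type   = c("pop", "lifeExp", "pop", "lifeExp"),
  obs_values = c(10, 50, 20, 60)
)

# Each distinct obs_type becomes a column; obs_values fills the cells.
toy_wide <- pivot_wider(toy_long, names_from = obs_type, values_from = obs_values)

print(toy_wide)
```

The result has one row per country, with the new columns in the order their names first appear in `obs_type`.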

+
+

R +

+
+gap_normal <- gap_long %>%
+  pivot_wider(names_from = obs_type, values_from = obs_values)
+dim(gap_normal)
+
+
+

OUTPUT +

+
[1] 1704    6
+
+
+

R +

+
+dim(gapminder)
+
+
+

OUTPUT +

+
[1] 1704    6
+
+
+

R +

+
+names(gap_normal)
+
+
+

OUTPUT +

+
[1] "continent" "country"   "year"      "gdpPercap" "lifeExp"   "pop"      
+
+
+

R +

+
+names(gapminder)
+
+
+

OUTPUT +

+
[1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"
+
+

Now we’ve got an intermediate data frame gap_normal with +the same dimensions as the original gapminder, but the +order of the variables is different. Let’s fix that before checking if +they are all.equal().

+
+

R +

+
+gap_normal <- gap_normal[, names(gapminder)]
+all.equal(gap_normal, gapminder)
+
+
+

OUTPUT +

+
[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                
+[3] "Component \"country\": 1704 string mismatches"                                         
+[4] "Component \"pop\": Mean relative difference: 1.634504"                                 
+[5] "Component \"continent\": 1212 string mismatches"                                       
+[6] "Component \"lifeExp\": Mean relative difference: 0.203822"                             
+[7] "Component \"gdpPercap\": Mean relative difference: 1.162302"                           
+
+
+

R +

+
+head(gap_normal)
+
+
+

OUTPUT +

+
# A tibble: 6 × 6
+  country  year      pop continent lifeExp gdpPercap
+  <chr>   <int>    <dbl> <chr>       <dbl>     <dbl>
+1 Algeria  1952  9279525 Africa       43.1     2449.
+2 Algeria  1957 10270856 Africa       45.7     3014.
+3 Algeria  1962 11000948 Africa       48.3     2551.
+4 Algeria  1967 12760499 Africa       51.4     3247.
+5 Algeria  1972 14760787 Africa       54.5     4183.
+6 Algeria  1977 17152804 Africa       58.0     4910.
+
+
+

R +

+
+head(gapminder)
+
+
+

OUTPUT +

+
      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134
+
+

We’re almost there, the original was sorted by country, +then year.

+
+

R +

+
+gap_normal <- gap_normal %>% arrange(country, year)
+all.equal(gap_normal, gapminder)
+
+
+

OUTPUT +

+
[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                
+
+

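That’s great! We’ve gone from the longest format back to the intermediate one without introducing any errors. The only differences all.equal() still reports are in the class attribute, because gap_normal is a tibble while gapminder is a plain data.frame; the values themselves match.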
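The remaining messages from all.equal() concern the class attribute only. One way to check the values alone, sketched here on a made-up data frame (`df` and `tb` are assumptions for illustration), is to convert the tibble back to a plain data frame before comparing.

```r
library(tibble)

# Hypothetical data: the same values stored as a data.frame and as a tibble.
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

# The two objects carry different class attributes ...
print(class(df))
print(class(tb))

# ... but after dropping the tibble classes the contents compare as equal.
same_values <- isTRUE(all.equal(as.data.frame(tb), df))
print(same_values)
```

Applied to this lesson, the same idea would be `all.equal(as.data.frame(gap_normal), gapminder)`.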

+

Now let’s convert the long format all the way back to the wide. In the wide format, we will keep country and continent as ID variables and pivot the observations across the 3 metrics (pop, lifeExp, gdpPercap) and time (year). First we need to create appropriate labels for all our new variables (time*metric combinations), and we also need to unify our ID variables to simplify the process of defining gap_wide.
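To make the idea of unifying columns concrete, here is a minimal sketch of unite() and its counterpart separate() on a made-up data frame (`toy` and its values are assumptions for illustration only).

```r
library(tidyr)

# Hypothetical data with two ID columns and one value column.
toy <- data.frame(
  continent = c("Africa", "Asia"),
  country   = c("Algeria", "Japan"),
  value     = c(1, 2)
)

# unite() pastes the two ID columns into one, joined by "_".
toy_united <- unite(toy, ID_var, continent, country, sep = "_")
print(toy_united)

# separate() reverses the step. This assumes that none of the
# original values contains the separator "_" itself.
toy_back <- separate(toy_united, ID_var, into = c("continent", "country"), sep = "_")
print(toy_back)
```

The separator matters: if a country or continent name contained an underscore, splitting on `_` would no longer recover the original columns.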

+
+

R +

+
+gap_temp <- gap_long %>% unite(var_ID, continent, country, sep = "_")
+str(gap_temp)
+
+
+

OUTPUT +

+
tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ var_ID    : chr [1:5112] "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" ...
+ $ obs_type  : chr [1:5112] "gdpPercap" "gdpPercap" "gdpPercap" "gdpPercap" ...
+ $ year      : int [1:5112] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ obs_values: num [1:5112] 2449 3014 2551 3247 4183 ...
+
+
+

R +

+
+gap_temp <- gap_long %>%
+    unite(ID_var, continent, country, sep = "_") %>%
+    unite(var_names, obs_type, year, sep = "_")
+str(gap_temp)
+
+
+

OUTPUT +

+
tibble [5,112 × 3] (S3: tbl_df/tbl/data.frame)
+ $ ID_var    : chr [1:5112] "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" ...
+ $ var_names : chr [1:5112] "gdpPercap_1952" "gdpPercap_1957" "gdpPercap_1962" "gdpPercap_1967" ...
+ $ obs_values: num [1:5112] 2449 3014 2551 3247 4183 ...
+
+

Using unite() we now have a single ID variable which is a combination of continent and country, and we have defined variable names. We’re now ready to pipe in pivot_wider().

+
+

R +

+
+gap_wide_new <- gap_long %>%
+  unite(ID_var, continent, country, sep = "_") %>%
+  unite(var_names, obs_type, year, sep = "_") %>%
+  pivot_wider(names_from = var_names, values_from = obs_values)
+str(gap_wide_new)
+
+
+

OUTPUT +

+
tibble [142 × 37] (S3: tbl_df/tbl/data.frame)
+ $ ID_var        : chr [1:142] "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
+ $ gdpPercap_1952: num [1:142] 2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num [1:142] 3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num [1:142] 2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num [1:142] 3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num [1:142] 4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num [1:142] 4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num [1:142] 5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num [1:142] 5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num [1:142] 5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num [1:142] 4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num [1:142] 5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num [1:142] 6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num [1:142] 43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num [1:142] 45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num [1:142] 48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num [1:142] 51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num [1:142] 54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num [1:142] 58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num [1:142] 61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num [1:142] 65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num [1:142] 67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num [1:142] 69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num [1:142] 71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num [1:142] 72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num [1:142] 9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num [1:142] 10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num [1:142] 11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num [1:142] 12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num [1:142] 14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num [1:142] 17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num [1:142] 20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num [1:142] 23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num [1:142] 26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num [1:142] 29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : num [1:142] 31287142 10866106 7026113 1630347 12251209 ...
+ $ pop_2007      : num [1:142] 33333216 12420476 8078314 1639131 14326203 ...
+
+
+
+ +
+
+

Challenge 3 +

+
+

Take this 1 step further and create a gap_ludicrously_wide format data frame by pivoting over countries, year and the 3 metrics. Hint: this new data frame should only have 5 rows.

+
+
+
+
+
+ +
+
+
+

R +

+
+gap_ludicrously_wide <- gap_long %>%
+   unite(var_names, obs_type, year, country, sep = "_") %>%
+   pivot_wider(names_from = var_names, values_from = obs_values)
+
+
+
+
+
+

Now we have a great ‘wide’ format data frame, but the ID_var could be more usable. Let’s separate it into 2 variables with separate().

+
+

R +

+
+gap_wide_betterID <- separate(gap_wide_new, ID_var, c("continent", "country"), sep="_")
+gap_wide_betterID <- gap_long %>%
+    unite(ID_var, continent, country, sep = "_") %>%
+    unite(var_names, obs_type, year, sep = "_") %>%
+    pivot_wider(names_from = var_names, values_from = obs_values) %>%
+    separate(ID_var, c("continent","country"), sep = "_")
+str(gap_wide_betterID)
+
+
+

OUTPUT +

+
tibble [142 × 38] (S3: tbl_df/tbl/data.frame)
+ $ continent     : chr [1:142] "Africa" "Africa" "Africa" "Africa" ...
+ $ country       : chr [1:142] "Algeria" "Angola" "Benin" "Botswana" ...
+ $ gdpPercap_1952: num [1:142] 2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num [1:142] 3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num [1:142] 2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num [1:142] 3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num [1:142] 4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num [1:142] 4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num [1:142] 5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num [1:142] 5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num [1:142] 5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num [1:142] 4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num [1:142] 5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num [1:142] 6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num [1:142] 43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num [1:142] 45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num [1:142] 48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num [1:142] 51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num [1:142] 54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num [1:142] 58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num [1:142] 61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num [1:142] 65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num [1:142] 67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num [1:142] 69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num [1:142] 71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num [1:142] 72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num [1:142] 9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num [1:142] 10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num [1:142] 11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num [1:142] 12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num [1:142] 14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num [1:142] 17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num [1:142] 20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num [1:142] 23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num [1:142] 26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num [1:142] 29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : num [1:142] 31287142 10866106 7026113 1630347 12251209 ...
+ $ pop_2007      : num [1:142] 33333216 12420476 8078314 1639131 14326203 ...
+
+
+

R +

+
+all.equal(gap_wide, gap_wide_betterID)
+
+
+

OUTPUT +

+
[1] "Attributes: < Component \"class\": Lengths (1, 3) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                
+
+

There and back again!

+

Other great resources +

+
+ +
+
+ +
+
+

Keypoints +

+
+
  • Use the tidyr package to change the layout of data frames.
  • Use pivot_longer() to go from wide to longer layout.
  • Use pivot_wider() to go from long to wider layout.
+
+
+
+

Content from Producing Reports With knitr

+
+

Last updated on 2023-10-26

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I integrate software and reports?
  • +
+
+
+
+
+
+
+

Objectives

+
  • Understand the value of writing reproducible reports
  • Learn how to recognise and compile the basic components of an R Markdown file
  • Become familiar with R code chunks, and understand their purpose, structure and options
  • Demonstrate the use of inline chunks for weaving R outputs into text blocks, for example when discussing the results of some calculations
  • Be aware of alternative output formats to which an R Markdown file can be exported
+
+
+
+
+
+

Data analysis reports +

+
+

Data analysts tend to write a lot of reports, describing their +analyses and results, for their collaborators or to document their work +for future reference.

+

Many new users begin by first writing a single R script containing +all of their work, and then share the analysis by emailing the script +and various graphs as attachments. But this can be cumbersome, requiring +a lengthy discussion to explain which attachment was which result.

+

Writing formal reports with Word or LaTeX can simplify this +process by incorporating both the analysis report and output graphs into +a single document. But tweaking formatting to make figures look correct +and fixing obnoxious page breaks can be tedious and lead to a lengthy +“whack-a-mole” game of fixing new mistakes resulting from a single +formatting change.

+

Creating a report as a web page (which is an html file) using R Markdown makes things easier. The report can be one long stream, so tall figures that wouldn’t ordinarily fit on one page can be kept at full size and are easier to read, since the reader can simply keep scrolling. Additionally, the formatting of an R Markdown document is simple and easy to modify, allowing you to spend more time on your analyses instead of on formatting reports.

+

Literate programming +

+
+

Ideally, such analysis reports are reproducible documents: +If an error is discovered, or if some additional subjects are added to +the data, you can just re-compile the report and get the new or +corrected results rather than having to reconstruct figures, paste them +into a Word document, and hand-edit various detailed results.

+

The key R package here is knitr. It allows you +to create a document that is a mixture of text and chunks of code. When +the document is processed by knitr, chunks of code will be +executed, and graphs or other results will be inserted into the final +document.

+

This sort of idea has been called “literate programming”.

+

knitr allows you to mix basically any type of text with +code from different programming languages, but we recommend that you use +R Markdown, which mixes Markdown with R. Markdown is a light-weight +mark-up language for creating web pages.

+

Creating an R Markdown file +

+
+

Within RStudio, click File → New File → R Markdown and you’ll get a +dialog box like this:

+
Screenshot of the New R Markdown file dialogue box in RStudio

You can stick with the default (HTML output), but give it a +title.

+

Basic components of R Markdown +

+
+

The initial chunk of text (header) contains instructions for R to +specify what kind of document will be created, and the options chosen. +You can use the header to give your document a title, author, date, and +tell it what type of output you want to produce. In this case, we’re +creating an html document.

+
---
+title: "Initial R Markdown document"
+author: "Karl Broman"
+date: "April 23, 2015"
+output: html_document
+---
+

You can delete any of those fields if you don’t want them included. +The double-quotes aren’t strictly necessary in this case. +They’re mostly needed if you want to include a colon in the title.

+

RStudio creates the document with some example text to get you +started. Note below that there are chunks like

+
+```{r}
+summary(cars)
+```
+
+

These are chunks of R code that will be executed by +knitr and replaced by their results. More on this +later.

+

Markdown +

+
+

Markdown is a system for writing web pages by marking up the text +much as you would in an email rather than writing html code. The +marked-up text gets converted to html, replacing the marks with +the proper html code.

+

For now, let’s delete all of the stuff that’s there and write a bit +of markdown.

+

You make things bold using two asterisks, like this: +**bold**, and you make things italics by using +underscores, like this: _italics_.

+

You can make a bulleted list by writing a list with hyphens or asterisks, leaving a blank line between the list and any preceding text, like this:

+
A list:
+
+* bold with double-asterisks
+* italics with underscores
+* code-type font with backticks
+

or like this:

+
A second list:
+
+- bold with double-asterisks
+- italics with underscores
+- code-type font with backticks
+

Each will appear as:

+
  • bold with double-asterisks
  • italics with underscores
  • code-type font with backticks
+

You can use whatever method you prefer, but be consistent. +This maintains the readability of your code.

+

You can make a numbered list by just using numbers. You can even use +the same number over and over if you want:

+
1. bold with double-asterisks
+1. italics with underscores
+1. code-type font with backticks
+

This will appear as:

+
  1. bold with double-asterisks
  2. italics with underscores
  3. code-type font with backticks
+

You can make section headers of different sizes by initiating a line +with some number of # symbols:

+
# Title
+## Main section
+### Sub-section
+#### Sub-sub section
+

You compile the R Markdown document to an html webpage by +clicking the “Knit” button in the upper-left.

+
+
+ +
+
+

Challenge 1 +

+
+

Create a new R Markdown document. Delete all of the R code chunks and +write a bit of Markdown (some sections, some italicized text, and an +itemized list).

+

Convert the document to a webpage.

+
+
+
+
+
+ +
+
+

In RStudio, select File > New file > R Markdown…

+

Delete the placeholder text and add the following:

+
# Introduction
+
+## Background on Data
+
+This report uses the *gapminder* dataset, which has columns that include:
+
+* country
+* continent
+* year
+* lifeExp
+* pop
+* gdpPercap
+
+## Background on Methods
+
+

Then click the ‘Knit’ button on the toolbar to generate an html +document (webpage).

+
+
+
+
+

A bit more Markdown +

+
+

You can make a hyperlink like this: +[Carpentries Home Page](https://carpentries.org/).

+

You can include an image file like this: +![The Carpentries Logo](https://carpentries.org/assets/img/TheCarpentries.svg)

+

You can do subscripts (e.g., F₂) with F~2~ and superscripts (e.g., F²) with F^2^.

+

If you know how to write equations in LaTeX, you can use +$ $ and $$ $$ to insert math equations, like +$E = mc^2$ and

+
$$y = \mu + \sum_{i=1}^p \beta_i x_i + \epsilon$$
+

You can review Markdown syntax by navigating to the “Markdown Quick +Reference” under the “Help” field in the toolbar at the top of +RStudio.

+

R code chunks +

+
+

The real power of Markdown comes from mixing markdown with chunks of code. This is R Markdown. When processed, the R code will be executed; if it produces figures, the figures will be inserted in the final document.

+

The main code chunks look like this:

+
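```{r load_data}
# File path assumed; point it at your own copy of the gapminder data.
gapminder <- read.csv("data/gapminder_data.csv")
```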

That is, you place a chunk of R code between ```{r chunk_name} and ```. You should give each chunk a unique name: names help you to locate errors and, if any graphs are produced, the file names are based on the name of the code chunk that produced them. You can create code chunks quickly in RStudio using the shortcuts Ctrl+Alt+I on Windows and Linux, or Cmd+Option+I on Mac.

+
+
+ +
+
+

Challenge 2 +

+
+

Add code chunks to:

+
  • Load the ggplot2 package
  • Read the gapminder data
  • Create a plot
+
+
+
+
+
+ +
+
+
+```{r load-ggplot2}
+library("ggplot2")
+```
+
+
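```{r read-gapminder-data}
# File path assumed; point it at your own copy of the gapminder data.
gapminder <- read.csv("data/gapminder_data.csv")
```
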
+```{r make-plot}
+plot(lifeExp ~ year, data = gapminder)
+```
+
+
+
+
+
+
+

How things get compiled +

+
+

When you press the “Knit” button, the R Markdown document is processed by knitr and a plain Markdown document is produced (as well as, potentially, a set of figure files): the R code is executed and replaced by both the input and the output; if figures are produced, links to those figures are included.

+

The Markdown and figure documents are then processed by the tool pandoc, which converts the Markdown file into an html file, with the figures embedded.
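The first of these two steps can be tried from the console. The sketch below writes a tiny R Markdown file to a temporary location (the file contents and the chunk name `add` are made up for illustration), runs `knitr::knit()` on it, and prints the plain Markdown that results. The second step is left as a comment because it needs pandoc to be installed.

```r
library(knitr)

# Three backticks, built here so that this example does not nest code fences.
fence <- strrep("`", 3)

# Write a minimal, made-up R Markdown document to a temporary file.
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(c(
  "# A tiny report",
  "",
  paste0(fence, "{r add}"),
  "1 + 1",
  fence
), rmd_file)

# Step 1: knitr runs the chunk and writes a plain Markdown file
# containing both the code and its output.
md_file <- knit(rmd_file, output = tempfile(fileext = ".md"), quiet = TRUE)
md_lines <- readLines(md_file)
cat(md_lines, sep = "\n")

# Step 2 (requires pandoc, which RStudio bundles):
# html_file <- rmarkdown::render(rmd_file, output_format = "html_document")
```

The “Knit” button performs both steps for you; running them by hand is mainly useful for seeing what the intermediate Markdown looks like.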

+

Chunk options +

+
+

There are a variety of options to affect how the code chunks are +treated. Here are some examples:

+
  • Use echo=FALSE to avoid having the code itself shown.
  • Use results="hide" to avoid having any results printed.
  • Use eval=FALSE to have the code shown but not evaluated.
  • Use warning=FALSE and message=FALSE to hide any warnings or messages produced.
  • Use fig.height and fig.width to control the size of the figures produced (in inches).
+

So you might write:

+
+```{r load_libraries, echo=FALSE, message=FALSE}
+library("dplyr")
+library("ggplot2")
+```
+
+

Often there will be particular options that you’ll want to use +repeatedly; for this, you can set global chunk options, like +so:

+
+```{r global_options, echo=FALSE}
+knitr::opts_chunk$set(fig.path="Figs/", message=FALSE, warning=FALSE,
+                      echo=FALSE, results="hide", fig.width=11)
+```
+
+

The fig.path option defines where the figures will be saved. The / here is really important; without it, the figures would be saved in the standard place but just with names that begin with Figs.
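The effect of the trailing slash is easier to see with plain string pasting. The sketch below is illustrative only: `figure_file()` is a hypothetical helper, not a knitr function, and it assumes that a figure file name is built from the fig.path value, the chunk name, and a figure counter.

```r
# Hypothetical helper that mimics how a figure file name is put together:
# the fig.path prefix, then the chunk name, then a counter and extension.
figure_file <- function(fig_path, chunk_label, number = 1, ext = "png") {
  paste0(fig_path, chunk_label, "-", number, ".", ext)
}

# With the slash, the prefix is a directory.
with_slash <- figure_file("Figs/", "make-plot")

# Without it, the prefix is simply glued onto the file name.
without_slash <- figure_file("Figs", "make-plot")

print(with_slash)
print(without_slash)
```

Because the prefix is pasted directly in front of the chunk name, a value such as `"Figs/cleaning-"` gives both a directory and a file name prefix.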

+

If you have multiple R Markdown files in a common directory, you +might want to use fig.path to define separate prefixes for +the figure file names, like fig.path="Figs/cleaning-" and +fig.path="Figs/analysis-".

+
+
+ +
+
+

Challenge 3 +

+
+

Use chunk options to control the size of a figure and to hide the +code.

+
+
+
+
+
+ +
+
+
+```{r echo = FALSE, fig.width = 3}
+plot(faithful)
+```
+
+
+
+
+
+

You can review all of the R chunk options by navigating +to the “R Markdown Cheat Sheet” under the “Cheatsheets” section of the +“Help” field in the toolbar at the top of RStudio.

+

Inline R code +

+
+

You can make every number in your report reproducible. Use `r and ` to delimit an in-line code chunk, like so: `r round(some_value, 2)`. The code will be executed and replaced with the value of the result.

+

Don’t let these in-line chunks get split across lines.

+

Perhaps precede the paragraph with a larger code chunk that does calculations and defines variables, with include=FALSE for that larger chunk (which has much the same effect as echo=FALSE and results="hide", and also keeps messages, warnings, and figures out of the document).

+

Rounding can drop trailing zeros from the printed output. You may want 2.0, but round(2.03, 1) will give just 2.

+

The myround function in the R/broman package handles this.
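If you would rather not install an extra package, base R can also keep the trailing zero by formatting the number as text. This is a minimal sketch using `format()` and `sprintf()`; it is not how myround is implemented, and `some_value` is just an example variable.

```r
some_value <- 2.03

# round() returns a number, so the trailing zero is lost when it is printed.
rounded_text <- as.character(round(some_value, 1))

# format() with nsmall keeps at least one digit after the decimal point.
formatted_text <- format(round(some_value, 1), nsmall = 1)

# sprintf() rounds and formats in a single step.
sprintf_text <- sprintf("%.1f", some_value)

print(c(rounded_text, formatted_text, sprintf_text))
```

Either of the last two expressions can be used inside an in-line chunk, since in-line code inserts whatever text the expression returns.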

+
+
+ +
+
+

Challenge 4 +

+
+

Try out a bit of in-line R code.

+
+
+
+
+
+ +
+
+

Here’s some inline code to determine that 2 + 2 = `r 2+2`. When knitted, the inline chunk is replaced by its result, 4.

+
+
+
+
+

Other output options +

+
+

You can also convert R Markdown to a PDF or a Word document. Click +the little triangle next to the “Knit” button to get a drop-down menu. +Or you could put pdf_document or word_document +in the initial header of the file.

+
+
+ +
+
+

Tip: Creating PDF documents +

+
+

Creating .pdf documents may require installation of some extra +software. The R package tinytex provides some tools to help +make this process easier for R users. With tinytex +installed, run tinytex::install_tinytex() to install the +required software (you’ll only need to do this once) and then when you +knit to pdf tinytex will automatically detect and install +any additional LaTeX packages that are needed to produce the pdf +document. Visit the tinytex +website for more information.

+
+
+
+
+
+ +
+
+

Tip: Visual markdown editing in RStudio +

+
+

RStudio versions 1.4 and later include a visual markdown editing mode. In visual editing mode, markdown expressions (like **bold words**) are transformed to the formatted appearance (bold words) as you type. This mode also includes a toolbar at the top with basic formatting buttons, similar to what you might see in common word processing programs. You can turn visual editing on and off by pressing the button in the top right corner of your R Markdown document.

+
+
+
+

Resources +

+
+ +
+
+ +
+
+

Keypoints +

+
+
    +
  • Mix reporting written in R Markdown with software written in R.
  • Specify chunk options to control formatting.
  • Use knitr to convert these documents into PDF and other formats.
+
+
+
+

Content from Writing Good Software

+
+

Last updated on 2023-10-26

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I write software that other people can use?
+
+
+
+
+
+
+

Objectives

+
    +
  • Describe best practices for writing R and explain the justification for each.
+
+
+
+
+
+

Structure your project folder +

+
+

Keep your project folder structured, organized and tidy by creating subfolders for your code files, manuals, data, binaries, output plots, etc. This can be done completely manually, with the help of RStudio’s New Project functionality, or with a designated package such as ProjectTemplate.

+
+
+ +
+
+

Tip: ProjectTemplate - a possible solution

+
+

One way to automate the management of projects is to install the third-party package ProjectTemplate. This package will set up an ideal directory structure for project management. This is very useful as it enables you to have your analysis pipeline/workflow organised and structured. Together with the default RStudio project functionality and Git, you will be able to keep track of your work and share it with collaborators.

+
    +
  1. Install ProjectTemplate.
  2. Load the library.
  3. Initialise the project:
+
+

R +

+
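+# Install the package from CRAN (only needed once per R installation)
+install.packages("ProjectTemplate")
+library("ProjectTemplate")
+# Create the project skeleton; "allow.non.conflict" permits an existing,
+# non-empty directory as long as no files would be overwritten
+create.project("../my_project_2", merge.strategy = "allow.non.conflict")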
+
+

For more information on ProjectTemplate and its functionality, visit the ProjectTemplate home page.

+
+
+
+

Make code readable +

+
+

The most important part of writing code is making it readable and understandable. You want someone else to be able to pick up your code and understand what it does. More often than not, this someone will be you six months down the line, who will otherwise be cursing your past self.

+

Documentation: tell us what and why, not how +

+
+

When you first start out, your comments will often describe what a command does, since you’re still learning yourself and it can help to clarify concepts and remind you later. However, these comments aren’t particularly useful later on when you don’t remember what problem your code is trying to solve. Try to also include comments that tell you why you’re solving a problem, and what problem that is. The how can come after that: it’s an implementation detail you ideally shouldn’t have to worry about.
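To illustrate the difference, here is a small sketch with made-up temperature readings; the data and the 38-degree threshold are assumptions chosen for the example. The first comment only restates the code, while the second records the problem and the reason for the conversion:

```r
# Hypothetical example data: body temperatures recorded in Fahrenheit
temps_f <- c(98.6, 101.3, 97.9)

# A "how" comment merely restates the code:
# subtract 32, multiply by 5, divide by 9
temps_c <- (temps_f - 32) * 5 / 9

# A "what and why" comment records the problem being solved:
# The fever threshold we compare against (assumed here to be 38 degrees)
# is defined in Celsius, so the Fahrenheit readings are converted
# before flagging fevers.
has_fever <- temps_c >= 38
print(has_fever)
```

A reader returning to this code later can see from the second comment why the conversion exists, even without remembering the analysis.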

+

Keep your code modular +

+
+

Our recommendation is that you should separate your functions from your analysis scripts, and store them in a separate file that you source when you open the R session in your project. This approach is nice because it leaves you with an uncluttered analysis script, and a repository of useful functions that can be loaded into any analysis script in your project. It also lets you group related functions together easily.
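A minimal sketch of this layout follows. In a real project the functions would live in a file of their own (for instance functions.R, a hypothetical name); a temporary file stands in for it here so that the example runs anywhere:

```r
# Stand-in for a separate functions file; tempfile() is used only so that
# this example does not write into your project directory.
functions_file <- tempfile(fileext = ".R")
writeLines(
  c(
    "# Convert a temperature from Fahrenheit to Celsius",
    "fahr_to_celsius <- function(temp_f) {",
    "  (temp_f - 32) * 5 / 9",
    "}"
  ),
  functions_file
)

# In the analysis script: load the function definitions first...
source(functions_file)

# ...then use them, keeping the script itself short and readable.
boiling_c <- fahr_to_celsius(212)
print(boiling_c)
```

The analysis script then contains only the call to source() and the analysis steps, while related helper functions stay grouped in one place.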

+

Break down problems into bite-size pieces

+
+

When you first start out, problem solving and function writing can be daunting tasks, and hard to separate from coding inexperience. Try to break down your problem into digestible chunks and worry about the implementation details later: keep breaking down the problem into smaller and smaller functions until you reach a point where you can code a solution, and build back up from there.
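As a sketch of this approach, suppose the goal is to standardise a numeric vector so that it has mean 0 and standard deviation 1. The task and the function names are hypothetical, chosen only to show the decomposition into small steps that are then combined:

```r
# Step 1: shift the values so that their mean is zero
centre <- function(x) {
  x - mean(x)
}

# Step 2: scale the values so that their standard deviation is one
rescale <- function(x) {
  x / sd(x)
}

# Build back up: the full solution is a composition of the small pieces
standardise <- function(x) {
  rescale(centre(x))
}

z <- standardise(c(2, 4, 6, 8))
print(mean(z))
print(sd(z))
```

Each small function can be written, understood, and checked on its own before the pieces are assembled.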

+

Know that your code is doing the right thing +

+
+

Make sure to test your functions!
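One lightweight way to do this in base R is a small test function built on stopifnot(), which can be re-run whenever the code changes. The function under test below is a hypothetical example, not part of the lesson code:

```r
# Function under test (hypothetical example)
fahr_to_celsius <- function(temp_f) {
  (temp_f - 32) * 5 / 9
}

# Re-runnable checks: stopifnot() is silent when every condition is TRUE
# and stops with an error as soon as one is not.
test_fahr_to_celsius <- function() {
  stopifnot(
    fahr_to_celsius(32) == 0,                     # freezing point of water
    fahr_to_celsius(212) == 100,                  # boiling point of water
    isTRUE(all.equal(fahr_to_celsius(98.6), 37))  # tolerate floating-point error
  )
  invisible(TRUE)
}

test_fahr_to_celsius()
```

Dedicated packages such as testthat offer richer tooling, but even a handful of checks like these catches many mistakes early.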

+

Don’t repeat yourself +

+
+

Functions enable easy reuse within a project. If you see blocks of similar lines of code throughout your project, those are usually candidates for being moved into functions.

+

If your calculations are performed through a series of functions, then the project becomes more modular and easier to change. This is especially true when a particular input always gives a particular output.
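The sketch below shows such a refactoring on made-up data; the vectors and the function name value_spread are assumptions for illustration. The repeated calculation is moved into one function that always returns the same output for the same input:

```r
# Hypothetical data: measurements for three groups
group_a <- c(70.1, 72.4, 68.9)
group_b <- c(55.3, 58.0, 60.2)
group_c <- c(80.5, 79.8, 81.1)

# Repetitive version: the same calculation is copied for each group
# range_a <- max(group_a) - min(group_a)
# range_b <- max(group_b) - min(group_b)
# range_c <- max(group_c) - min(group_c)

# Refactored version: the shared logic lives in one function...
value_spread <- function(x) {
  max(x) - min(x)
}

# ...and is applied to every group, so a change is made in one place only
spreads <- sapply(list(a = group_a, b = group_b, c = group_c), value_spread)
print(spreads)
```

If the definition of the spread changes later, only value_spread() needs editing.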

+

Remember to be stylish +

+
+

Apply consistent style to your code.

+
+
+ +
+
+

Keypoints +

+
+
    +
  • Keep your project folder structured, organized and tidy.
  • Document what and why, not how.
  • Break programs into short single-purpose functions.
  • Write re-runnable tests.
  • Don’t repeat yourself.
  • Be consistent in naming, indentation, and other aspects of style.
+
+
+
+
+
+
+
+ + +
+ + +
+
+ +
+
+ + + + diff --git a/android-chrome-192x192.png b/android-chrome-192x192.png new file mode 100644 index 000000000..ed3c210ab Binary files /dev/null and b/android-chrome-192x192.png differ diff --git a/android-chrome-512x512.png b/android-chrome-512x512.png new file mode 100644 index 000000000..c88d96c1c Binary files /dev/null and b/android-chrome-512x512.png differ diff --git a/apple-touch-icon.png b/apple-touch-icon.png new file mode 100644 index 000000000..8044feefd Binary files /dev/null and b/apple-touch-icon.png differ diff --git a/assets/fonts/Mulish-Bold.ttf b/assets/fonts/Mulish-Bold.ttf new file mode 100644 index 000000000..1f522d476 Binary files /dev/null and b/assets/fonts/Mulish-Bold.ttf differ diff --git a/assets/fonts/Mulish-Bold.woff b/assets/fonts/Mulish-Bold.woff new file mode 100644 index 000000000..711448ea9 Binary files /dev/null and b/assets/fonts/Mulish-Bold.woff differ diff --git a/assets/fonts/Mulish-ExtraBold.ttf b/assets/fonts/Mulish-ExtraBold.ttf new file mode 100644 index 000000000..62850fff3 Binary files /dev/null and b/assets/fonts/Mulish-ExtraBold.ttf differ diff --git a/assets/fonts/mulish-v5-latin-regular.eot b/assets/fonts/mulish-v5-latin-regular.eot new file mode 100644 index 000000000..423bcb17a Binary files /dev/null and b/assets/fonts/mulish-v5-latin-regular.eot differ diff --git a/assets/fonts/mulish-v5-latin-regular.svg b/assets/fonts/mulish-v5-latin-regular.svg new file mode 100644 index 000000000..70341f98b --- /dev/null +++ b/assets/fonts/mulish-v5-latin-regular.svg @@ -0,0 +1,305 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git 
a/assets/fonts/mulish-v5-latin-regular.ttf b/assets/fonts/mulish-v5-latin-regular.ttf new file mode 100644 index 000000000..541bb406e Binary files /dev/null and b/assets/fonts/mulish-v5-latin-regular.ttf differ diff --git a/assets/fonts/mulish-v5-latin-regular.woff b/assets/fonts/mulish-v5-latin-regular.woff new file mode 100644 index 000000000..700ec13f5 Binary files /dev/null and b/assets/fonts/mulish-v5-latin-regular.woff differ diff --git a/assets/fonts/mulish-v5-latin-regular.woff2 b/assets/fonts/mulish-v5-latin-regular.woff2 new file mode 100644 index 000000000..b244298bf Binary files /dev/null and b/assets/fonts/mulish-v5-latin-regular.woff2 differ diff --git a/assets/fonts/mulish-variablefont_wght.woff b/assets/fonts/mulish-variablefont_wght.woff new file mode 100644 index 000000000..fc425383a Binary files /dev/null and b/assets/fonts/mulish-variablefont_wght.woff differ diff --git a/assets/fonts/mulish-variablefont_wght.woff2 b/assets/fonts/mulish-variablefont_wght.woff2 new file mode 100644 index 000000000..8a233c6f9 Binary files /dev/null and b/assets/fonts/mulish-variablefont_wght.woff2 differ diff --git a/assets/images/carpentries-logo-sm.svg b/assets/images/carpentries-logo-sm.svg new file mode 100644 index 000000000..da70d40ee --- /dev/null +++ b/assets/images/carpentries-logo-sm.svg @@ -0,0 +1,7 @@ + + + + + + + \ No newline at end of file diff --git a/assets/images/carpentries-logo.svg b/assets/images/carpentries-logo.svg new file mode 100644 index 000000000..6cbe66500 --- /dev/null +++ b/assets/images/carpentries-logo.svg @@ -0,0 +1,19 @@ + + + + + + + + + + + + + + + + + + + diff --git a/assets/images/data-logo-sm.svg b/assets/images/data-logo-sm.svg new file mode 100644 index 000000000..cf489be84 --- /dev/null +++ b/assets/images/data-logo-sm.svg @@ -0,0 +1,5 @@ + + + + + diff --git a/assets/images/data-logo.svg b/assets/images/data-logo.svg new file mode 100644 index 000000000..cf489be84 --- /dev/null +++ b/assets/images/data-logo.svg @@ -0,0 
+1,5 @@ + + + + + diff --git a/assets/images/dropdown-arrow.svg b/assets/images/dropdown-arrow.svg new file mode 100644 index 000000000..a12b04b34 --- /dev/null +++ b/assets/images/dropdown-arrow.svg @@ -0,0 +1,12 @@ + + + + +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Discussion

+

Last updated on 2023-10-26

+ + + +
+ +
+ + +

Please see our other R lesson for a different presentation of these concepts.

+ + +
+
+ + +
+
+ + + diff --git a/docsearch.css b/docsearch.css new file mode 100644 index 000000000..e5f1fe1df --- /dev/null +++ b/docsearch.css @@ -0,0 +1,148 @@ +/* Docsearch -------------------------------------------------------------- */ +/* + Source: https://github.com/algolia/docsearch/ + License: MIT +*/ + +.algolia-autocomplete { + display: block; + -webkit-box-flex: 1; + -ms-flex: 1; + flex: 1 +} + +.algolia-autocomplete .ds-dropdown-menu { + width: 100%; + min-width: none; + max-width: none; + padding: .75rem 0; + background-color: #fff; + background-clip: padding-box; + border: 1px solid rgba(0, 0, 0, .1); + box-shadow: 0 .5rem 1rem rgba(0, 0, 0, .175); +} + +@media (min-width:768px) { + .algolia-autocomplete .ds-dropdown-menu { + width: 175% + } +} + +.algolia-autocomplete .ds-dropdown-menu::before { + display: none +} + +.algolia-autocomplete .ds-dropdown-menu [class^=ds-dataset-] { + padding: 0; + background-color: rgb(255,255,255); + border: 0; + max-height: 80vh; +} + +.algolia-autocomplete .ds-dropdown-menu .ds-suggestions { + margin-top: 0 +} + +.algolia-autocomplete .algolia-docsearch-suggestion { + padding: 0; + overflow: visible +} + +.algolia-autocomplete .algolia-docsearch-suggestion--category-header { + padding: .125rem 1rem; + margin-top: 0; + font-size: 1.3em; + font-weight: 500; + color: #00008B; + border-bottom: 0 +} + +.algolia-autocomplete .algolia-docsearch-suggestion--wrapper { + float: none; + padding-top: 0 +} + +.algolia-autocomplete .algolia-docsearch-suggestion--subcategory-column { + float: none; + width: auto; + padding: 0; + text-align: left +} + +.algolia-autocomplete .algolia-docsearch-suggestion--content { + float: none; + width: auto; + padding: 0 +} + +.algolia-autocomplete .algolia-docsearch-suggestion--content::before { + display: none +} + +.algolia-autocomplete .ds-suggestion:not(:first-child) .algolia-docsearch-suggestion--category-header { + padding-top: .75rem; + margin-top: .75rem; + border-top: 1px solid rgba(0, 0, 0, .1) +} 
+ +.algolia-autocomplete .ds-suggestion .algolia-docsearch-suggestion--subcategory-column { + display: block; + padding: .1rem 1rem; + margin-bottom: 0.1; + font-size: 1.0em; + font-weight: 400 + /* display: none */ +} + +.algolia-autocomplete .algolia-docsearch-suggestion--title { + display: block; + padding: .25rem 1rem; + margin-bottom: 0; + font-size: 0.9em; + font-weight: 400 +} + +.algolia-autocomplete .algolia-docsearch-suggestion--text { + padding: 0 1rem .5rem; + margin-top: -.25rem; + font-size: 0.8em; + font-weight: 400; + line-height: 1.25 +} + +.algolia-autocomplete .algolia-docsearch-footer { + width: 110px; + height: 20px; + z-index: 3; + margin-top: 10.66667px; + float: right; + font-size: 0; + line-height: 0; +} + +.algolia-autocomplete .algolia-docsearch-footer--logo { + background-image: url("data:image/svg+xml;utf8,"); + background-repeat: no-repeat; + background-position: 50%; + background-size: 100%; + overflow: hidden; + text-indent: -9000px; + width: 100%; + height: 100%; + display: block; + transform: translate(-8px); +} + +.algolia-autocomplete .algolia-docsearch-suggestion--highlight { + color: #FF8C00; + background: rgba(232, 189, 54, 0.1) +} + + +.algolia-autocomplete .algolia-docsearch-suggestion--text .algolia-docsearch-suggestion--highlight { + box-shadow: inset 0 -2px 0 0 rgba(105, 105, 105, .5) +} + +.algolia-autocomplete .ds-suggestion.ds-cursor .algolia-docsearch-suggestion--content { + background-color: rgba(192, 192, 192, .15) +} diff --git a/docsearch.js b/docsearch.js new file mode 100644 index 000000000..b35504cd3 --- /dev/null +++ b/docsearch.js @@ -0,0 +1,85 @@ +$(function() { + + // register a handler to move the focus to the search bar + // upon pressing shift + "/" (i.e. 
"?") + $(document).on('keydown', function(e) { + if (e.shiftKey && e.keyCode == 191) { + e.preventDefault(); + $("#search-input").focus(); + } + }); + + $(document).ready(function() { + // do keyword highlighting + /* modified from https://jsfiddle.net/julmot/bL6bb5oo/ */ + var mark = function() { + + var referrer = document.URL ; + var paramKey = "q" ; + + if (referrer.indexOf("?") !== -1) { + var qs = referrer.substr(referrer.indexOf('?') + 1); + var qs_noanchor = qs.split('#')[0]; + var qsa = qs_noanchor.split('&'); + var keyword = ""; + + for (var i = 0; i < qsa.length; i++) { + var currentParam = qsa[i].split('='); + + if (currentParam.length !== 2) { + continue; + } + + if (currentParam[0] == paramKey) { + keyword = decodeURIComponent(currentParam[1].replace(/\+/g, "%20")); + } + } + + if (keyword !== "") { + $(".contents").unmark({ + done: function() { + $(".contents").mark(keyword); + } + }); + } + } + }; + + mark(); + }); +}); + +/* Search term highlighting ------------------------------*/ + +function matchedWords(hit) { + var words = []; + + var hierarchy = hit._highlightResult.hierarchy; + // loop to fetch from lvl0, lvl1, etc. 
+ for (var idx in hierarchy) { + words = words.concat(hierarchy[idx].matchedWords); + } + + var content = hit._highlightResult.content; + if (content) { + words = words.concat(content.matchedWords); + } + + // return unique words + var words_uniq = [...new Set(words)]; + return words_uniq; +} + +function updateHitURL(hit) { + + var words = matchedWords(hit); + var url = ""; + + if (hit.anchor) { + url = hit.url_without_anchor + '?q=' + escape(words.join(" ")) + '#' + hit.anchor; + } else { + url = hit.url + '?q=' + escape(words.join(" ")); + } + + return url; +} diff --git a/favicon-16x16.png b/favicon-16x16.png new file mode 100644 index 000000000..d44f8acb4 Binary files /dev/null and b/favicon-16x16.png differ diff --git a/favicon-32x32.png b/favicon-32x32.png new file mode 100644 index 000000000..63441d4c3 Binary files /dev/null and b/favicon-32x32.png differ diff --git a/favicons/cp/apple-touch-icon-114x114.png b/favicons/cp/apple-touch-icon-114x114.png new file mode 100644 index 000000000..a60b75810 Binary files /dev/null and b/favicons/cp/apple-touch-icon-114x114.png differ diff --git a/favicons/cp/apple-touch-icon-120x120.png b/favicons/cp/apple-touch-icon-120x120.png new file mode 100644 index 000000000..8f20a8f12 Binary files /dev/null and b/favicons/cp/apple-touch-icon-120x120.png differ diff --git a/favicons/cp/apple-touch-icon-144x144.png b/favicons/cp/apple-touch-icon-144x144.png new file mode 100644 index 000000000..4be151b14 Binary files /dev/null and b/favicons/cp/apple-touch-icon-144x144.png differ diff --git a/favicons/cp/apple-touch-icon-152x152.png b/favicons/cp/apple-touch-icon-152x152.png new file mode 100644 index 000000000..7d1d94395 Binary files /dev/null and b/favicons/cp/apple-touch-icon-152x152.png differ diff --git a/favicons/cp/apple-touch-icon-57x57.png b/favicons/cp/apple-touch-icon-57x57.png new file mode 100644 index 000000000..92309cef2 Binary files /dev/null and b/favicons/cp/apple-touch-icon-57x57.png differ diff --git 
a/favicons/cp/apple-touch-icon-60x60.png b/favicons/cp/apple-touch-icon-60x60.png new file mode 100644 index 000000000..de8148e58 Binary files /dev/null and b/favicons/cp/apple-touch-icon-60x60.png differ diff --git a/favicons/cp/apple-touch-icon-72x72.png b/favicons/cp/apple-touch-icon-72x72.png new file mode 100644 index 000000000..81d7e3d83 Binary files /dev/null and b/favicons/cp/apple-touch-icon-72x72.png differ diff --git a/favicons/cp/apple-touch-icon-76x76.png b/favicons/cp/apple-touch-icon-76x76.png new file mode 100644 index 000000000..15bca5c77 Binary files /dev/null and b/favicons/cp/apple-touch-icon-76x76.png differ diff --git a/favicons/cp/favicon-128.png b/favicons/cp/favicon-128.png new file mode 100644 index 000000000..e612cdc15 Binary files /dev/null and b/favicons/cp/favicon-128.png differ diff --git a/favicons/cp/favicon-16x16.png b/favicons/cp/favicon-16x16.png new file mode 100644 index 000000000..65b331112 Binary files /dev/null and b/favicons/cp/favicon-16x16.png differ diff --git a/favicons/cp/favicon-196x196.png b/favicons/cp/favicon-196x196.png new file mode 100644 index 000000000..0da938b27 Binary files /dev/null and b/favicons/cp/favicon-196x196.png differ diff --git a/favicons/cp/favicon-32x32.png b/favicons/cp/favicon-32x32.png new file mode 100644 index 000000000..0c1442e39 Binary files /dev/null and b/favicons/cp/favicon-32x32.png differ diff --git a/favicons/cp/favicon-96x96.png b/favicons/cp/favicon-96x96.png new file mode 100644 index 000000000..bed74ec8d Binary files /dev/null and b/favicons/cp/favicon-96x96.png differ diff --git a/favicons/cp/favicon.ico b/favicons/cp/favicon.ico new file mode 100644 index 000000000..4f2f2f11f Binary files /dev/null and b/favicons/cp/favicon.ico differ diff --git a/favicons/cp/mstile-144x144.png b/favicons/cp/mstile-144x144.png new file mode 100644 index 000000000..4be151b14 Binary files /dev/null and b/favicons/cp/mstile-144x144.png differ diff --git a/favicons/cp/mstile-150x150.png 
b/favicons/cp/mstile-150x150.png new file mode 100644 index 000000000..bf7ad5e79 Binary files /dev/null and b/favicons/cp/mstile-150x150.png differ diff --git a/favicons/cp/mstile-310x150.png b/favicons/cp/mstile-310x150.png new file mode 100644 index 000000000..6ac804843 Binary files /dev/null and b/favicons/cp/mstile-310x150.png differ diff --git a/favicons/cp/mstile-310x310.png b/favicons/cp/mstile-310x310.png new file mode 100644 index 000000000..b77814750 Binary files /dev/null and b/favicons/cp/mstile-310x310.png differ diff --git a/favicons/cp/mstile-70x70.png b/favicons/cp/mstile-70x70.png new file mode 100644 index 000000000..e612cdc15 Binary files /dev/null and b/favicons/cp/mstile-70x70.png differ diff --git a/favicons/dc/apple-touch-icon-114x114.png b/favicons/dc/apple-touch-icon-114x114.png new file mode 100644 index 000000000..edafbda13 Binary files /dev/null and b/favicons/dc/apple-touch-icon-114x114.png differ diff --git a/favicons/dc/apple-touch-icon-120x120.png b/favicons/dc/apple-touch-icon-120x120.png new file mode 100644 index 000000000..ee145ec5c Binary files /dev/null and b/favicons/dc/apple-touch-icon-120x120.png differ diff --git a/favicons/dc/apple-touch-icon-144x144.png b/favicons/dc/apple-touch-icon-144x144.png new file mode 100644 index 000000000..bf5070144 Binary files /dev/null and b/favicons/dc/apple-touch-icon-144x144.png differ diff --git a/favicons/dc/apple-touch-icon-152x152.png b/favicons/dc/apple-touch-icon-152x152.png new file mode 100644 index 000000000..bd596c816 Binary files /dev/null and b/favicons/dc/apple-touch-icon-152x152.png differ diff --git a/favicons/dc/apple-touch-icon-57x57.png b/favicons/dc/apple-touch-icon-57x57.png new file mode 100644 index 000000000..61c152735 Binary files /dev/null and b/favicons/dc/apple-touch-icon-57x57.png differ diff --git a/favicons/dc/apple-touch-icon-60x60.png b/favicons/dc/apple-touch-icon-60x60.png new file mode 100644 index 000000000..9daad3633 Binary files /dev/null and 
b/favicons/dc/apple-touch-icon-60x60.png differ diff --git a/favicons/dc/apple-touch-icon-72x72.png b/favicons/dc/apple-touch-icon-72x72.png new file mode 100644 index 000000000..2069520fc Binary files /dev/null and b/favicons/dc/apple-touch-icon-72x72.png differ diff --git a/favicons/dc/apple-touch-icon-76x76.png b/favicons/dc/apple-touch-icon-76x76.png new file mode 100644 index 000000000..3db01ca7d Binary files /dev/null and b/favicons/dc/apple-touch-icon-76x76.png differ diff --git a/favicons/dc/favicon-128.png b/favicons/dc/favicon-128.png new file mode 100644 index 000000000..9e3de2a49 Binary files /dev/null and b/favicons/dc/favicon-128.png differ diff --git a/favicons/dc/favicon-16x16.png b/favicons/dc/favicon-16x16.png new file mode 100644 index 000000000..4c9f9b8c5 Binary files /dev/null and b/favicons/dc/favicon-16x16.png differ diff --git a/favicons/dc/favicon-196x196.png b/favicons/dc/favicon-196x196.png new file mode 100644 index 000000000..588afc213 Binary files /dev/null and b/favicons/dc/favicon-196x196.png differ diff --git a/favicons/dc/favicon-32x32.png b/favicons/dc/favicon-32x32.png new file mode 100644 index 000000000..9c2ecbfbe Binary files /dev/null and b/favicons/dc/favicon-32x32.png differ diff --git a/favicons/dc/favicon-96x96.png b/favicons/dc/favicon-96x96.png new file mode 100644 index 000000000..ff13fc06e Binary files /dev/null and b/favicons/dc/favicon-96x96.png differ diff --git a/favicons/dc/favicon.ico b/favicons/dc/favicon.ico new file mode 100644 index 000000000..e4715f329 Binary files /dev/null and b/favicons/dc/favicon.ico differ diff --git a/favicons/dc/mstile-144x144.png b/favicons/dc/mstile-144x144.png new file mode 100644 index 000000000..bf5070144 Binary files /dev/null and b/favicons/dc/mstile-144x144.png differ diff --git a/favicons/dc/mstile-150x150.png b/favicons/dc/mstile-150x150.png new file mode 100644 index 000000000..c5844cca3 Binary files /dev/null and b/favicons/dc/mstile-150x150.png differ diff --git 
a/favicons/dc/mstile-310x150.png b/favicons/dc/mstile-310x150.png new file mode 100644 index 000000000..786813af8 Binary files /dev/null and b/favicons/dc/mstile-310x150.png differ diff --git a/favicons/dc/mstile-310x310.png b/favicons/dc/mstile-310x310.png new file mode 100644 index 000000000..9580653c6 Binary files /dev/null and b/favicons/dc/mstile-310x310.png differ diff --git a/favicons/dc/mstile-70x70.png b/favicons/dc/mstile-70x70.png new file mode 100644 index 000000000..9e3de2a49 Binary files /dev/null and b/favicons/dc/mstile-70x70.png differ diff --git a/favicons/lc/apple-touch-icon-114x114.png b/favicons/lc/apple-touch-icon-114x114.png new file mode 100644 index 000000000..6c83127ca Binary files /dev/null and b/favicons/lc/apple-touch-icon-114x114.png differ diff --git a/favicons/lc/apple-touch-icon-120x120.png b/favicons/lc/apple-touch-icon-120x120.png new file mode 100644 index 000000000..8334648f1 Binary files /dev/null and b/favicons/lc/apple-touch-icon-120x120.png differ diff --git a/favicons/lc/apple-touch-icon-144x144.png b/favicons/lc/apple-touch-icon-144x144.png new file mode 100644 index 000000000..5f32151ed Binary files /dev/null and b/favicons/lc/apple-touch-icon-144x144.png differ diff --git a/favicons/lc/apple-touch-icon-152x152.png b/favicons/lc/apple-touch-icon-152x152.png new file mode 100644 index 000000000..4e5c177ce Binary files /dev/null and b/favicons/lc/apple-touch-icon-152x152.png differ diff --git a/favicons/lc/apple-touch-icon-57x57.png b/favicons/lc/apple-touch-icon-57x57.png new file mode 100644 index 000000000..61f9c9c74 Binary files /dev/null and b/favicons/lc/apple-touch-icon-57x57.png differ diff --git a/favicons/lc/apple-touch-icon-60x60.png b/favicons/lc/apple-touch-icon-60x60.png new file mode 100644 index 000000000..ccb5ada1c Binary files /dev/null and b/favicons/lc/apple-touch-icon-60x60.png differ diff --git a/favicons/lc/apple-touch-icon-72x72.png b/favicons/lc/apple-touch-icon-72x72.png new file mode 100644 index 
000000000..517d459af Binary files /dev/null and b/favicons/lc/apple-touch-icon-72x72.png differ diff --git a/favicons/lc/apple-touch-icon-76x76.png b/favicons/lc/apple-touch-icon-76x76.png new file mode 100644 index 000000000..17454b311 Binary files /dev/null and b/favicons/lc/apple-touch-icon-76x76.png differ diff --git a/favicons/lc/favicon-128.png b/favicons/lc/favicon-128.png new file mode 100644 index 000000000..9d781c901 Binary files /dev/null and b/favicons/lc/favicon-128.png differ diff --git a/favicons/lc/favicon-16x16.png b/favicons/lc/favicon-16x16.png new file mode 100644 index 000000000..3c20abcc0 Binary files /dev/null and b/favicons/lc/favicon-16x16.png differ diff --git a/favicons/lc/favicon-196x196.png b/favicons/lc/favicon-196x196.png new file mode 100644 index 000000000..46baaf8f9 Binary files /dev/null and b/favicons/lc/favicon-196x196.png differ diff --git a/favicons/lc/favicon-32x32.png b/favicons/lc/favicon-32x32.png new file mode 100644 index 000000000..ed6701ea1 Binary files /dev/null and b/favicons/lc/favicon-32x32.png differ diff --git a/favicons/lc/favicon-96x96.png b/favicons/lc/favicon-96x96.png new file mode 100644 index 000000000..bc468c73a Binary files /dev/null and b/favicons/lc/favicon-96x96.png differ diff --git a/favicons/lc/favicon.ico b/favicons/lc/favicon.ico new file mode 100644 index 000000000..5c14e8091 Binary files /dev/null and b/favicons/lc/favicon.ico differ diff --git a/favicons/lc/mstile-144x144.png b/favicons/lc/mstile-144x144.png new file mode 100644 index 000000000..5f32151ed Binary files /dev/null and b/favicons/lc/mstile-144x144.png differ diff --git a/favicons/lc/mstile-150x150.png b/favicons/lc/mstile-150x150.png new file mode 100644 index 000000000..924953a84 Binary files /dev/null and b/favicons/lc/mstile-150x150.png differ diff --git a/favicons/lc/mstile-310x150.png b/favicons/lc/mstile-310x150.png new file mode 100644 index 000000000..e4dcda444 Binary files /dev/null and b/favicons/lc/mstile-310x150.png 
differ diff --git a/favicons/lc/mstile-310x310.png b/favicons/lc/mstile-310x310.png new file mode 100644 index 000000000..a12c87632 Binary files /dev/null and b/favicons/lc/mstile-310x310.png differ diff --git a/favicons/lc/mstile-70x70.png b/favicons/lc/mstile-70x70.png new file mode 100644 index 000000000..9d781c901 Binary files /dev/null and b/favicons/lc/mstile-70x70.png differ diff --git a/favicons/swc/apple-touch-icon-114x114.png b/favicons/swc/apple-touch-icon-114x114.png new file mode 100644 index 000000000..e5125f8c4 Binary files /dev/null and b/favicons/swc/apple-touch-icon-114x114.png differ diff --git a/favicons/swc/apple-touch-icon-120x120.png b/favicons/swc/apple-touch-icon-120x120.png new file mode 100644 index 000000000..0f97a0aec Binary files /dev/null and b/favicons/swc/apple-touch-icon-120x120.png differ diff --git a/favicons/swc/apple-touch-icon-144x144.png b/favicons/swc/apple-touch-icon-144x144.png new file mode 100644 index 000000000..7441446cc Binary files /dev/null and b/favicons/swc/apple-touch-icon-144x144.png differ diff --git a/favicons/swc/apple-touch-icon-152x152.png b/favicons/swc/apple-touch-icon-152x152.png new file mode 100644 index 000000000..45cc338e5 Binary files /dev/null and b/favicons/swc/apple-touch-icon-152x152.png differ diff --git a/favicons/swc/apple-touch-icon-57x57.png b/favicons/swc/apple-touch-icon-57x57.png new file mode 100644 index 000000000..e180a4a32 Binary files /dev/null and b/favicons/swc/apple-touch-icon-57x57.png differ diff --git a/favicons/swc/apple-touch-icon-60x60.png b/favicons/swc/apple-touch-icon-60x60.png new file mode 100644 index 000000000..c96fd6ce7 Binary files /dev/null and b/favicons/swc/apple-touch-icon-60x60.png differ diff --git a/favicons/swc/apple-touch-icon-72x72.png b/favicons/swc/apple-touch-icon-72x72.png new file mode 100644 index 000000000..aae014aa7 Binary files /dev/null and b/favicons/swc/apple-touch-icon-72x72.png differ diff --git a/favicons/swc/apple-touch-icon-76x76.png 
b/favicons/swc/apple-touch-icon-76x76.png new file mode 100644 index 000000000..2167f94a7 Binary files /dev/null and b/favicons/swc/apple-touch-icon-76x76.png differ diff --git a/favicons/swc/favicon-128.png b/favicons/swc/favicon-128.png new file mode 100644 index 000000000..f61df620c Binary files /dev/null and b/favicons/swc/favicon-128.png differ diff --git a/favicons/swc/favicon-16x16.png b/favicons/swc/favicon-16x16.png new file mode 100644 index 000000000..2d20a4061 Binary files /dev/null and b/favicons/swc/favicon-16x16.png differ diff --git a/favicons/swc/favicon-196x196.png b/favicons/swc/favicon-196x196.png new file mode 100644 index 000000000..2a20d3a6f Binary files /dev/null and b/favicons/swc/favicon-196x196.png differ diff --git a/favicons/swc/favicon-32x32.png b/favicons/swc/favicon-32x32.png new file mode 100644 index 000000000..f622b73a1 Binary files /dev/null and b/favicons/swc/favicon-32x32.png differ diff --git a/favicons/swc/favicon-96x96.png b/favicons/swc/favicon-96x96.png new file mode 100644 index 000000000..5e57f66a5 Binary files /dev/null and b/favicons/swc/favicon-96x96.png differ diff --git a/favicons/swc/favicon.ico b/favicons/swc/favicon.ico new file mode 100644 index 000000000..f771790f2 Binary files /dev/null and b/favicons/swc/favicon.ico differ diff --git a/favicons/swc/mstile-144x144.png b/favicons/swc/mstile-144x144.png new file mode 100644 index 000000000..7441446cc Binary files /dev/null and b/favicons/swc/mstile-144x144.png differ diff --git a/favicons/swc/mstile-150x150.png b/favicons/swc/mstile-150x150.png new file mode 100644 index 000000000..d1594bcb8 Binary files /dev/null and b/favicons/swc/mstile-150x150.png differ diff --git a/favicons/swc/mstile-310x150.png b/favicons/swc/mstile-310x150.png new file mode 100644 index 000000000..f7d58b2b9 Binary files /dev/null and b/favicons/swc/mstile-310x150.png differ diff --git a/favicons/swc/mstile-310x310.png b/favicons/swc/mstile-310x310.png new file mode 100644 index 
000000000..b632b421c Binary files /dev/null and b/favicons/swc/mstile-310x310.png differ diff --git a/favicons/swc/mstile-70x70.png b/favicons/swc/mstile-70x70.png new file mode 100644 index 000000000..f61df620c Binary files /dev/null and b/favicons/swc/mstile-70x70.png differ diff --git a/fig/01-rstudio-script.png b/fig/01-rstudio-script.png new file mode 100644 index 000000000..babbd2949 Binary files /dev/null and b/fig/01-rstudio-script.png differ diff --git a/fig/01-rstudio.png b/fig/01-rstudio.png new file mode 100644 index 000000000..0840386af Binary files /dev/null and b/fig/01-rstudio.png differ diff --git a/fig/06-rmd-generate-figures.sh b/fig/06-rmd-generate-figures.sh new file mode 100755 index 000000000..4cc231322 --- /dev/null +++ b/fig/06-rmd-generate-figures.sh @@ -0,0 +1,7 @@ +inkscape --export-png=06-rmd-inequality.0.png 06-rmd-inequality.0.svg +# use ImageMagick to grab top and bottom halves +# (surely there's a better way ... too much space at the bottom of the first) +convert 06-rmd-inequality.0.png -crop 100%x50% tmp.png +mv tmp-0.png 06-rmd-inequality.1.png +mv tmp-1.png 06-rmd-inequality.2.png + diff --git a/fig/06-rmd-inequality.0.png b/fig/06-rmd-inequality.0.png new file mode 100644 index 000000000..aa6d3f1e9 Binary files /dev/null and b/fig/06-rmd-inequality.0.png differ diff --git a/fig/06-rmd-inequality.0.svg b/fig/06-rmd-inequality.0.svg new file mode 100644 index 000000000..b8953dcbe --- /dev/null +++ b/fig/06-rmd-inequality.0.svg @@ -0,0 +1,311 @@ + + + + + + + + + + image/svg+xml + + + + + + + c("a", "b", "c") + c("a", "c") + + + + + + + FALSE + TRUE + ?? + ?? + c("a", "b", "c") + c("a", "c") + + + + + + + FALSE + TRUE + c("a"... 
+ FALSE + != + != + + diff --git a/fig/06-rmd-inequality.1.png b/fig/06-rmd-inequality.1.png new file mode 100644 index 000000000..580038505 Binary files /dev/null and b/fig/06-rmd-inequality.1.png differ diff --git a/fig/06-rmd-inequality.2.png b/fig/06-rmd-inequality.2.png new file mode 100644 index 000000000..d3f438dc1 Binary files /dev/null and b/fig/06-rmd-inequality.2.png differ diff --git a/fig/08-plot-ggplot2-rendered-axis-scale-1.png b/fig/08-plot-ggplot2-rendered-axis-scale-1.png new file mode 100644 index 000000000..3349b33e9 Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-axis-scale-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-blank-ggplot-1.png b/fig/08-plot-ggplot2-rendered-blank-ggplot-1.png new file mode 100644 index 000000000..14c48d3bf Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-blank-ggplot-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-ch1-sol-1.png b/fig/08-plot-ggplot2-rendered-ch1-sol-1.png new file mode 100644 index 000000000..6dcaa2a72 Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-ch1-sol-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-ch2-sol-1.png b/fig/08-plot-ggplot2-rendered-ch2-sol-1.png new file mode 100644 index 000000000..4559e25e1 Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-ch2-sol-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-ch3-sol-1.png b/fig/08-plot-ggplot2-rendered-ch3-sol-1.png new file mode 100644 index 000000000..8af5f41fa Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-ch3-sol-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-ch4a-sol-1.png b/fig/08-plot-ggplot2-rendered-ch4a-sol-1.png new file mode 100644 index 000000000..0f755f4c7 Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-ch4a-sol-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-ch4b-sol-1.png b/fig/08-plot-ggplot2-rendered-ch4b-sol-1.png new file mode 100644 index 000000000..e07c93aac Binary files /dev/null and 
b/fig/08-plot-ggplot2-rendered-ch4b-sol-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-ch5-sol-1.png b/fig/08-plot-ggplot2-rendered-ch5-sol-1.png new file mode 100644 index 000000000..a2cd52f11 Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-ch5-sol-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-facet-1.png b/fig/08-plot-ggplot2-rendered-facet-1.png new file mode 100644 index 000000000..bc7c4d02c Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-facet-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-ggplot-with-aes-1.png b/fig/08-plot-ggplot2-rendered-ggplot-with-aes-1.png new file mode 100644 index 000000000..70214d8b0 Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-ggplot-with-aes-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-lifeExp-layer-example-1-1.png b/fig/08-plot-ggplot2-rendered-lifeExp-layer-example-1-1.png new file mode 100644 index 000000000..2238582b0 Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-lifeExp-layer-example-1-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-lifeExp-line-1.png b/fig/08-plot-ggplot2-rendered-lifeExp-line-1.png new file mode 100644 index 000000000..a915f00ba Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-lifeExp-line-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-lifeExp-line-by-1.png b/fig/08-plot-ggplot2-rendered-lifeExp-line-by-1.png new file mode 100644 index 000000000..56157b9b6 Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-lifeExp-line-by-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-lifeExp-line-point-1.png b/fig/08-plot-ggplot2-rendered-lifeExp-line-point-1.png new file mode 100644 index 000000000..c02ce6add Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-lifeExp-line-point-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-lifeExp-vs-gdpPercap-scatter-1.png b/fig/08-plot-ggplot2-rendered-lifeExp-vs-gdpPercap-scatter-1.png new file mode 100644 index 000000000..44db5466d Binary files 
/dev/null and b/fig/08-plot-ggplot2-rendered-lifeExp-vs-gdpPercap-scatter-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-lifeExp-vs-gdpPercap-scatter3-1.png b/fig/08-plot-ggplot2-rendered-lifeExp-vs-gdpPercap-scatter3-1.png new file mode 100644 index 000000000..44db5466d Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-lifeExp-vs-gdpPercap-scatter3-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-lm-fit-1.png b/fig/08-plot-ggplot2-rendered-lm-fit-1.png new file mode 100644 index 000000000..f819c105f Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-lm-fit-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-lm-fit2-1.png b/fig/08-plot-ggplot2-rendered-lm-fit2-1.png new file mode 100644 index 000000000..d93c05d0e Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-lm-fit2-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-theme-1.png b/fig/08-plot-ggplot2-rendered-theme-1.png new file mode 100644 index 000000000..a9bd55f56 Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-theme-1.png differ diff --git a/fig/09-vectorization-rendered-ch2-sol-1.png b/fig/09-vectorization-rendered-ch2-sol-1.png new file mode 100644 index 000000000..99fe38be7 Binary files /dev/null and b/fig/09-vectorization-rendered-ch2-sol-1.png differ diff --git a/fig/09-vectorization-rendered-ch2-sol-2.png b/fig/09-vectorization-rendered-ch2-sol-2.png new file mode 100644 index 000000000..5b630819b Binary files /dev/null and b/fig/09-vectorization-rendered-ch2-sol-2.png differ diff --git a/fig/12-plyr-fig1.png b/fig/12-plyr-fig1.png new file mode 100644 index 000000000..249bab4fa Binary files /dev/null and b/fig/12-plyr-fig1.png differ diff --git a/fig/12-plyr-fig1.tex b/fig/12-plyr-fig1.tex new file mode 100644 index 000000000..ded41a78c --- /dev/null +++ b/fig/12-plyr-fig1.tex @@ -0,0 +1,143 @@ +\documentclass[convert]{standalone} + +\usepackage{tikz} +\usepackage{colortbl} +\renewcommand{\familydefault}{\sfdefault} + +\begin{document} + 
+\begin{tikzpicture} + +% Headings + +\node (INPUT-LABEL) at (0, 5) {Input Data}; +\node (GROUP-LABEL) at (3, 5) {Split}; +\node (SUMMARY-LABEL) at (6, 5) {Apply}; +\node (OUTPUT-LABEL) at (9, 5) {Combine}; + + +% Data Nodes + +\node (INPUT) at (0, 2) { + + \begin{tabular}{| c | r |} + \hline + \rowcolor[gray]{.7} + x & y \\ \hline + a & 2 \\ \hline + a & 4 \\ \hline + b & 0 \\ \hline + b & 5 \\ \hline + c & 5 \\ \hline + c & 10 \\ \hline + \end{tabular} + +}; + +\node (GROUP-A) at (3, 4) { + + \begin{tabular}{| c | r |} + \hline + \rowcolor[gray]{.7} + x & y \\ \hline + a & 2 \\ \hline + a & 4 \\ \hline + \end{tabular} + +}; + +\node (GROUP-B) at (3, 2) { + + \begin{tabular}{| c | r |} + \hline + \rowcolor[gray]{.7} + x & y \\ \hline + b & 0 \\ \hline + b & 5 \\ \hline + \end{tabular} + +}; + +\node (GROUP-C) at (3, 0) { + + \begin{tabular}{| c | r |} + \hline + \rowcolor[gray]{.7} + x & y \\ \hline + c & 5 \\ \hline + c & 10 \\ \hline + \end{tabular} + +}; + +\node (SUMMARY-A) at (6, 4) { + + \begin{tabular}{| c | r |} + \hline + \rowcolor[gray]{.7} + x & y \\ \hline + a & 3.0 \\ \hline + \end{tabular} + +}; + +\node (SUMMARY-B) at (6, 2) { + + \begin{tabular}{| c | r |} + \hline + \rowcolor[gray]{.7} + x & y \\ \hline + b & 2.5 \\ \hline + \end{tabular} + +}; + +\node (SUMMARY-C) at (6, 0) { + + \begin{tabular}{| c | r |} + \hline + \rowcolor[gray]{.7} + x & y \\ \hline + c & 7.5 \\ \hline + \end{tabular} + +}; + +\node (OUPUT) at (9, 2) { + + \begin{tabular}{| c | r |} + \hline + \rowcolor[gray]{.7} + x & y \\ \hline + a & 3.0 \\ \hline + b & 2.5 \\ \hline + c & 7.5 \\ \hline + \end{tabular} + +}; + + +% Arrows + +\draw[->, to path={-> (\tikztotarget)}] + (INPUT) edge (GROUP-A) + (INPUT) edge (GROUP-B) + (INPUT) edge (GROUP-C) + + (GROUP-A) edge (SUMMARY-A) + (GROUP-B) edge (SUMMARY-B) + (GROUP-C) edge (SUMMARY-C) + + (SUMMARY-A) edge (OUPUT) + (SUMMARY-B) edge (OUPUT) + (SUMMARY-C) edge (OUPUT) +; + +\end{tikzpicture} + +\end{document} + 
+%------------------------ +% References +% https://tex.stackexchange.com/questions/251642/draw-arrows-between-nodes-with-tikz +% https://tex.stackexchange.com/questions/11866/compile-a-latex-document-into-a-png-image-thats-as-short-as-possible diff --git a/fig/12-plyr-fig2.png b/fig/12-plyr-fig2.png new file mode 100644 index 000000000..d00d25f5c Binary files /dev/null and b/fig/12-plyr-fig2.png differ diff --git a/fig/12-plyr-fig2.tex b/fig/12-plyr-fig2.tex new file mode 100644 index 000000000..56fdfcd3f --- /dev/null +++ b/fig/12-plyr-fig2.tex @@ -0,0 +1,64 @@ +\documentclass[convert]{standalone} + +\usepackage{array} +\usepackage{multirow} +\usepackage{rotating} +\usepackage{colortbl} +\renewcommand{\familydefault}{\sfdefault} +\renewcommand{\arraystretch}{2.2} + +\begin{document} + +\begin{tabular}{crccccc} + +& +& \multicolumn{4}{c}{Output} +\\ + +& %\cellcolor[gray]{0.7} +& \cellcolor[gray]{0.7}array +& \cellcolor[gray]{0.7}data frame +& \cellcolor[gray]{0.7}list +& \cellcolor[gray]{0.7}nothing +\\ + +& \cellcolor[gray]{0.7}array +& aaply +& adply +& alply +& a\_ply +\\ + +& \cellcolor[gray]{0.7}data frame +& daply +& ddply +& dlply +& d\_ply +\\ + +& \cellcolor[gray]{0.7} list +& laply +& ldply +& llply +& l\_ply +\\ + +& \cellcolor[gray]{0.7}n replicates +& raply +& rdply +& rlply +& r\_ply +\\ + +\multirow{-5}{*}{\rotatebox[origin=c]{90}{Input}} +& \cellcolor[gray]{0.7}function arguments +& maply +& mdply +& mlply +& m\_ply +\\ + +\end{tabular} + + +\end{document} diff --git a/fig/12-plyr-generate-figures.sh b/fig/12-plyr-generate-figures.sh new file mode 100755 index 000000000..9236d34e0 --- /dev/null +++ b/fig/12-plyr-generate-figures.sh @@ -0,0 +1,10 @@ +#! 
/bin/bash + +pdflatex -shell-escape 12-plyr-fig1.tex + +rm 12-plyr-fig1.aux 12-plyr-fig1.log 12-plyr-fig1.pdf + +pdflatex -shell-escape 12-plyr-fig2.tex + +rm 12-plyr-fig2.aux 12-plyr-fig2.log 12-plyr-fig2.pdf + diff --git a/fig/13-dplyr-fig1.png b/fig/13-dplyr-fig1.png new file mode 100644 index 000000000..7f3067a3c Binary files /dev/null and b/fig/13-dplyr-fig1.png differ diff --git a/fig/13-dplyr-fig2.png b/fig/13-dplyr-fig2.png new file mode 100644 index 000000000..caa86d462 Binary files /dev/null and b/fig/13-dplyr-fig2.png differ diff --git a/fig/13-dplyr-fig3.png b/fig/13-dplyr-fig3.png new file mode 100644 index 000000000..ae00ce386 Binary files /dev/null and b/fig/13-dplyr-fig3.png differ diff --git a/fig/13-dplyr-generate-figures.R b/fig/13-dplyr-generate-figures.R new file mode 100644 index 000000000..4c3f4b223 --- /dev/null +++ b/fig/13-dplyr-generate-figures.R @@ -0,0 +1,383 @@ +# export figures manually +library(DiagrammeR) +##################################### 13-dplyr-fig2.png #####################################) +grViz('digraph html { + table1 [shape=none, margin=0,label=< + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
abcd
>]; + + table2 [shape=none, margin=0, label=< + + + + + + + + + + + + + + + + + + + + +
ac
>]; + + table1:f1:s -> table2:f1:s + table1:f0:n -> table2:f0:n + + subgraph { + rank = same; table1; table2; + } + + labelloc="t"; + fontname="Courier"; + label="select(data.frame, a, c)"; + } + ') + +##################################### 13-dplyr-fig2.png ##################################### +grViz('digraph html { + rankdir=LR; + table1 [shape=none, margin=0,label=< + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
abcd
1
1
2
2
3
3
>]; + table2 [shape=none, margin=0,label=< + + + + + + + + + + + + + + + + + + + + + + +
abcd
1
1
>]; + table3 [shape=none, margin=0,label=< + + + + + + + + + + + + + + + + + + + + + +
abcd
2
2
>]; + table4 [shape=none, margin=0,label=< + + + + + + + + + + + + + + + + + + + + + +
abcd
3
3
>]; + + table1:f0 -> table2:f0 + table1:f1 -> table3:f1 + table1:f2 -> table4:f2 + + + subgraph { + rank = same; table2; table3 ;table4; + } + + labelloc="t"; + fontname="Courier"; + label="gapminder %>%\\l\tgroup_by(a)"; + } + ') + +##################################### 13-dplyr-fig3.png ##################################### +grViz('digraph html { + rankdir=LR; + + table1 [shape=none, margin=0,label=< + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
abcd
1
1
2
2
3
3
>]; + + table2 [shape=none, margin=0,label=< + + + + + + + + + + + + + + + + + + + + + +
abcd
1
1
>]; + + table3 [shape=none, margin=0,label=< + + + + + + + + + + + + + + + + + + + + + +
abcd
2
2
>]; + + table4 [shape=none, margin=0,label=< + + + + + + + + + + + + + + + + + + + + + +
abcd
3
3
>]; + + table5 [shape=none, margin=0,label=< + + + + + + + + + + + + + + + + + + + + +
amean_b
1
2
3
>]; + + + table1:f0 -> table2:f0 + table1:f1 -> table3:f1 + table1:f2 -> table4:f2 + table2:f3:n -> table5:f0 + table3:f3 -> table5:f1 + table4:f3 -> table5:f2:w + + subgraph { + table1; table2; table3 ;table4; table5 + } + + subgraph { + rank = same; table2; table3; table4; + } + + labelloc="t"; + fontname="Courier"; + label="gapminder %>%\\l\tgroup_by(a) %>%\\l\tsummarize(mean_b=mean(b))\\l"; + } + ') diff --git a/fig/13-dplyr-rendered-unnamed-chunk-27-1.png b/fig/13-dplyr-rendered-unnamed-chunk-27-1.png new file mode 100644 index 000000000..bc7c4d02c Binary files /dev/null and b/fig/13-dplyr-rendered-unnamed-chunk-27-1.png differ diff --git a/fig/13-dplyr-rendered-unnamed-chunk-28-1.png b/fig/13-dplyr-rendered-unnamed-chunk-28-1.png new file mode 100644 index 000000000..bc7c4d02c Binary files /dev/null and b/fig/13-dplyr-rendered-unnamed-chunk-28-1.png differ diff --git a/fig/13-dplyr-rendered-unnamed-chunk-29-1.png b/fig/13-dplyr-rendered-unnamed-chunk-29-1.png new file mode 100644 index 000000000..75472eaee Binary files /dev/null and b/fig/13-dplyr-rendered-unnamed-chunk-29-1.png differ diff --git a/fig/14-tidyr-fig1.png b/fig/14-tidyr-fig1.png new file mode 100644 index 000000000..4ce006667 Binary files /dev/null and b/fig/14-tidyr-fig1.png differ diff --git a/fig/14-tidyr-fig2.png b/fig/14-tidyr-fig2.png new file mode 100644 index 000000000..7287d0194 Binary files /dev/null and b/fig/14-tidyr-fig2.png differ diff --git a/fig/14-tidyr-fig3.png b/fig/14-tidyr-fig3.png new file mode 100644 index 000000000..4c13aa57d Binary files /dev/null and b/fig/14-tidyr-fig3.png differ diff --git a/fig/14-tidyr-fig3.svg b/fig/14-tidyr-fig3.svg new file mode 100644 index 000000000..6f756ef15 --- /dev/null +++ b/fig/14-tidyr-fig3.svg @@ -0,0 +1,269 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +ID +a1 +a2 +a3 +ID +2 +1 +3 +2 +1 +3 +2 +1 +3 +2 +1 +3 +a1 +a2 +a3 +a1 +a2 +a3 +a1 +a2 +a3 +key +value +wide format +long 
format +pivot_longer(data, cols = c("a1", "a2", "a3"), names_to = "key", values_to = "value") + + + + + + + + + + + + + + + + +ID +a1 +a2 +a3 +2 +1 +3 + + + + +ID +2 +1 +3 + + + + +ID +2 +1 +3 + + + + + + + + + + + + + + + + +ID +a1 +a2 +a3 +2 +1 +3 + + + + +ID +2 +1 +3 + + + + +ID +2 +1 +3 + + + + + + + + + + + + + +ID +2 +1 +3 + + + + +ID +2 +1 +3 + + + + +ID +2 +1 +3 + + + +a1 +a2 +a3 + + + +a1 +a2 +a3 + + + +a1 +a2 +a3 + + + + + + + + +separate byselected columns + + + + + + + + +convert column names to column + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +name columns with keyand value arguments + diff --git a/fig/14-tidyr-fig4.png b/fig/14-tidyr-fig4.png new file mode 100644 index 000000000..fd5d68c64 Binary files /dev/null and b/fig/14-tidyr-fig4.png differ diff --git a/fig/14-tidyr-generate-figures.R b/fig/14-tidyr-generate-figures.R new file mode 100644 index 000000000..9ed954ae4 --- /dev/null +++ b/fig/14-tidyr-generate-figures.R @@ -0,0 +1,385 @@ +# export figures manually +library(DiagrammeR) +##################################### 14-tidyr-fig1.png ##################################### +grViz('digraph html { + + table1 [shape=none, margin=0, label=< + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
IDa1a2a3
>]; + + table2 [shape=none, margin=0,label=< + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
IDID2A
1a1
2a1
3a1
1a2
2a2
3a2
1a3
2a3
3a3
>]; + + subgraph { + rank = same; table1; table2; + } + + + labelloc="t"; + fontname="Courier"; + label="wide vs long"; + } + ') + +##################################### 14-tidyr-fig2.png ##################################### +grViz('digraph html { + table1 [shape=none, margin=0, label=< + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
continentcountrygdpPercap_1952gdpPercap_1957gdpPercap_...lifeExp_1952lifeExp_1957lifeExp_...pop_1952pop_1957pop_...
AfricaAlgeria
AfricaAngola
......
>]; + + labelloc="t"; + fontname="Courier"; + label="wide format"; + } + ') + +##################################### 14-tidyr-fig3.png ##################################### +grViz('digraph html { + + table1 [shape=none, margin=0, label=< + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
continentcountryobstype_yearobs_value
AfricaAlgeriagdpPercap_1952
AfricaAlgeriagdpPercap_1957
AfricaAlgeriagdpPercap_...
AfricaAlgerialifeExp_1952
AfricaAlgerialifeExp_1957
AfricaAlgerialifeExp_...
AfricaAlgeriapop_1952
AfricaAlgeriapop_1957
AfricaAlgeriapop_...
AfricaAngolagdpPercap_1952
AfricaAngolagdpPercap_1957
AfricaAngolagdpPercap_...
AfricaAngolalifeExp_1952
AfricaAngolalifeExp_1957
AfricaAngolalifeExp_...
AfricaAngolapop_1952
AfricaAngolapop_1957
AfricaAngolapop_...
Africa...gdpPercap_1952
Africa...gdpPercap_1957
Africa...gdpPercap_...
Africa...lifeExp_1952
Africa...lifeExp_1957
Africa...lifeExp_...
Africa...pop_1952
Africa...pop_1957
Africa...pop_...
>]; + + labelloc="t"; + fontname="Courier"; + label="long format"; + } + ') diff --git a/fig/15-knitr-markdown-rendered-rmd_to_html_fig-1.png b/fig/15-knitr-markdown-rendered-rmd_to_html_fig-1.png new file mode 100644 index 000000000..976f7bc79 Binary files /dev/null and b/fig/15-knitr-markdown-rendered-rmd_to_html_fig-1.png differ diff --git a/fig/New_R_Markdown.png b/fig/New_R_Markdown.png new file mode 100644 index 000000000..8542fe9bd Binary files /dev/null and b/fig/New_R_Markdown.png differ diff --git a/fig/bad_layout.png b/fig/bad_layout.png new file mode 100644 index 000000000..fcfda0c5a Binary files /dev/null and b/fig/bad_layout.png differ diff --git a/fig/rmd-06-equality.0.svg b/fig/rmd-06-equality.0.svg new file mode 100644 index 000000000..9671b0b3e --- /dev/null +++ b/fig/rmd-06-equality.0.svg @@ -0,0 +1,288 @@ + + + + + + + + + + image/svg+xml + + + + + + + c("a", "a", "a") + c("a", "c") + + + + + + + TRUE + FALSE + ?? + ?? + c("a", "a", "a") + c("a", "c") + + + + + + + TRUE + FALSE + c("a"... + TRUE + + diff --git a/fig/rmd-06-equality.1.png b/fig/rmd-06-equality.1.png new file mode 100644 index 000000000..f4152a338 Binary files /dev/null and b/fig/rmd-06-equality.1.png differ diff --git a/fig/rmd-06-equality.2.png b/fig/rmd-06-equality.2.png new file mode 100644 index 000000000..e33f4cf4f Binary files /dev/null and b/fig/rmd-06-equality.2.png differ diff --git a/fig/software-carpentry-banner.png b/fig/software-carpentry-banner.png new file mode 100644 index 000000000..746a9c53c Binary files /dev/null and b/fig/software-carpentry-banner.png differ diff --git a/fig/visual_mode_icon.png b/fig/visual_mode_icon.png new file mode 100644 index 000000000..d224e3cee Binary files /dev/null and b/fig/visual_mode_icon.png differ diff --git a/images.html b/images.html new file mode 100644 index 000000000..0fd00617b --- /dev/null +++ b/images.html @@ -0,0 +1,647 @@ + + + + + +R for Reproducible Scientific Analysis: All Images + + + + + + + + + + + +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + + +
+
+ + +

Introduction to R and RStudio

+
+

Figure 1

+ +
RStudio layout

+

Figure 2

+ +
RStudio layout with .R file open

Project Management With RStudio

+
+

Figure 1

+ +
Screenshot of file manager demonstrating bad project organisation

Seeking Help

+

Data Structures

+

Exploring Data Frames

+

Subsetting Data

+
+

Figure 1

+ +
Inequality testing

+

Figure 2

+ +
Inequality testing: results of recycling

Control Flow

+

Creating Publication-Quality Graphics with ggplot2

+
+

Figure 1

+ +
Blank plot, before adding any mapping aesthetics to ggplot().

+

Figure 2

+ +
Plotting area with axes for a scatter plot of life expectancy vs GDP, with no data points visible.

+

Figure 3

+ +
Scatter plot of life expectancy vs GDP per capita, now showing the data points.

+

Figure 4

+ +
Binned scatterplot of life expectancy versus year showing how life expectancy has increased over time
+Binned scatterplot of life expectancy versus year showing how life +expectancy has increased over time +

+

Figure 5

+ +
Binned scatterplot of life expectancy vs year with color-coded continents showing value of 'aes' function
+Binned scatterplot of life expectancy vs year with color-coded +continents showing value of ‘aes’ function +

+

Figure 6

+

+

Figure 7

+

+

Figure 8

+

+

Figure 9

+

+

Figure 10

+ +
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency.

+

Figure 11

+

+

Figure 12

+ +
Scatterplot of GDP vs life expectancy showing logarithmic x-axis data spread
+Scatterplot of GDP vs life expectancy showing logarithmic x-axis data +spread +

+

Figure 13

+ +
Scatter plot of life expectancy vs GDP per capita with a blue trend line summarising the relationship between variables, and gray shaded area indicating 95% confidence intervals for that trend line.

+

Figure 14

+ +
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The blue trend line is slightly thicker than in the previous figure.

+

Figure 15

+ +
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency.

+

Figure 16

+

+

Figure 17

+

+

Figure 18

+

+

Figure 19

+

Vectorization

+
+

Figure 1

+ +
Scatter plot showing populations in the millions against the year for China, India, and Indonesia; countries are not labeled.

+

Figure 2

+ +
Scatter plot showing populations in the millions against the year for China, India, and Indonesia; countries are not labeled.

Functions Explained

+

Writing Data

+

Splitting and Combining Data Frames with plyr

+
+

Figure 1

+ +
Split apply combine

+

Figure 2

+ +
Full apply suite

Data Frame Manipulation with dplyr

+
+

Figure 1

+ +

Diagram illustrating use of select function to select two columns of a data frame +If we want to remove one column only from the gapminder +data, for example, removing the continent column.

+
+

Figure 2

+ +
Diagram illustrating how the group by function organizes a data frame into groups

+

Figure 3

+ +
Diagram illustrating the use of group by and summarize together to create a new variable

+

Figure 4

+

+

Figure 5

+

+

Figure 6

+

Data Frame Manipulation with tidyr

+
+

Figure 1

+ +
Diagram illustrating the difference between a wide versus long layout of a data frame

+

Figure 2

+ +
Diagram illustrating the wide format of the gapminder data frame

+

Figure 3

+ +
Diagram illustrating how pivot longer reorganizes a data frame from a wide to long format

+

Figure 4

+ +
Diagram illustrating the long format of the gapminder data

Producing Reports With knitr

+
+

Figure 1

+ +
Screenshot of the New R Markdown file dialogue box in RStudio

+

Figure 2

+

+

Figure 3

+

RStudio versions 1.4 and later include visual markdown editing mode. +In visual editing mode, markdown expressions (like +**bold words**) are transformed to the formatted appearance +(bold words) as you type. This mode also includes a +toolbar at the top with basic formatting buttons, similar to what you +might see in common word processing software programs. You can turn +visual editing on and off by pressing the button in the top right corner of your +R Markdown document.

+

Writing Good Software

+
+
+
+
+ + +
+ + +
+ + + + + diff --git a/index.html b/index.html new file mode 100644 index 000000000..e07b6bb42 --- /dev/null +++ b/index.html @@ -0,0 +1,464 @@ + +R for Reproducible Scientific Analysis: Summary and Setup +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+

Summary and Setup

+ + +

an introduction to R for non-programmers using gapminder +data

+

The goal of this lesson is to teach novice programmers to write +modular code and to follow best practices for using R for data analysis. R is +commonly used in many scientific disciplines for statistical analysis +and for its array of third-party packages. We find that many scientists who +come to Software Carpentry workshops use R and want to learn more. The +emphasis of these materials is to give attendees a strong foundation in +the fundamentals of R, and to teach best practices for scientific +computing: breaking down analyses into modular units, task automation, +and encapsulation.

+

Note that this workshop will focus on teaching the fundamentals of +the programming language R, and will not teach statistical analysis.

+

The lesson contains more material than can be taught in a day. The instructor notes page has some +suggested lesson plans suitable for a one-day or half-day workshop.

+

A variety of third-party packages are used throughout this workshop. +These are not necessarily the best, nor are they comprehensive, but they +are packages we find useful, and have been chosen primarily for their +usability.

+
+
+ +
+
+

Prerequisites +

+
+

Understand that computers store data and instructions (programs, +scripts etc.) in files. Files are organised in directories (folders). +Know how to access files not in the working directory by specifying the +path.

+
+
+
+ + +

This lesson assumes you have R and RStudio installed on your +computer.

+
+ + +
+
+ + + diff --git a/instructor-notes.html b/instructor-notes.html new file mode 100644 index 000000000..ba32de996 --- /dev/null +++ b/instructor-notes.html @@ -0,0 +1,629 @@ + + + + + +R for Reproducible Scientific Analysis: Instructor Notes + + + + + + + + + + + +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + + +
+
+

Instructor Notes

+ + +

Timing +

+
+

Leave about 30 minutes at the start of each workshop and another 15 +minutes at the start of each session for technical difficulties such as WiFi +problems and software installation, even if you asked students to install in advance +(allow longer if you did not).

+

Lesson Plans +

+
+

The lesson contains much more material than can be taught in a day. +Instructors will need to pick an appropriate subset of episodes to use +in a standard one day course.

+

Some suggested paths through the material are:

+

(suggested by @liz-is)

+
    +
  • 01 Introduction to R and RStudio
  • +
  • 04 Data Structures
  • +
  • 05 Exploring Data Frames (“Realistic example” section onwards)
  • +
  • 08 Creating Publication-Quality Graphics with ggplot2
  • +
  • 10 Functions Explained
  • +
  • 13 Dataframe Manipulation with dplyr
  • +
  • 15 Producing Reports With knitr
  • +
+

(suggested by @naupaka)

+
    +
  • 01 Introduction to R and RStudio
  • +
  • 02 Project Management With RStudio
  • +
  • 03 Seeking Help
  • +
  • 04 Data Structures
  • +
  • 05 Exploring Data Frames
  • +
  • 06 Subsetting Data
  • +
  • 09 Vectorization
  • +
  • 08 Creating Publication-Quality Graphics with ggplot2 OR 13 +Dataframe Manipulation with dplyr
  • +
  • 15 Producing Reports With knitr
  • +
+

A half day course could consist of (suggested by @karawoo):

+
    +
  • 01 Introduction to R and RStudio
  • +
  • 04 Data Structures (only creating vectors with +c())
  • +
  • 05 Exploring Data Frames (“Realistic example” section onwards)
  • +
  • 06 Subsetting Data (excluding factor, matrix and list +subsetting)
  • +
  • 08 Creating Publication-Quality Graphics with ggplot2
  • +

Setting up git in RStudio +

+
+

There can be difficulties linking Git to RStudio, depending on the +operating system and its version. To make sure +Git is properly installed and configured, the learners should go to the +Options window in the RStudio application.

+
    +
  • +Mac OS X: +
      +
Go to RStudio -> Preferences… -> Git/SVN
    • +
    • Check and see whether there is a path to a file in the “Git +executable” window. If not, the next challenge is figuring out where Git +is located.
    • +
    • In the terminal enter which git and you will get a path +to the git executable. In the “Git executable” window you may have +difficulties finding the directory since OS X hides many of the +operating system files. While the file selection window is open, +pressing “Command-Shift-G” will pop up a text entry box where you will +be able to type or paste in the full path to your git executable: +e.g. /usr/bin/git or whatever else it might be.
    • +
    +
  • +
  • +Windows: +
      +
Go to Tools -> Global options… -> Git/SVN
    • +
    • If you use the Software Carpentry Installer, then ‘git.exe’ should +be installed at C:/Program Files/Git/bin/git.exe.
    • +
    +
  • +
+
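If learners have trouble locating the Git executable from a terminal, the same lookup can be attempted from the R console. This is a minimal sketch and not part of the original notes. It assumes Git is on the `PATH` seen by the R session, which may differ from the terminal's `PATH`. The variable name `git_path` is only illustrative.

```r
# Ask R where the `git` executable lives.
# Sys.which() returns an empty string if nothing is found on the PATH
# visible to this R session.
git_path <- Sys.which("git")

if (nzchar(git_path)) {
  message("Git found at: ", git_path)
} else {
  message("Git was not found on the PATH seen by R; locate it manually.")
}
```

If a path is reported, it can be pasted into the “Git executable” field described above.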

To prevent the learners from having to re-enter their password each +time they push a commit to GitHub, this command (which can be run from a +bash prompt) will make it so they only have to enter their password +once:

+
+

BASH +

+
$ git config --global credential.helper 'cache --timeout=10000000'
+
+

Pulling in Data +

+
+

The easiest way to get the data used in this lesson during a workshop +is to have attendees download the raw data from gapminder-data and gapminder-data-wide.

+

Attendees can use the File - Save As dialog in their +browser to save the file.

+

Overall +

+
+

Make sure to emphasize good practices: put code in scripts, and make +sure they’re version controlled. Encourage students to create script +files for challenges.

+

If you’re working in a cloud environment, get them to upload the +gapminder data after the second lesson.

+

Make sure to emphasize that matrices are vectors underneath the hood +and data frames are lists underneath the hood: this will explain a lot +of the esoteric behaviour encountered in basic operations.
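One possible way to demonstrate this point live is sketched below. The object names (`m`, `flat`, `df`) and the toy values are illustrative assumptions, not taken from the lesson data.

```r
# A matrix is an atomic vector with a `dim` attribute attached.
m <- matrix(1:6, nrow = 2)
dim(m)                 # 2 rows, 3 columns
flat <- as.vector(m)   # dropping the dimensions gives back the underlying vector
flat                   # elements are stored column by column

# A data frame is a list of equal-length columns.
df <- data.frame(
  id = 1:3,
  label = c("a", "b", "c"),
  stringsAsFactors = FALSE   # keep `label` as character in every R version
)
is.list(df)            # TRUE
length(df)             # counts columns (list elements), not rows
df[["label"]]          # list-style extraction returns the column vector
```

Showing `length(df)` next to `nrow(df)` tends to make the list structure concrete for learners.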

+

Vector recycling and function stacks are probably best explained with +diagrams on a whiteboard.
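Alongside a whiteboard diagram, a short console demonstration of recycling may help. The sketch below uses made-up vectors; the names `long_vec`, `short_vec`, `recycled_sum`, and `uneven` are illustrative only.

```r
long_vec  <- c(1, 2, 3, 4)
short_vec <- c(10, 20)

# The shorter vector is repeated until it matches the longer one:
# 1 + 10, 2 + 20, 3 + 10, 4 + 20
recycled_sum <- long_vec + short_vec
recycled_sum

# When the lengths are not multiples of each other, R still recycles
# but issues a warning; the third comparison is "c" against the recycled "a".
uneven <- c("a", "b", "c") != c("a", "c")
uneven
```

The second comparison mirrors the inequality-testing figures used in the subsetting episode, where a silent recycle produces an unexpected result.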

+

Be sure to actually go through examples of an R help page: help files +can be intimidating at first, but knowing how to read them is +tremendously useful.

+

Be sure to show the CRAN task views and look at one of the topics.

+

There’s a lot of content: move quickly through the earlier lessons. +They are extensive mostly for the purpose of learning by osmosis, so +that learners’ memory will be triggered later when they encounter a problem or +some esoteric behaviour.

+

Key lessons to take time on:

+
    +
  • Data subsetting - conceptually difficult for novices
  • +
  • Functions - learners especially struggle with this
  • +
  • Data structures - worth being thorough, but you can go through it +quickly.
  • +
+

Don’t worry about being correct or knowing the material +back-to-front. Use mistakes as teaching moments: the most vital skill +you can impart is how to debug and recover from unexpected errors.

+
+
+
+
+ + +
+ + +
+ + + + + diff --git a/instructor/01-rstudio-intro.html b/instructor/01-rstudio-intro.html new file mode 100644 index 000000000..b643c6f02 --- /dev/null +++ b/instructor/01-rstudio-intro.html @@ -0,0 +1,1470 @@ + +R for Reproducible Scientific Analysis: Introduction to R and RStudio +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Introduction to R and RStudio

+

Last updated on 2023-10-26

+ + + +

Estimated time 55 minutes

+ +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How to find your way around RStudio?
  • +
  • How to interact with R?
  • +
  • How to manage your environment?
  • +
  • How to install packages?
  • +
+
+
+
+
+
+

Objectives

+
  • Describe the purpose and use of each pane in the RStudio IDE
  • +
  • Locate buttons and options in the RStudio IDE
  • +
  • Define a variable
  • +
  • Assign data to a variable
  • +
  • Manage a workspace in an interactive R session
  • +
  • Use mathematical and comparison operators
  • +
  • Call functions
  • +
  • Manage packages
  • +
+
+
+
+
+

Motivation +

+

Science is a multi-step process: once you’ve designed an experiment +and collected data, the real fun begins! This lesson will teach you how +to start this process using R and RStudio. We will begin with raw data, +perform exploratory analyses, and learn how to plot results graphically. +This example starts with a dataset from gapminder.org containing population +information for many countries through time. Can you read the data into +R? Can you plot the population for Senegal? Can you calculate the +average income for countries on the continent of Asia? By the end of +these lessons you will be able to do things like plot the populations +for all of these countries in under a minute!

+

Before Starting The Workshop +

+

Please ensure you have the latest version of R and RStudio installed +on your machine. This is important, as some packages used in the +workshop may not install correctly (or at all) if R is not up to +date.

+

Introduction to RStudio +

+

Welcome to the R portion of the Software Carpentry workshop.

+

Throughout this lesson, we’re going to teach you some of the +fundamentals of the R language as well as some best practices for +organizing code for scientific projects that will make your life +easier.

+

We’ll be using RStudio: a free, open-source R Integrated Development +Environment (IDE). It provides a built-in editor, works on all platforms +(including on servers) and provides many advantages such as integration +with version control and project management.

+

Basic layout

+

When you first open RStudio, you will be greeted by three panels:

+
  • The interactive R console/Terminal (entire left)
  • +
  • Environment/History/Connections (tabbed in upper right)
  • +
  • Files/Plots/Packages/Help/Viewer (tabbed in lower right)
  • +
RStudio layout

Once you open files, such as R scripts, an editor panel will also +open in the top left.

+
RStudio layout with .R file open
+
+ +
+
+

R scripts +

+
+

Any commands that you write in the R console can be saved to a file and re-run later. Files containing R code to be run in this way are called R scripts. R scripts have .R at the end of their names to let you know what they are.

+
+
+
+

Workflow within RStudio +

+

There are two main ways one can work within RStudio:

+
  1. Test and play within the interactive R console, then copy code into a .R file to run later.
     • This works well when doing small tests and initially starting off.
     • It quickly becomes laborious.
  2. Start writing in a .R file and use RStudio’s shortcut keys for the Run command to push the current line, selected lines or modified lines to the interactive R console.
     • This is a great way to start; all your code is saved for later.
     • You will be able to run the file you create from within RStudio or using R’s source() function.
+
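As a rough illustration of the second workflow, the sketch below writes a tiny script to a temporary file and runs it with `source()`. The file contents and the names `script_path`, `radius` and `area` are invented for the example; in practice the script would be your own .R file.

```r
# Write a two-line script to a temporary file (a stand-in for your own .R file).
script_path <- tempfile(fileext = ".R")
writeLines(c("radius <- 2", "area <- pi * radius^2"), script_path)

# source() runs every line of the file in order, as if typed at the console.
source(script_path)
area
```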
+ +
+
+

Tip: Running segments of your code +

+
+

RStudio offers you great flexibility in running code from within the +editor window. There are buttons, menu choices, and keyboard shortcuts. +To run the current line, you can

+
  1. click on the Run button above the editor panel, or
  2. select “Run Lines” from the “Code” menu, or
  3. hit Ctrl+Return in Windows or Linux or ⌘+Return on OS X. (This shortcut can also be seen by hovering the mouse over the button.)

To run a block of code, select it and then Run. If you have modified a line of code within a block of code you have just run, there is no need to reselect the section and Run; you can use the next button along, Re-run the previous region. This will run the previous code block including the modifications you have made.
+
+
+

Introduction to R +

+

Much of your time in R will be spent in the R interactive console. +This is where you will run all of your code, and can be a useful +environment to try out ideas before adding them to an R script file. +This console in RStudio is the same as the one you would get if you +typed in R in your command-line environment.

+

The first thing you will see in the R interactive session is a bunch +of information, followed by a “>” and a blinking cursor. In many ways +this is similar to the shell environment you learned about during the +shell lessons: it operates on the same idea of a “Read, evaluate, print +loop”: you type in commands, R tries to execute them, and then returns a +result.

+

Using R as a calculator +

+

The simplest thing you could do with R is to do arithmetic:

+
+

R +

+
+1 + 100
+
+
+

OUTPUT +

+
[1] 101
+
+

And R will print out the answer, with a preceding “[1]”. [1] is the +index of the first element of the line being printed in the console. For +more information on indexing vectors, see Episode +6: Subsetting Data.
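The role of the bracketed index becomes clearer with a result that is too long for one line. In this small sketch (the name `countdown` is arbitrary), each printed line starts with the index of its first value; exactly where the lines break depends on the width of your console.

```r
countdown <- 130:100  # 31 values, usually enough to wrap onto several lines
countdown             # every output line begins with an index in brackets
countdown[1]          # 130, the element that the "[1]" label points at
```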

+

If you type in an incomplete command, R will wait for you to complete +it. If you are familiar with Unix Shell’s bash, you may recognize +this
+behavior from bash.

+
+

R +

+
> 1 +
+
+
+

OUTPUT +

+
+
+
+

Any time you hit return and the R session shows a “+” instead of a +“>”, it means it’s waiting for you to complete the command. If you +want to cancel a command you can hit Esc and RStudio will +give you back the “>” prompt.

+
+
+ +
+
+

Tip: Canceling commands +

+
+

If you’re using R from the command line instead of from within +RStudio, you need to use Ctrl+C instead of +Esc to cancel the command. This applies to Mac users as +well!

+

Canceling a command isn’t only useful for killing incomplete +commands: you can also use it to tell R to stop running code (for +example if it’s taking much longer than you expect), or to get rid of +the code you’re currently writing.

+
+
+
+

When using R as a calculator, the order of operations is the same as +you would have learned back in school.

+

From highest to lowest precedence:

+
  • Parentheses: (, ) +
  • +
  • Exponents: ^ or ** +
  • +
  • Multiply: * +
  • +
  • Divide: / +
  • +
  • Add: + +
  • +
  • Subtract: - +
  • +
+
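A few of these rules can be checked directly at the console. The lines below are a small sketch, not an exhaustive test. Note that in R, multiplication and division share one precedence level (as do addition and subtraction) and are evaluated left to right within it.

```r
2 ^ 3 == 2 ** 3   # TRUE: both spellings of exponentiation give 8
2 ^ 3 ^ 2         # 512: exponents group right to left, so this is 2 ^ (3 ^ 2)
-2 ^ 2            # -4: the exponent is applied before the minus sign
8 / 4 * 2         # 4: same precedence, evaluated left to right
10 - 4 - 3        # 3: same precedence, evaluated left to right
```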

R +

+
+3 + 5 * 2
+
+
+

OUTPUT +

+
[1] 13
+
+

Use parentheses to group operations in order to force the order of +evaluation if it differs from the default, or to make clear what you +intend.

+
+

R +

+
+(3 + 5) * 2
+
+
+

OUTPUT +

+
[1] 16
+
+

This can get unwieldy when not needed, but clarifies your intentions. +Remember that others may later read your code.

+
+

R +

+
+(3 + (5 * (2 ^ 2))) # hard to read
+3 + 5 * 2 ^ 2       # clear, if you remember the rules
+3 + 5 * (2 ^ 2)     # if you forget some rules, this might help
+
+

The text after each line of code is called a “comment”. Anything that +follows after the hash (or octothorpe) symbol # is ignored +by R when it executes code.

+

Very small or very large numbers are printed in scientific notation:

+
+

R +

+
+2/10000
+
+
+

OUTPUT +

+
[1] 2e-04
+
+

The e is shorthand for “multiplied by 10^XX”. So 2e-4 is shorthand for 2 * 10^(-4).

+

You can write numbers in scientific notation too:

+
+

R +

+
+5e3  # Note the lack of minus here
+
+
+

OUTPUT +

+
[1] 5000
+
+
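Scientific notation only affects how a number is displayed, not the value that is stored. As a brief sketch, `format()` can switch between the two display styles (the variable name `small` is just for illustration):

```r
small <- 2e-4

# The stored value equals 2 * 10^(-4), up to floating point tolerance.
isTRUE(all.equal(small, 2 * 10^(-4)))   # TRUE

format(small, scientific = FALSE)       # "0.0002"
format(5000, scientific = TRUE)         # "5e+03"
```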

Mathematical functions +

+

R has many built-in mathematical functions. To call a function, we can type its name, followed by opening and closing parentheses. Functions take arguments as inputs; anything we type inside the parentheses of a function is considered an argument. Depending on the function, the number of arguments can vary from none to multiple. For example:

+
+

R +

+
+getwd() #returns an absolute filepath
+
+

doesn’t require an argument, whereas for the next set of mathematical +functions we will need to supply the function a value in order to +compute the result.

+
+

R +

+
+sin(1)  # trigonometry functions
+
+
+

OUTPUT +

+
[1] 0.841471
+
+
+

R +

+
+log(1)  # natural logarithm
+
+
+

OUTPUT +

+
[1] 0
+
+
+

R +

+
+log10(10) # base-10 logarithm
+
+
+

OUTPUT +

+
[1] 1
+
+
+

R +

+
+exp(0.5) # e^(1/2)
+
+
+

OUTPUT +

+
[1] 1.648721
+
+
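To illustrate how the number of arguments varies, here is a short sketch using base R functions with zero, one and two arguments. Passing the second argument by name, as shown, is optional but tends to make calls easier to read.

```r
getwd()                      # no arguments
sqrt(16)                     # one argument: 4
log(8, base = 2)             # two arguments, the second given by name: 3
round(3.14159, digits = 2)   # 3.14
```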

Don’t worry about trying to remember every function in R. You can +look them up on Google, or if you can remember the start of the +function’s name, use the tab completion in RStudio.

+

This is one advantage that RStudio has over R on its own: it has auto-completion abilities that allow you to more easily look up functions, their arguments, and the values that they take.

+

Typing a ? before the name of a command will open the help page for that command. When using RStudio, this will open the ‘Help’ pane; if using R in the terminal, the help page will be shown either as plain text in the terminal or in your browser, depending on your platform and settings. The help page will include a detailed description of the command and how it works. Scrolling to the bottom of the help page will usually show a collection of code examples which illustrate command usage. We’ll go through an example later.

+

Comparing things +

+

We can also do comparisons in R:

+
+

R +

+
+1 == 1  # equality (note two equals signs, read as "is equal to")
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+

R +

+
+1 != 2  # inequality (read as "is not equal to")
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+

R +

+
+1 < 2  # less than
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+

R +

+
+1 <= 1  # less than or equal to
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+

R +

+
+1 > 0  # greater than
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+

R +

+
+1 >= -9 # greater than or equal to
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+
+ +
+
+

Tip: Comparing Numbers +

+
+

A word of warning about comparing numbers: you should never use +== to compare two numbers unless they are integers (a data +type which can specifically represent only whole numbers).

+

Computers may only represent decimal numbers with a certain degree of +precision, so two numbers which look the same when printed out by R, may +actually have different underlying representations and therefore be +different by a small margin of error (called Machine numeric +tolerance).

+

Instead you should use the all.equal function.
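The classic demonstration of this problem is sketched below. Because `all.equal()` returns a description of the difference rather than `FALSE` when values differ, it is usually wrapped in `isTRUE()` when used in a test.

```r
0.1 + 0.2 == 0.3                    # FALSE: the two sides differ in the last bits
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: equal within a small tolerance
2L == 2L                            # TRUE: integers are compared exactly
```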

+

Further reading: http://floating-point-gui.de/

+
+
+
+

Variables and assignment +

+

We can store values in variables using the assignment operator +<-, like this:

+
+

R +

+
+x <- 1/40
+
+

Notice that assignment does not print a value. Instead, we stored it +for later in something called a variable. +x now contains the value +0.025:

+
+

R +

+
+x
+
+
+

OUTPUT +

+
[1] 0.025
+
+

More precisely, the stored value is a decimal approximation +of this fraction called a floating point +number.

+

Look for the Environment tab in the top right panel of +RStudio, and you will see that x and its value have +appeared. Our variable x can be used in place of a number +in any calculation that expects a number:

+
+

R +

+
+log(x)
+
+
+

OUTPUT +

+
[1] -3.688879
+
+

Notice also that variables can be reassigned:

+
+

R +

+
+x <- 100
+
+

x used to contain the value 0.025 and now it has the +value 100.

+

Assignment values can contain the variable being assigned to:

+
+

R +

+
+x <- x + 1 #notice how RStudio updates its description of x on the top right tab
+y <- x * 2
+
+

The right hand side of the assignment can be any valid R expression. +The right hand side is fully evaluated before the assignment +occurs.

+

Variable names can contain letters, numbers, underscores and periods but no spaces. They must start with a letter or a period followed by a letter (they cannot start with a number nor an underscore). Variables beginning with a period are hidden variables. Different people use different conventions for long variable names; these include:

+
  • periods.between.words
  • +
  • underscores_between_words
  • +
  • camelCaseToSeparateWords
  • +

What you use is up to you, but be consistent.
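If you are unsure whether a name is allowed, one rough check is to compare it with what `make.names()` would turn it into: a name that comes back unchanged is syntactically valid. This is a heuristic sketch, and the candidate names are invented for the example.

```r
candidates <- c("total_mass", "total.mass", ".hidden",
                "2nd_try", "_tmp", "my-var")

# make.names() rewrites anything that is not a valid name,
# so a name is valid when it is returned unchanged.
is_valid <- candidates == make.names(candidates)
setNames(is_valid, candidates)
```

Here the first three names pass the check, while the last three (starting with a number, starting with an underscore, and containing a minus sign) do not.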

+

It is also possible to use the = operator for +assignment:

+
+

R +

+
+x = 1/40
+
+

But this is much less common among R users. The most important thing +is to be consistent with the operator you use. There +are occasionally places where it is less confusing to use +<- than =, and it is the most common symbol +used in the community. So the recommendation is to use +<-.

+
+
+ +
+
+

Challenge 1 +

+
+

Which of the following are valid R variable names?

+
+

R +

+
min_height
+max.height
+_age
+.mass
+MaxLength
+min-length
+2widths
+celsius2kelvin
+
+
+
+
+
+
+ +
+
+

The following can be used as R variables:

+
+

R +

+
+min_height
+max.height
+MaxLength
+celsius2kelvin
+
+

The following creates a hidden variable:

+
+

R +

+
+.mass
+
+

The following cannot be used as variable names:

+
+

R +

+
_age
+min-length
+2widths
+
+
+
+
+
+

Vectorization +

+

One final thing to be aware of is that R is vectorized, meaning that variables and functions can have vectors as values. In contrast to physics and mathematics, a vector in R is an ordered set of values that all have the same data type. For example:

+
+

R +

+
+1:5
+
+
+

OUTPUT +

+
[1] 1 2 3 4 5
+
+
+

R +

+
+2^(1:5)
+
+
+

OUTPUT +

+
[1]  2  4  8 16 32
+
+
+

R +

+
+x <- 1:5
+2^x
+
+
+

OUTPUT +

+
[1]  2  4  8 16 32
+
+

This is incredibly powerful; we will discuss this further in an +upcoming lesson.
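As a small preview, the sketch below shows that arithmetic, comparisons and functions are all applied to every element of a vector at once, with no explicit loop.

```r
x <- 1:5
x * 2        # 2 4 6 8 10
x > 3        # FALSE FALSE FALSE TRUE TRUE
sqrt(x)      # functions are applied element by element too
sum(x > 3)   # 2: each TRUE counts as 1 when summed
```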

+

Managing your environment +

+

There are a few useful commands you can use to interact with the R +session.

+

ls will list all of the variables and functions stored +in the global environment (your working R session):

+
+

R +

+
+ls()
+
+
+

OUTPUT +

+
[1] "x" "y"
+
+
+
+ +
+
+

Tip: hidden objects +

+
+

Like in the shell, ls will hide any variables or +functions starting with a “.” by default. To list all objects, type +ls(all.names=TRUE) instead
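The difference can be seen without touching your own workspace by using a throwaway environment. In this sketch, `env`, `visible` and `.hidden` are illustrative names only.

```r
env <- new.env()
assign("visible", 1, envir = env)
assign(".hidden", 2, envir = env)

ls(env)                     # only "visible" is listed
ls(env, all.names = TRUE)   # both objects are listed
```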

+
+
+
+

Note here that we didn’t give any arguments to ls, but +we still needed to give the parentheses to tell R to call the +function.

+

If we type ls by itself, R prints a bunch of code +instead of a listing of objects.

+
+

R +

+
+ls
+
+
+

OUTPUT +

+
function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE, 
+    pattern, sorted = TRUE) 
+{
+    if (!missing(name)) {
+        pos <- tryCatch(name, error = function(e) e)
+        if (inherits(pos, "error")) {
+            name <- substitute(name)
+            if (!is.character(name)) 
+                name <- deparse(name)
+            warning(gettextf("%s converted to character string", 
+                sQuote(name)), domain = NA)
+            pos <- name
+        }
+    }
+    all.names <- .Internal(ls(envir, all.names, sorted))
+    if (!missing(pattern)) {
+        if ((ll <- length(grep("[", pattern, fixed = TRUE))) && 
+            ll != length(grep("]", pattern, fixed = TRUE))) {
+            if (pattern == "[") {
+                pattern <- "\\["
+                warning("replaced regular expression pattern '[' by  '\\\\['")
+            }
+            else if (length(grep("[^\\\\]\\[<-", pattern))) {
+                pattern <- sub("\\[<-", "\\\\\\[<-", pattern)
+                warning("replaced '[<-' by '\\\\[<-' in regular expression pattern")
+            }
+        }
+        grep(pattern, all.names, value = TRUE)
+    }
+    else all.names
+}
+<bytecode: 0x557b0600c360>
+<environment: namespace:base>
+
+

What’s going on here?

+

Like everything in R, ls is the name of an object, and +entering the name of an object by itself prints the contents of the +object. The object x that we created earlier contains 1, 2, +3, 4, 5:

+
+

R +

+
+x
+
+
+

OUTPUT +

+
[1] 1 2 3 4 5
+
+

The object ls contains the R code that makes the +ls function work! We’ll talk more about how functions work +and start writing our own later.

+

You can use rm to delete objects you no longer need:

+
+

R +

+
+rm(x)
+
+

If you have lots of things in your environment and want to delete all +of them, you can pass the results of ls to the +rm function:

+
+

R +

+
+rm(list = ls())
+
+

In this case we’ve combined the two. Like the order of operations, +anything inside the innermost parentheses is evaluated first, and so +on.

+

In this case we’ve specified that the results of ls +should be used for the list argument in rm. +When assigning values to arguments by name, you must use the += operator!!

+

If instead we use <-, there will be unintended side +effects, or you may get an error message:

+
+

R +

+
+rm(list <- ls())
+
+
+

ERROR +

+
Error in rm(list <- ls()): ... must contain names or character strings
+
+
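Named arguments also make it possible to delete only some objects. The sketch below assumes a naming convention in which temporary objects start with `tmp_`, and it works inside a scratch environment so that your own workspace is left alone; all object names are hypothetical.

```r
scratch <- new.env()
assign("tmp_a", 1, envir = scratch)
assign("tmp_b", 2, envir = scratch)
assign("keep_me", 3, envir = scratch)

# list = and envir = are matched by name, so = is required here
rm(list = ls(envir = scratch, pattern = "^tmp_"), envir = scratch)
ls(scratch)   # "keep_me"
```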
+
+ +
+
+

Tip: Warnings vs. Errors +

+
+

Pay attention when R does something unexpected! Errors, like above, +are thrown when R cannot proceed with a calculation. Warnings on the +other hand usually mean that the function has run, but it probably +hasn’t worked as expected.

+

In both cases, the message that R prints out usually gives you clues about how to fix the problem.

+
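The distinction can be made visible with `tryCatch()`, which lets code respond differently to each kind of condition. This is a minimal sketch; the helper name `classify` is made up for the example, and the exact message text can vary with your R version and language settings.

```r
# A warning: the function still returns a value (here NaN).
neg_log <- suppressWarnings(log(-1))
is.nan(neg_log)   # TRUE

# Report whether evaluating an expression succeeds, warns, or fails.
classify <- function(expr) {
  tryCatch({ expr; "ok" },
           warning = function(w) paste("warning:", conditionMessage(w)),
           error   = function(e) paste("error:", conditionMessage(e)))
}

classify(log(10))      # "ok"
classify(log(-1))      # starts with "warning:"
classify(log("ten"))   # starts with "error:"
```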
+
+
+

R Packages +

+

It is possible to add functions to R by writing a package, or by obtaining a package written by someone else. As of this writing, there are over 10,000 packages available on CRAN (the Comprehensive R Archive Network). R and RStudio have functionality for managing packages:

+
  • You can see what packages are installed by typing +installed.packages() +
  • +
  • You can install packages by typing +install.packages("packagename"), where +packagename is the package name, in quotes.
  • +
  • You can update installed packages by typing +update.packages() +
  • +
  • You can remove a package with +remove.packages("packagename") +
  • +
  • You can make a package available for use with +library(packagename) +
  • +

Packages can also be viewed, loaded, and detached in the Packages tab +of the lower right panel in RStudio. Clicking on this tab will display +all of the installed packages with a checkbox next to them. If the box +next to a package name is checked, the package is loaded and if it is +empty, the package is not loaded. Click an empty box to load that +package and click a checked box to detach that package.

+

Packages can be installed and updated from the Packages tab with the Install and Update buttons at the top of the tab.

+
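The same checks can be done from the console. The sketch below uses the `stats` package, which ships with R, so that it runs without installing anything; substitute any package name you are interested in.

```r
pkg <- "stats"

pkg %in% rownames(installed.packages())   # TRUE if the package is installed
requireNamespace(pkg, quietly = TRUE)     # TRUE if it can be loaded
library(stats)                            # attach it for use
"package:stats" %in% search()             # TRUE once it is attached
```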
+
+ +
+
+

Challenge 2 +

+
+

What will be the value of each variable after each statement in the +following program?

+
+

R +

+
+mass <- 47.5
+age <- 122
+mass <- mass * 2.3
+age <- age - 20
+
+
+
+
+
+
+ +
+
+
+

R +

+
+mass <- 47.5
+
+

This will give a value of 47.5 for the variable mass

+
+

R +

+
+age <- 122
+
+

This will give a value of 122 for the variable age

+
+

R +

+
+mass <- mass * 2.3
+
+

This will multiply the existing value of 47.5 by 2.3 to give a new +value of 109.25 to the variable mass.

+
+

R +

+
+age <- age - 20
+
+

This will subtract 20 from the existing value of 122 to give a new +value of 102 to the variable age.

+
+
+
+
+
+
+ +
+
+

Challenge 3 +

+
+

Run the code from the previous challenge, and write a command to +compare mass to age. Is mass larger than age?

+
+
+
+
+
+ +
+
+

One way of answering this question in R is to use the > operator to set up the following comparison:

+
+

R +

+
+mass > age
+
+
+

OUTPUT +

+
[1] TRUE
+
+

This should yield a boolean value of TRUE since 109.25 is greater +than 102.

+
+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

Clean up your working environment by deleting the mass and age +variables.

+
+
+
+
+
+ +
+
+

We can use the rm command to accomplish this task

+
+

R +

+
+rm(age, mass)
+
+
+
+
+
+
+
+ +
+
+

Challenge 5 +

+
+

Install the following packages: ggplot2, +plyr, gapminder

+
+
+
+
+
+ +
+
+

We can use the install.packages() command to install the +required packages.

+
+

R +

+
+install.packages("ggplot2")
+install.packages("plyr")
+install.packages("gapminder")
+
+

An alternate solution, to install multiple packages with a single +install.packages() command is:

+
+

R +

+
+install.packages(c("ggplot2", "plyr", "gapminder"))
+
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
  • Use RStudio to write and run R programs.
  • +
  • R has the usual arithmetic operators and mathematical +functions.
  • +
  • Use <- to assign values to variables.
  • +
  • Use ls() to list the variables in a program.
  • +
  • Use rm() to delete objects in a program.
  • +
  • Use install.packages() to install packages +(libraries).
  • +
+
+
+
+
+ + +
+
+ + + diff --git a/instructor/02-project-intro.html b/instructor/02-project-intro.html new file mode 100644 index 000000000..7f0b12b7a --- /dev/null +++ b/instructor/02-project-intro.html @@ -0,0 +1,822 @@ + +R for Reproducible Scientific Analysis: Project Management With RStudio +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Project Management With RStudio

+

Last updated on 2023-10-26

+ + + +

Estimated time 30 minutes

+ +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I manage my projects in R?
  • +
+
+
+
+
+
+

Objectives

+
  • Create self-contained projects in RStudio
  • +
+
+
+
+
+

Introduction +

+

The scientific process is naturally incremental, and many projects +start life as random notes, some code, then a manuscript, and eventually +everything is a bit mixed together.

+ +

Most people tend to organize their projects like this:

+
Screenshot of file manager demonstrating bad project organisation

There are many reasons why we should ALWAYS avoid this:

+
  1. It is really hard to tell which version of your data is the original and which is the modified;
  2. It gets really messy because it mixes files with various extensions together;
  3. It probably takes you a lot of time to actually find things, and relate the correct figures to the exact code that has been used to generate them.

A good project layout will ultimately make your life easier:

+
  • It will help ensure the integrity of your data;
  • +
  • It makes it simpler to share your code with someone else (a +lab-mate, collaborator, or supervisor);
  • +
  • It allows you to easily upload your code with your manuscript +submission;
  • +
  • It makes it easier to pick the project back up after a break.
  • +

A possible solution +

+

Fortunately, there are tools and packages which can help you manage +your work effectively.

+

One of the most powerful and useful aspects of RStudio is its project +management functionality. We’ll be using this today to create a +self-contained, reproducible project.

+
+
+ +
+
+

Challenge 1: Creating a self-contained +project +

+
+

We’re going to create a new project in RStudio:

+
  1. Click the “File” menu button, then “New Project”.
  2. Click “New Directory”.
  3. Click “New Project”.
  4. Type in the name of the directory to store your project, e.g. “my_project”.
  5. If available, select the checkbox for “Create a git repository.”
  6. Click the “Create Project” button.
+
+
+

The simplest way to open an RStudio project once it has been created +is to click through your file system to get to the directory where it +was saved and double click on the .Rproj file. This will +open RStudio and start your R session in the same directory as the +.Rproj file. All your data, plots and scripts will now be +relative to the project directory. RStudio projects have the added +benefit of allowing you to open multiple projects at the same time each +open to its own project directory. This allows you to keep multiple +projects open without them interfering with each other.

+
+
+ +
+
+

Challenge 2: Opening an RStudio project +through the file system +

+
+
  1. Exit RStudio.
  2. Navigate to the directory where you created a project in Challenge 1.
  3. Double click on the .Rproj file in that directory.
+
+
+

Best practices for project organization +

+

Although there is no “best” way to lay out a project, there are some +general principles to adhere to that will make project management +easier:

+
+

Treat data as read only

+

This is probably the most important goal of setting up a project. Data are typically time consuming and/or expensive to collect. Working with them interactively (e.g., in Excel), where they can be modified, means you are never sure of where the data came from, or how they have been modified since collection. It is therefore a good idea to treat your data as “read-only”.

+
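One optional way to back this habit up is to mark raw data files as read-only at the file-system level. The sketch below uses a temporary file as a stand-in for a raw data file; it assumes a Unix-like system, since on Windows `Sys.chmod()` only has a limited effect.

```r
# Create a small stand-in for a raw data file.
raw_file <- tempfile(fileext = ".csv")
writeLines(c("sample,value", "a,1.5"), raw_file)

# Octal mode 444 means "read-only for everyone" on Unix-like systems.
Sys.chmod(raw_file, mode = "0444")
file.mode(raw_file)
```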
+
+

Data Cleaning

+

In many cases your data will be “dirty”: it will need significant preprocessing to get into a format R (or any other programming language) will find useful. This task is sometimes called “data munging”. Storing the data-cleaning scripts in a separate folder, and creating a second “read-only” data folder to hold the “cleaned” data sets, can prevent confusion between the two sets.

+
+
+

Treat generated output as disposable

+

Anything generated by your scripts should be treated as disposable: +it should all be able to be regenerated from your scripts.

+

There are lots of different ways to manage this output. Having an output folder with different sub-directories for each separate analysis makes it easier later, since many analyses are exploratory and don’t end up being used in the final project, and some of the analyses get shared between projects.

+
+
+ +
+
+

Tip: Good Enough Practices for Scientific +Computing +

+
+

Good +Enough Practices for Scientific Computing gives the following +recommendations for project organization:

+
  1. Put each project in its own directory, which is named after the project.
  2. Put text documents associated with the project in the doc directory.
  3. Put raw data and metadata in the data directory, and files generated during cleanup and analysis in a results directory.
  4. Put source for the project’s scripts and programs in the src directory, and programs brought in from elsewhere or compiled locally in the bin directory.
  5. Name all files to reflect their content or function.
+
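The recommended layout can be created from R itself. This sketch builds the directories under a temporary location so it is safe to run anywhere; the project name `my_project` is only an example, and in a real project you would use your project directory instead of `tempdir()`.

```r
project_root <- file.path(tempdir(), "my_project")
subdirs <- c("data", "doc", "results", "src", "bin")

# recursive = TRUE also creates project_root itself if it is missing
for (d in subdirs) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

list.files(project_root)
```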
+
+
+
+

Separate function definition and application

+

One of the more effective ways to work with R is to start by writing +the code you want to run directly in a .R script, and then running the +selected lines (either using the keyboard shortcuts in RStudio or +clicking the “Run” button) in the interactive R console.

+

When your project is in its early stages, the initial .R script file usually contains many lines of directly executed code. As it matures, reusable chunks get pulled into their own functions. It’s a good idea to separate these into two folders: one to store useful functions that you’ll reuse across analyses and projects, and one to store the analysis scripts.

+
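As a sketch of this separation, the example below keeps a function definition in one file and loads it with `source()` before applying it. The folder name `functions`, the file name `temperature.R` and the function `fahr_to_celsius` are all hypothetical, and a temporary directory stands in for the project.

```r
project_root <- file.path(tempdir(), "demo_project")
dir.create(file.path(project_root, "functions"),
           recursive = TRUE, showWarnings = FALSE)

# Definition: lives in its own file under functions/
functions_file <- file.path(project_root, "functions", "temperature.R")
writeLines("fahr_to_celsius <- function(temp_f) (temp_f - 32) * 5 / 9",
           functions_file)

# Application: an analysis script would begin by loading the definitions
source(functions_file)
fahr_to_celsius(c(32, 212))   # 0 100
```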
+
+

Save the data in the data directory

+

Now that we have a good directory structure, we will save the data file in the data/ directory.

+
+
+ +
+
+

Challenge 3 +

+
+

Download the gapminder data from here.

+
  1. Download the file (right mouse click on the link above -> “Save link as” / “Save file as”, or click on the link and after the page loads, press Ctrl+S or choose File -> “Save page as”)
  2. Make sure it’s saved under the name gapminder_data.csv
  3. Save the file in the data/ folder within your project.

We will load and inspect these data later.

+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

It is useful to get some general idea about the dataset, directly +from the command line, before loading it into R. Understanding the +dataset better will come in handy when making decisions on how to load +it in R. Use the command-line shell to answer the following +questions:

+
  1. What is the size of the file?
  2. How many rows of data does it contain?
  3. What kinds of values are stored in this file?
+
+
+
+
+ +
+
+

By running these commands in the shell:

+
+

SH +

+
ls -lh data/gapminder_data.csv
+
+
+

OUTPUT +

+
-rw-r--r-- 1 runner docker 80K Oct 26 09:54 data/gapminder_data.csv
+
+

The file size is 80K.

+
+

SH +

+
wc -l data/gapminder_data.csv
+
+
+

OUTPUT +

+
1705 data/gapminder_data.csv
+
+

There are 1705 lines. The data looks like:

+
+

SH +

+
head data/gapminder_data.csv
+
+
+

OUTPUT +

+
country,year,pop,continent,lifeExp,gdpPercap
+Afghanistan,1952,8425333,Asia,28.801,779.4453145
+Afghanistan,1957,9240934,Asia,30.332,820.8530296
+Afghanistan,1962,10267083,Asia,31.997,853.10071
+Afghanistan,1967,11537966,Asia,34.02,836.1971382
+Afghanistan,1972,13079460,Asia,36.088,739.9811058
+Afghanistan,1977,14880372,Asia,38.438,786.11336
+Afghanistan,1982,12881816,Asia,39.854,978.0114388
+Afghanistan,1987,13867957,Asia,40.822,852.3959448
+Afghanistan,1992,16317921,Asia,41.674,649.3413952
+
+
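If a shell is not available, roughly the same questions can be answered from within R. The sketch below writes the first few lines shown above to a temporary file so that it runs on its own; with the real data you would point `csv_path` at data/gapminder_data.csv instead.

```r
# A three-line stand-in for data/gapminder_data.csv
csv_path <- tempfile(fileext = ".csv")
writeLines(c("country,year,pop,continent,lifeExp,gdpPercap",
             "Afghanistan,1952,8425333,Asia,28.801,779.4453145",
             "Afghanistan,1957,9240934,Asia,30.332,820.8530296"),
           csv_path)

file.size(csv_path)           # size in bytes (compare with ls -lh)
length(readLines(csv_path))   # number of lines, header included (compare with wc -l)

gap <- read.csv(csv_path)
head(gap)                     # first rows (compare with head)
sapply(gap, class)            # the kind of value stored in each column
```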
+
+
+
+
+
+ +
+
+

Tip: command line in RStudio +

+
+

The Terminal tab in the console pane provides a convenient place within RStudio to interact directly with the command line.

+
+
+
+
+
+

Working directory

+

Knowing R’s current working directory is important because when you +need to access other files (for example, to import a data file), R will +look for them relative to the current working directory.

+

Each time you create a new RStudio Project, it will create a new +directory for that project. When you open an existing +.Rproj file, it will open that project and set R’s working +directory to the folder that file is in.

+
+
+ +
+
+

Challenge 5 +

+
+

You can check the current working directory with the +getwd() command, or by using the menus in RStudio.

+
  1. In the console, type getwd() (“wd” is short for “working directory”) and hit Enter.
  2. In the Files pane, double click on the data folder to open it (or navigate to any other folder you wish). To get the Files pane back to the current working directory, click “More” and then select “Go To Working Directory”.

You can change the working directory with setwd(), or by +using RStudio menus.

+
  1. In the console, type setwd("data") and hit Enter. Type getwd() and hit Enter to see the new working directory.
  2. In the menus at the top of the RStudio window, click the “Session” menu button, and then select “Set Working Directory” and then “Choose Directory”. Next, in the file browser window that opens, navigate back to the project directory, and click “Open”. Note that a setwd command will automatically appear in the console.
+
+
+
+
+ +
+
+

Tip: File does not exist errors +

+
+

When you’re attempting to reference a file in your R code and you’re +getting errors saying the file doesn’t exist, it’s a good idea to check +your working directory. You need to either provide an absolute path to +the file, or you need to make sure the file is saved in the working +directory (or a subfolder of the working directory) and provide a +relative path.
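A quick way to diagnose this kind of error is to ask R whether it can see the file from where it currently is. The sketch below assumes the file name and `data` folder used in this lesson; the messages are only illustrative.

```r
# A relative path is interpreted starting from getwd()
data_file <- file.path("data", "gapminder_data.csv")

if (file.exists(data_file)) {
  message("Found: ", normalizePath(data_file))
} else {
  message("Not found when looking from ", getwd(),
          "; check the working directory or the path")
}
```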

+
+
+
+
+
+

Version Control

+

It is important to use version control with projects. Go here +for a good lesson which describes using Git with RStudio.

+
+
+ +
+
+

Keypoints +

+
+
  • Use RStudio to create and manage projects with consistent +layout.
  • +
  • Treat raw data as read-only.
  • +
  • Treat generated output as disposable.
  • +
  • Separate function definition and application.
  • +
+
+
+
+
+
+ + +
+
+ + + diff --git a/instructor/03-seeking-help.html b/instructor/03-seeking-help.html new file mode 100644 index 000000000..1e66c24ff --- /dev/null +++ b/instructor/03-seeking-help.html @@ -0,0 +1,861 @@ + +R for Reproducible Scientific Analysis: Seeking Help +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Seeking Help

+

Last updated on 2023-10-26

+ + + +

Estimated time 20 minutes

+ +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I get help in R?
+
+
+
+
+
+

Objectives

+
  • To be able to read R help files for functions and special operators.
  • To be able to use CRAN task views to identify packages to solve a problem.
  • To be able to seek help from your peers.
+
+
+
+
+

Reading Help Files +

+

R, and every package, provides help files for functions. The general syntax to look up help for any function, “function_name”, that is in a package loaded into your namespace (your interactive R session) is:

+
+

R +

+
+?function_name
+help(function_name)
+
+

For example, take a look at the help file for write.table(); we will be using a similar function in an upcoming episode.

+
+

R +

+
+?write.table()
+
+

This will load up a help page in RStudio (or as plain text in R +itself).

+

Each help page is broken down into sections:

+
  • Description: An extended description of what the function does.
  • Usage: The arguments of the function and their default values (which can be changed).
  • Arguments: An explanation of the data each argument is expecting.
  • Details: Any important details to be aware of.
  • Value: The data the function returns.
  • See Also: Any related functions you might find useful.
  • Examples: Some examples for how to use the function.

Different functions might have different sections, but these are the +main ones you should be aware of.
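If you only need a quick reminder of what the Usage section shows, base R can print the same argument names and defaults without opening the help page. The sketch below uses write.table() purely as an example.

```r
# Print the argument list of write.table(), as shown under "Usage"
print(args(write.table))

# Extract the argument names and one default value programmatically
arg_names <- names(formals(write.table))
print(arg_names)

default_sep <- formals(write.table)$sep
print(default_sep)
```

This does not replace the help file: the Arguments and Details sections still explain what each argument means.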

+

Notice how related functions might call for the same help file:

+
+

R +

+
+?write.table()
+?write.csv()
+
+

This is because these functions have very similar applicability and +often share the same arguments as inputs to the function, so package +authors often choose to document them together in a single help +file.

+
+
+ +
+
+

Tip: Running Examples +

+
+

From within the function help page, you can highlight code in the Examples section and hit Ctrl+Return to run it in the RStudio console. This gives you a quick way to get a feel for how a function works.

+
+
+
+
+
+ +
+
+

Tip: Reading Help Files +

+
+

One of the most daunting aspects of R is the large number of functions available. It would be prohibitive, if not impossible, to remember the correct usage for every function you use. Luckily, using the help files means you don’t have to remember that!

+
+
+
+

Special Operators +

+

To seek help on special operators, use quotes or backticks:

+
+

R +

+
+?"<-"
+?`<-`
+
+

Getting Help with Packages +

+

Many packages come with “vignettes”: tutorials and extended example +documentation. Without any arguments, vignette() will list +all vignettes for all installed packages; +vignette(package="package-name") will list all available +vignettes for package-name, and +vignette("vignette-name") will open the specified +vignette.

+

If a package doesn’t have any vignettes, you can usually find help by typing help("package-name"), or list all of its documented topics with help(package = "package-name").
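As a small sketch of the vignette listing, the code below inspects the result of vignette() as data instead of opening a viewer. The utils package is used only because it ships with every R installation; any installed package name would do.

```r
# List the vignettes shipped with one installed package
v <- vignette(package = "utils")

# The listing is stored as a character matrix; show the name and title columns
print(v$results[, c("Item", "Title"), drop = FALSE])

# In an interactive session, help(package = "utils") lists the help topics
```

The Item column holds the names you would pass to vignette("vignette-name").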

+

RStudio also has a set of excellent cheatsheets for +many packages.

+

When You Remember Part of the Function Name +

+

If you’re not sure what package a function is in or how it’s +specifically spelled, you can do a fuzzy search:

+
+

R +

+
+??function_name
+
+

A fuzzy search is when you search for an approximate string match. +For example, you may remember that the function to set your working +directory includes “set” in its name. You can do a fuzzy search to help +you identify the function:

+
+

R +

+
+??set
+
+
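If the fuzzy search returns too many hits, a complementary approach (not part of the original lesson) is apropos(), which matches a regular expression against the names of objects in the attached packages. A minimal sketch:

```r
# Object names that start with "setw" in the attached packages
matches <- apropos("^setw")
print(matches)

# The same idea for names that end in "wd"
print(apropos("wd$"))
```

Unlike ??, apropos() only looks at names, not at help page titles or descriptions, so it is narrower but often quicker to scan.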

When You Have No Idea Where to Begin +

+

If you don’t know what function or package you need to use, CRAN Task Views is a specially maintained list of packages grouped into fields. This can be a good starting point.

+

When Your Code Doesn’t Work: Seeking Help from Your Peers +

+

If you’re having trouble using a function, 9 times out of 10 your question has already been answered on Stack Overflow. You can search using the [r] tag. Please make sure to see their page on how to ask a good question.

+

If you can’t find the answer, there are a few useful functions to +help you ask your peers:

+
+

R +

+
+?dput
+
+

The dput() function will dump the data you’re working with into a format that can be copied and pasted by others into their own R session.
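A short sketch of how this works in practice. The data frame cats_small is made up for this illustration and is not part of the lesson data.

```r
# A small illustrative data frame (made up for this example)
cats_small <- data.frame(coat = c("calico", "black"),
                         weight = c(2.1, 5.0))

# Print R code that recreates the object; paste this into your question
dput(cats_small)

# Anyone can rebuild the object by evaluating that text
code_text <- capture.output(dput(cats_small))
rebuilt <- eval(parse(text = code_text))
print(all.equal(rebuilt, cats_small))
```

For large objects, consider sharing only the first few rows, for example dput(head(cats_small)).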

+
+

R +

+
+sessionInfo()
+
+
+

OUTPUT +

+
R version 4.3.1 (2023-06-16)
+Platform: x86_64-pc-linux-gnu (64-bit)
+Running under: Ubuntu 22.04.3 LTS
+
+Matrix products: default
+BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
+LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
+
+locale:
+ [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
+ [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
+ [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
+[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
+
+time zone: UTC
+tzcode source: system (glibc)
+
+attached base packages:
+[1] stats     graphics  grDevices utils     datasets  methods   base     
+
+loaded via a namespace (and not attached):
+[1] compiler_4.3.1    tools_4.3.1       rstudioapi_0.15.0 yaml_2.3.7       
+[5] knitr_1.43        xfun_0.40         renv_1.0.3        evaluate_0.21    
+
+

The sessionInfo() function will print out your current version of R, as well as any packages you have loaded. This can be useful for others to help reproduce and debug your issue.
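When the full session information is more than a question needs, the pieces can be requested individually. This is a minimal sketch; utils is used only as an example of a package that is always installed.

```r
# One-line summary of the R version, handy for a short bug report
print(R.version.string)

# Version of a single package
utils_version <- packageVersion("utils")
print(utils_version)
```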

+
+
+ +
+
+

Challenge 1 +

+
+

Look at the help page for the c function. What kind of +vector do you expect will be created if you evaluate the following:

+
+

R +

+
+c(1, 2, 3)
+c('d', 'e', 'f')
+c(1, 2, 'f')
+
+
+
+
+
+
+ +
+
+

The c() function creates a vector, in which all elements +are of the same type. In the first case, the elements are numeric, in +the second, they are characters, and in the third they are also +characters: the numeric values are “coerced” to be characters.

+
+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

Look at the help for the paste function. You will need +to use it later. What’s the difference between the sep and +collapse arguments?

+
+
+
+
+
+ +
+
+

To look at the help for the paste() function, use:

+
+

R +

+
+help("paste")
+?paste
+
+

The difference between sep and collapse is +a little tricky. The paste function accepts any number of +arguments, each of which can be a vector of any length. The +sep argument specifies the string used between concatenated +terms — by default, a space. The result is a vector as long as the +longest argument supplied to paste. In contrast, +collapse specifies that after concatenation the elements +are collapsed together using the given separator, the result +being a single string.

+

It is important to call the arguments explicitly by typing out the argument name, e.g. sep = ",", so the function understands to use the “,” as a separator and not as a term to concatenate. For example:

+
+

R +

+
+paste(c("a","b"), "c")
+
+
+

OUTPUT +

+
[1] "a c" "b c"
+
+
+

R +

+
+paste(c("a","b"), "c", ",")
+
+
+

OUTPUT +

+
[1] "a c ," "b c ,"
+
+
+

R +

+
+paste(c("a","b"), "c", sep = ",")
+
+
+

OUTPUT +

+
[1] "a,c" "b,c"
+
+
+

R +

+
+paste(c("a","b"), "c", collapse = "|")
+
+
+

OUTPUT +

+
[1] "a c|b c"
+
+
+

R +

+
+paste(c("a","b"), "c", sep = ",", collapse = "|")
+
+
+

OUTPUT +

+
[1] "a,c|b,c"
+
+

(For more information, scroll to the bottom of the +?paste help page and look at the examples, or try +example('paste').)

+
+
+
+
+
+
+ +
+
+

Challenge 3 +

+
+

Use help to find a function (and its associated parameters) that you +could use to load data from a tabular file in which columns are +delimited with “\t” (tab) and the decimal point is a “.” (period). This +check for decimal separator is important, especially if you are working +with international colleagues, because different countries have +different conventions for the decimal point (i.e. comma vs period). +Hint: use ??"read table" to look up functions related to +reading in tabular data.

+
+
+
+
+
+ +
+
+

The standard R function for reading tab-delimited files with a period +decimal separator is read.delim(). You can also do this with +read.table(file, sep="\t") (the period is the +default decimal separator for read.table()), +although you may have to change the comment.char argument +as well if your data file contains hash (#) characters.
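The sketch below tries both approaches on a small temporary file. The file contents are invented for illustration only.

```r
# Write a tiny tab-delimited file to a temporary location
tmp <- tempfile(fileext = ".tsv")
writeLines(c("coat\tweight", "calico\t2.1", "black\t5.0"), tmp)

# read.delim() assumes a header row, tab separators and "." as decimal mark
from_delim <- read.delim(tmp)

# read.table() needs the header and separator spelled out
from_table <- read.table(tmp, header = TRUE, sep = "\t", dec = ".")

str(from_delim)
str(from_table)

# Clean up the temporary file
unlink(tmp)
```

Both calls should give a data frame with a character column and a numeric column.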

+
+
+
+
+

Other Resources +

+
+
+ +
+
+

Keypoints +

+
+
  • Use help() to get online help in R.
+
+
+
+
+ + +
+
+ + + diff --git a/instructor/04-data-structures-part1.html b/instructor/04-data-structures-part1.html new file mode 100644 index 000000000..6c12f00c1 --- /dev/null +++ b/instructor/04-data-structures-part1.html @@ -0,0 +1,2397 @@ + +R for Reproducible Scientific Analysis: Data Structures +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Data Structures

+

Last updated on 2023-10-26

+ + + +

Estimated time 55 minutes

+ +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I read data in R?
  • What are the basic data types in R?
  • How do I represent categorical information in R?
+
+
+
+
+
+

Objectives

+
  • To be able to identify the 5 main data types.
  • To begin exploring data frames, and understand how they are related to vectors and lists.
  • To be able to ask questions from R about the type, class, and structure of an object.
  • To understand the information of the attributes “names”, “class”, and “dim”.
+
+
+
+
+

One of R’s most powerful features is its ability to deal with tabular +data - such as you may already have in a spreadsheet or a CSV file. +Let’s start by making a toy dataset in your data/ +directory, called feline-data.csv:

+
+

R +

+
+cats <- data.frame(coat = c("calico", "black", "tabby"),
+                    weight = c(2.1, 5.0, 3.2),
+                    likes_string = c(1, 0, 1))
+
+

We can now save cats as a CSV file. It is good practice +to call the argument names explicitly so the function knows what default +values you are changing. Here we are setting +row.names = FALSE. Recall you can use +?write.csv to pull up the help file to check out the +argument names and their default values.

+
+

R +

+
+write.csv(x = cats, file = "data/feline-data.csv", row.names = FALSE)
+
+

The contents of the new file, feline-data.csv:

+
+

R +

+
coat,weight,likes_string
+calico,2.1,1
+black,5.0,0
+tabby,3.2,1
+
+
+
+ +
+
+

Tip: Editing Text files in R +

+
+

Alternatively, you can create data/feline-data.csv using +a text editor (Nano), or within RStudio with the File -> New +File -> Text File menu item.

+
+
+
+

We can load this into R via the following:

+
+

R +

+
+cats <- read.csv(file = "data/feline-data.csv")
+cats
+
+
+

OUTPUT +

+
    coat weight likes_string
+1 calico    2.1            1
+2  black    5.0            0
+3  tabby    3.2            1
+
+

The read.table function is used for reading in tabular data stored in a text file where the columns of data are separated by punctuation characters, as in CSV files (csv = comma-separated values). Tabs and commas are the most common punctuation characters used to separate or delimit data points in such files. For convenience, R provides two other versions of read.table: read.csv for files where the data are separated with commas, and read.delim for files where the data are separated with tabs. Of these three functions, read.csv is the most commonly used. If needed, it is possible to override the default delimiter for both read.csv and read.delim.
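As a sketch of overriding the defaults, the example below reads semicolon-separated text that uses a decimal comma. The data are invented, and the text argument is used so that no file is needed.

```r
# Semicolon-separated text with decimal commas, as used in some countries
raw_text <- "coat;weight\ncalico;2,1\nblack;5,0"

# Override the default separator and decimal mark of read.csv()
cats_semicolon <- read.csv(text = raw_text, sep = ";", dec = ",")
str(cats_semicolon)

# read.csv2() is the ready-made variant with these two defaults
cats_csv2 <- read.csv2(text = raw_text)
str(cats_csv2)
```

If the weight column came back as character rather than numeric, that would be a sign that the decimal mark was not recognised.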

+
+
+ +
+
+

Check your data for factors +

+
+

In recent times, the default way R handles textual data has changed. Text data used to be converted automatically into a format called “factors”, but there is a simpler format called “character”. We will hear about factors later, and what to use them for. For now, remember that in most cases they are not needed and only complicate your life, which is why newer R versions read in text as “character”. Check now whether your version of R has automatically created factors, and if so convert them to “character” format:

+
  1. Check the data types of your input by typing str(cats).
  2. In the output, look at the type abbreviations after the colons: If you see only “num”, “int” and “chr”, you can continue with the lesson and skip this box. If you find “Factor”, continue to step 3.
  3. Prevent R from automatically creating “factor” data. That can be done by the following code: options(stringsAsFactors = FALSE). Then, re-read the cats table for the change to take effect.
  4. You must set this option every time you restart R. So that you do not forget, include it in your analysis script before you read in any data, for example in one of the first lines.
  5. From R version 4.0.0 onwards, text data is no longer converted to factors by default. So you can install this or a newer version to avoid this problem. If you are working on an institute or company computer, ask your administrator to do it.
+
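If re-reading the file is not convenient, factor columns can also be converted after the fact. The sketch below builds a data frame with factors on purpose (via stringsAsFactors = TRUE) to imitate what an older R version would produce; cats_old is a hypothetical name used only here.

```r
# Imitate the old behaviour: text columns become factors
cats_old <- data.frame(coat = c("calico", "black", "tabby"),
                       weight = c(2.1, 5.0, 3.2),
                       stringsAsFactors = TRUE)
str(cats_old)

# Find the factor columns and convert each of them to character
is_factor <- sapply(cats_old, is.factor)
cats_old[is_factor] <- lapply(cats_old[is_factor], as.character)
str(cats_old)
```

Columns that were not factors, such as weight, are left untouched.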
+
+

We can begin exploring our dataset right away, pulling out columns by +specifying them using the $ operator:

+
+

R +

+
+cats$weight
+
+
+

OUTPUT +

+
[1] 2.1 5.0 3.2
+
+
+

R +

+
+cats$coat
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+

We can do other operations on the columns:

+
+

R +

+
+## Say we discovered that the scale weighs two kg light:
+cats$weight + 2
+
+
+

OUTPUT +

+
[1] 4.1 7.0 5.2
+
+
+

R +

+
+paste("My cat is", cats$coat)
+
+
+

OUTPUT +

+
[1] "My cat is calico" "My cat is black"  "My cat is tabby" 
+
+

But what about

+
+

R +

+
+cats$weight + cats$coat
+
+
+

ERROR +

+
Error in cats$weight + cats$coat: non-numeric argument to binary operator
+
+

Understanding what happened here is key to successfully analyzing +data in R.

+
+

Data Types

+

If you guessed that the last command will return an error because +2.1 plus "black" is nonsense, you’re right - +and you already have some intuition for an important concept in +programming called data types. We can ask what type of data +something is:

+
+

R +

+
+typeof(cats$weight)
+
+
+

OUTPUT +

+
[1] "double"
+
+

There are 5 main types: double, integer, +complex, logical and character. +For historic reasons, double is also called +numeric.

+
+

R +

+
+typeof(3.14)
+
+
+

OUTPUT +

+
[1] "double"
+
+
+

R +

+
+typeof(1L) # The L suffix forces the number to be an integer, since by default R stores numbers as doubles
+
+
+

OUTPUT +

+
[1] "integer"
+
+
+

R +

+
+typeof(1+1i)
+
+
+

OUTPUT +

+
[1] "complex"
+
+
+

R +

+
+typeof(TRUE)
+
+
+

OUTPUT +

+
[1] "logical"
+
+
+

R +

+
+typeof('banana')
+
+
+

OUTPUT +

+
[1] "character"
+
+

No matter how complicated our analyses become, all data in R is +interpreted as one of these basic data types. This strictness has some +really important consequences.

+

A user has added details of another cat. This information is in the +file data/feline-data_v2.csv.

+
+

R +

+
+file.show("data/feline-data_v2.csv")
+
+
+

R +

+
coat,weight,likes_string
+calico,2.1,1
+black,5.0,0
+tabby,3.2,1
+tabby,2.3 or 2.4,1
+
+

Load the new cats data like before, and check what type of data we +find in the weight column:

+
+

R +

+
+cats <- read.csv(file="data/feline-data_v2.csv")
+typeof(cats$weight)
+
+
+

OUTPUT +

+
[1] "character"
+
+

Oh no, our weights aren’t the double type anymore! If we try to do +the same math we did on them before, we run into trouble:

+
+

R +

+
+cats$weight + 2
+
+
+

ERROR +

+
Error in cats$weight + 2: non-numeric argument to binary operator
+
+

What happened? The cats data we are working with is something called a data frame. Data frames are one of the most common and versatile types of data structures we will work with in R. A given column in a data frame cannot be composed of different data types. In this case, R cannot read every value in the weight column as a double, so the entire column takes on a data type that suits every value in it: character.
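When a column that should be numeric arrives as character, it helps to find out which entries caused the problem. The following is one possible sketch, using the weight values from the file above typed in by hand; weights_raw and not_numeric are illustrative names.

```r
# The weight column as R read it: everything is text
weights_raw <- c("2.1", "5.0", "3.2", "2.3 or 2.4")

# Try the conversion; entries that are not numbers become NA
converted <- suppressWarnings(as.numeric(weights_raw))
not_numeric <- is.na(converted)

# Show the position and content of the offending entries
print(which(not_numeric))
print(weights_raw[not_numeric])
```

Note that a genuinely missing value would also show up as NA here, so inspect the reported entries before changing them.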

+

When R reads a csv file, it reads it in as a data frame. Thus, when we loaded the cats csv file, it was stored as a data frame. We can recognize data frames by the first line printed by the str() function:

+
+

R +

+
+str(cats)
+
+
+

OUTPUT +

+
'data.frame':	4 obs. of  3 variables:
+ $ coat        : chr  "calico" "black" "tabby" "tabby"
+ $ weight      : chr  "2.1" "5" "3.2" "2.3 or 2.4"
+ $ likes_string: int  1 0 1 1
+
+

Data frames are composed of rows and columns, where each +column has the same number of rows. Different columns in a data frame +can be made up of different data types (this is what makes them so +versatile), but everything in a given column needs to be the same type +(e.g., vector, factor, or list).

+

Let’s explore more about different data structures and how they +behave. For now, let’s remove that extra line from our cats data and +reload it, while we investigate this behavior further:

+

feline-data.csv:

+
coat,weight,likes_string
+calico,2.1,1
+black,5.0,0
+tabby,3.2,1
+

And back in RStudio:

+
+

R +

+
+cats <- read.csv(file="data/feline-data.csv")
+
+
+
+

Vectors and Type Coercion

+

To better understand this behavior, let’s meet another of the data +structures: the vector.

+
+

R +

+
+my_vector <- vector(length = 3)
+my_vector
+
+
+

OUTPUT +

+
[1] FALSE FALSE FALSE
+
+

A vector in R is essentially an ordered list of things, with the special condition that everything in the vector must be the same basic data type. If you don’t choose the data type, it’ll default to logical; alternatively, you can declare an empty vector of whatever type you like.

+
+

R +

+
+another_vector <- vector(mode='character', length=3)
+another_vector
+
+
+

OUTPUT +

+
[1] "" "" ""
+
+

You can check if something is a vector:

+
+

R +

+
+str(another_vector)
+
+
+

OUTPUT +

+
 chr [1:3] "" "" ""
+
+

The somewhat cryptic output from this command indicates the basic +data type found in this vector - in this case chr, +character; an indication of the number of things in the vector - +actually, the indexes of the vector, in this case [1:3]; +and a few examples of what’s actually in the vector - in this case empty +character strings. If we similarly do

+
+

R +

+
+str(cats$weight)
+
+
+

OUTPUT +

+
 num [1:3] 2.1 5 3.2
+
+

we see that cats$weight is a vector, too - the +columns of data we load into R data.frames are all vectors, and +that’s the root of why R forces everything in a column to be the same +basic data type.
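A quick way to confirm this is to ask R directly. The sketch below rebuilds a small cats-like data frame (cats_demo is a name chosen for this example) so that it runs on its own.

```r
# A stand-alone copy of the toy data
cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
                        weight = c(2.1, 5.0, 3.2),
                        stringsAsFactors = FALSE)

# Each column, pulled out with $, is an ordinary vector
weight_is_vector <- is.vector(cats_demo$weight)
print(weight_is_vector)

# ...and each has exactly one basic data type
print(typeof(cats_demo$weight))
print(typeof(cats_demo$coat))
```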

+
+
+ +
+
+

Discussion 1 +

+
+

Why is R so opinionated about what we put in our columns of data? How +does this help us?

+
+
+ +
+
+

By keeping everything in a column the same, we allow ourselves to +make simple assumptions about our data; if you can interpret one entry +in the column as a number, then you can interpret all of them +as numbers, so we don’t have to check every time. This consistency is +what people mean when they talk about clean data; in the long +run, strict consistency goes a long way to making our lives easier in +R.

+
+
+
+
+
+
+
+
+

Coercion by combining vectors

+

You can also make vectors with explicit contents with the combine +function:

+
+

R +

+
+combine_vector <- c(2,6,3)
+combine_vector
+
+
+

OUTPUT +

+
[1] 2 6 3
+
+

Given what we’ve learned so far, what do you think the following will +produce?

+
+

R +

+
+quiz_vector <- c(2,6,'3')
+
+

This is something called type coercion, and it is the source +of many surprises and the reason why we need to be aware of the basic +data types and how R will interpret them. When R encounters a mix of +types (here double and character) to be combined into a single vector, +it will force them all to be the same type. Consider:

+
+

R +

+
+coercion_vector <- c('a', TRUE)
+coercion_vector
+
+
+

OUTPUT +

+
[1] "a"    "TRUE"
+
+
+

R +

+
+another_coercion_vector <- c(0, TRUE)
+another_coercion_vector
+
+
+

OUTPUT +

+
[1] 0 1
+
+
+
+

The type hierarchy

+

The coercion rules go: logical -> +integer -> double (“numeric”) +-> complex -> character, where -> can +be read as are transformed into. For example, combining +logical and character transforms the result to +character:

+
+

R +

+
+c('a', TRUE)
+
+
+

OUTPUT +

+
[1] "a"    "TRUE"
+
+

A quick way to recognize character vectors is by the +quotes that enclose them when they are printed.
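The hierarchy can be checked step by step by combining one value from each pair of neighbouring types. This is a small sketch; the values themselves are arbitrary.

```r
# Each combination is promoted to the type further along the hierarchy
step1 <- typeof(c(TRUE, 1L))     # logical + integer
step2 <- typeof(c(1L, 2.5))      # integer + double
step3 <- typeof(c(2.5, 1+1i))    # double + complex
step4 <- typeof(c(1+1i, "a"))    # complex + character

print(c(step1, step2, step3, step4))
# [1] "integer"   "double"    "complex"   "character"
```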

+

You can try to force coercion against this flow using the +as. functions:

+
+

R +

+
+character_vector_example <- c('0','2','4')
+character_vector_example
+
+
+

OUTPUT +

+
[1] "0" "2" "4"
+
+
+

R +

+
+character_coerced_to_double <- as.double(character_vector_example)
+character_coerced_to_double
+
+
+

OUTPUT +

+
[1] 0 2 4
+
+
+

R +

+
+double_coerced_to_logical <- as.logical(character_coerced_to_double)
+double_coerced_to_logical
+
+
+

OUTPUT +

+
[1] FALSE  TRUE  TRUE
+
+

As you can see, some surprising things can happen when R forces one +basic data type into another! Nitty-gritty of type coercion aside, the +point is: if your data doesn’t look like what you thought it was going +to look like, type coercion may well be to blame; make sure everything +is the same type in your vectors and your columns of data.frames, or you +will get nasty surprises!
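A few more conversions against the flow, as a sketch of what to watch for. The inputs are invented; the point is that failed conversions give NA rather than an error.

```r
# Text that is not a number becomes NA (R also emits a warning)
not_a_number <- suppressWarnings(as.numeric(c("3.5", "abc")))
print(not_a_number)

# Converting to integer drops the fractional part instead of rounding
truncated <- as.integer(3.9)
print(truncated)

# Only certain spellings are recognised as logical values
recognised <- as.logical(c("TRUE", "T", "yes"))
print(recognised)
```

Checking for unexpected NA values with anyNA() after a conversion is a cheap safeguard.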

+

But coercion can also be very useful! For example, in our +cats data likes_string is numeric, but we know +that the 1s and 0s actually represent TRUE and +FALSE (a common way of representing them). We should use +the logical datatype here, which has two states: +TRUE or FALSE, which is exactly what our data +represents. We can ‘coerce’ this column to be logical by +using the as.logical function:

+
+

R +

+
+cats$likes_string
+
+
+

OUTPUT +

+
[1] 1 0 1
+
+
+

R +

+
+cats$likes_string <- as.logical(cats$likes_string)
+cats$likes_string
+
+
+

OUTPUT +

+
[1]  TRUE FALSE  TRUE
+
+
+
+ +
+
+

Challenge 1 +

+
+

An important part of every data analysis is cleaning the input data. If you know that the input data is all of the same format (e.g. numbers), your analysis is much easier! Clean the cat data set from the chapter about type coercion.

+
+

Copy the code template

+

Create a new script in RStudio and copy and paste the following code. +Then move on to the tasks below, which help you to fill in the gaps +(______).

+
# Read data
+cats <- read.csv("data/feline-data_v2.csv")
+
+# 1. Print the data
+_____
+
+# 2. Show an overview of the table with all data types
+_____(cats)
+
+# 3. The "weight" column has the incorrect data type __________.
+#    The correct data type is: ____________.
+
+# 4. Correct the 4th weight data point with the mean of the two given values
+cats$weight[4] <- 2.35
+#    print the data again to see the effect
+cats
+
+# 5. Convert the weight to the right data type
+cats$weight <- ______________(cats$weight)
+
+#    Calculate the mean to test yourself
+mean(cats$weight)
+
+# If you see the correct mean value (and not NA), you did the exercise
+# correctly!
+
+
+

Instructions for the tasks

+
+ +

Execute the first statement (read.csv(...)). Then print +the data to the console

+
+
+
+
+
+
+
+ +
+
+

Show the content of any variable by typing its name.

+
+

Solution to Challenge 1.1

+

Two correct solutions:

+
cats
+print(cats)
+
+
+
+
+
+
+
+ +
+
+

2. Overview of the data types +

+
+

The data type of your data is as important as the data itself. Use a +function we saw earlier to print out the data types of all columns of +the cats table.

+
+
+
+
+
+ +
+
+

In the chapter “Data types” we saw two functions that can show data +types. One printed just a single word, the data type name. The other +printed a short form of the data type, and the first few values. We need +the second here.

+
+
+
+
+
+
+ +
+
+

Challenge 1 (continued) +

+
+
+

Solution to Challenge 1.2

+
str(cats)
+
+
+

3. Which data type do we need?

+

The shown data type is not the right one for this data (weight of a +cat). Which data type do we need?

+
  • Why did the read.csv() function not choose the correct data type?
  • Fill in the gap in the comment with the correct data type for cat weight!
+
+
+
+
+
+ +
+
+

Scroll up to the section about the type +hierarchy to review the available data types

+
+
+
+
+
+
+ +
+
+
  • Weight is expressed on a continuous scale (real numbers). The R data type for this is “double” (also known as “numeric”).
  • The fourth row has the value “2.3 or 2.4”. That is not a single number but two numbers joined by an English word. Therefore, the “character” data type is chosen. The whole column is now text, because all values in the same column have to be the same data type.
+
+
+
+
+
+ +
+
+

4. Correct the problematic value +

+
+

The code to assign a new weight value to the problematic fourth row +is given. Think first and then execute it: What will be the data type +after assigning a number like in this example? You can check the data +type after executing to see if you were right.

+
+
+
+
+
+ +
+
+

Revisit the hierarchy of data types when two different data types are +combined.

+
+
+
+
+
+
+ +
+
+

Challenge 1 (continued) +

+
+
+

Solution to challenge 1.4

+

The data type of the column “weight” is “character”. The assigned +data type is “double”. Combining two data types yields the data type +that is higher in the following hierarchy:

+
logical < integer < double < complex < character
+

Therefore, the column is still of type character! We need to manually convert it to “double”.
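The same effect can be reproduced on a plain vector. This is a sketch with the weight values typed in by hand; w is an illustrative name.

```r
# The weight column as read from the second file: all text
w <- c("2.1", "5", "3.2", "2.3 or 2.4")

# Assign a number into the character vector
w[4] <- 2.35

# The number was coerced to text; the vector did not become numeric
print(typeof(w))
print(w)

# Only an explicit conversion yields doubles
w_numeric <- as.double(w)
print(mean(w_numeric))
```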

+
+
+

5. Convert the column “weight” to the correct data type

+

Cat weights are numbers, but the column does not have this data type yet. Coerce the column to floating point numbers.

+
+
+
+
+
+
+ +
+
+

The functions to convert data types start with as.. You +can look for the function further up in the manuscript or use the +RStudio auto-complete function: Type “as.” and then press +the TAB key.

+
+
+
+
+
+
+ +
+
+

Challenge 1 (continued) +

+
+
+

Solution to Challenge 1.5

+

There are two functions that are synonymous for historic reasons:

+
cats$weight <- as.double(cats$weight)
+cats$weight <- as.numeric(cats$weight)
+
+
+
+
+
+
+
+

Some basic vector functions

+

The combine function, c(), will also append things to an +existing vector:

+
+

R +

+
+ab_vector <- c('a', 'b')
+ab_vector
+
+
+

OUTPUT +

+
[1] "a" "b"
+
+
+

R +

+
+combine_example <- c(ab_vector, 'SWC')
+combine_example
+
+
+

OUTPUT +

+
[1] "a"   "b"   "SWC"
+
+

You can also make series of numbers:

+
+

R +

+
+mySeries <- 1:10
+mySeries
+
+
+

OUTPUT +

+
 [1]  1  2  3  4  5  6  7  8  9 10
+
+
+

R +

+
+seq(10)
+
+
+

OUTPUT +

+
 [1]  1  2  3  4  5  6  7  8  9 10
+
+
+

R +

+
+seq(1,10, by=0.1)
+
+
+

OUTPUT +

+
 [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4
+[16]  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9
+[31]  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4
+[46]  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
+[61]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4
+[76]  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9
+[91] 10.0
+
+
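Besides a step size, seq() can be given the desired number of values, and base R has two helpers that are convenient for counting. These are not used elsewhere in this lesson; the sketch is only meant to show the options.

```r
# Five evenly spaced values between 0 and 1
evenly <- seq(0, 1, length.out = 5)
print(evenly)

# The integers 1..n for a given n
print(seq_len(3))

# One index per element of an existing vector
coats <- c("calico", "black", "tabby")
print(seq_along(coats))

# Unlike 1:0, seq_len(0) is empty, which is safer in loops
print(length(seq_len(0)))
```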

We can ask a few questions about vectors:

+
+

R +

+
+sequence_example <- 20:25
+head(sequence_example, n=2)
+
+
+

OUTPUT +

+
[1] 20 21
+
+
+

R +

+
+tail(sequence_example, n=4)
+
+
+

OUTPUT +

+
[1] 22 23 24 25
+
+
+

R +

+
+length(sequence_example)
+
+
+

OUTPUT +

+
[1] 6
+
+
+

R +

+
+typeof(sequence_example)
+
+
+

OUTPUT +

+
[1] "integer"
+
+

We can get individual elements of a vector by using the bracket +notation:

+
+

R +

+
+first_element <- sequence_example[1]
+first_element
+
+
+

OUTPUT +

+
[1] 20
+
+

To change a single element, use the bracket on the other side of the +arrow:

+
+

R +

+
+sequence_example[1] <- 30
+sequence_example
+
+
+

OUTPUT +

+
[1] 30 21 22 23 24 25
+
+
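Assigning a single element can quietly change the type of the whole vector, which ties back to type coercion. A short sketch, with variable names chosen for this example:

```r
# A sequence made with : is stored as integer
int_seq <- 20:25
type_before <- typeof(int_seq)

# 30 without the L suffix is a double, so the whole vector is promoted
int_seq[1] <- 30
type_after <- typeof(int_seq)
print(c(type_before, type_after))

# Using the L suffix keeps the vector an integer vector
still_int <- 20:25
still_int[1] <- 30L
print(typeof(still_int))
```

For most analyses the difference does not matter, but it explains why typeof() may report "double" after an edit.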
+
+ +
+
+

Challenge 2 +

+
+

Start by making a vector with the numbers 1 through 26. Then, +multiply the vector by 2.

+
+
+
+
+
+ +
+
+
+

R +

+
+x <- 1:26
+x <- x * 2
+
+
+
+
+
+
+
+

Lists

+

Another data structure you’ll want in your bag of tricks is the +list. A list is simpler in some ways than the other types, +because you can put anything you want in it. Remember everything in +the vector must be of the same basic data type, but a list can have +different data types:

+
+

R +

+
+list_example <- list(1, "a", TRUE, 1+4i)
+list_example
+
+
+

OUTPUT +

+
[[1]]
+[1] 1
+
+[[2]]
+[1] "a"
+
+[[3]]
+[1] TRUE
+
+[[4]]
+[1] 1+4i
+
+

When printing the object structure with str(), we see +the data types of all elements:

+
+

R +

+
+str(list_example)
+
+
+

OUTPUT +

+
List of 4
+ $ : num 1
+ $ : chr "a"
+ $ : logi TRUE
+ $ : cplx 1+4i
+
+

What is the use of lists? They can organize data of different +types. For example, you can organize different tables that +belong together, similar to spreadsheets in Excel. But there are many +other uses, too.

+

We will see another example that will maybe surprise you in the next +chapter.

+

To retrieve one of the elements of a list, use the double +bracket:

+
+

R +

+
+list_example[[2]]
+
+
+

OUTPUT +

+
[1] "a"
+
+

The elements of lists can also have names, which are given by prepending them to the values, separated by an equals sign:

+
+

R +

+
+another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE )
+another_list
+
+
+

OUTPUT +

+
$title
+[1] "Numbers"
+
+$numbers
+ [1]  1  2  3  4  5  6  7  8  9 10
+
+$data
+[1] TRUE
+
+

This results in a named list. Names give us an additional way to access single elements:

+
+

R +

+
+another_list$title
+
+
+

OUTPUT +

+
[1] "Numbers"
+
+
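The different ways of reaching into a named list return slightly different things. The sketch below repeats the list from above so that it runs on its own.

```r
another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE)

# $ and [[ ]] both return the element itself
by_dollar <- another_list$title
by_double <- another_list[["title"]]
print(identical(by_dollar, by_double))

# Single brackets return a shorter list that still wraps the element
by_single <- another_list["title"]
print(typeof(by_single))

# Access can be chained: the third value of the "numbers" element
print(another_list[["numbers"]][3])
```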
+

Names +

+

With names, we can give meaning to elements. This is the first time that we have not only the data, but also information that explains it. It is metadata that can be stuck to the object like a label. In R, this is called an attribute. Some attributes enable us to do more with our object, for example, as here, accessing an element by a self-defined name.
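Attributes can be inspected and set directly. In the sketch below, the vector prices and the "currency" attribute are invented for illustration; only "names" has a special meaning to R here.

```r
# A named vector carries its names as an attribute
prices <- c(small = 4.5, large = 6.6)
print(attributes(prices))

# Read one attribute by name
print(attr(prices, "names"))

# Attach a custom label of our own (R itself ignores this one)
attr(prices, "currency") <- "EUR"
print(names(attributes(prices)))
```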

+
+

Accessing vectors and lists by name

+

We have already seen how to generate a named list. The way to +generate a named vector is very similar. You have seen this function +before:

+
+

R +

+
+pizza_price <- c( pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50 )
+
+

The way to retrieve elements is different, though:

+
+

R +

+
+pizza_price["pizzasubito"]
+
+
+

OUTPUT +

+
pizzasubito 
+       5.64 
+
+

The approach used for the list does not work:

+
+

R +

+
+pizza_price$pizzafresh
+
+
+

ERROR +

+
Error in pizza_price$pizzafresh: $ operator is invalid for atomic vectors
+
+

It pays to remember this error message, because you will meet it in your own analyses. It means that you have just tried to access an element as if it were in a list, but it is actually in a vector.
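If you are unsure whether you hold a vector or a list, double brackets with a name work for both. A minimal sketch, with pizza_list as a hypothetical list version of the same prices:

```r
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)
pizza_list <- as.list(pizza_price)

# [[ ]] extracts a single element from a vector...
from_vector <- pizza_price[["pizzafresh"]]
# ...and from a list
from_list <- pizza_list[["pizzafresh"]]
print(c(from_vector, from_list))

# $ only works on the list
print(pizza_list$pizzafresh)
```

Note that pizza_price[["pizzafresh"]] returns the bare value, whereas pizza_price["pizzafresh"] keeps the name attached.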

+
+
+

Accessing and changing names

+

If you are only interested in the names, use the names() +function:

+
+

R +

+
+names(pizza_price)
+
+
+

OUTPUT +

+
[1] "pizzasubito" "pizzafresh"  "callapizza" 
+
+

We have seen how to access and change single elements of a vector. +The same is possible for names:

+
+

R +

+
+names(pizza_price)[3]
+
+
+

OUTPUT +

+
[1] "callapizza"
+
+
+

R +

+
+names(pizza_price)[3] <- "call-a-pizza"
+pizza_price
+
+
+

OUTPUT +

+
 pizzasubito   pizzafresh call-a-pizza 
+        5.64         6.60         4.50 
+
+
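Renaming by position breaks if the order of the elements changes. A sketch of an alternative that looks the old name up instead:

```r
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)

# Find the element by its current name rather than by its position
is_target <- names(pizza_price) == "callapizza"
names(pizza_price)[is_target] <- "call-a-pizza"

print(names(pizza_price))
print(pizza_price["call-a-pizza"])
```

If no element has the old name, nothing is renamed and no error is raised, so it can be worth checking any(is_target) first.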
+
+ +
+
+

Challenge 3 +

+
+
  • What is the data type of the names of pizza_price? You can find out using the str() or typeof() functions.
+
+
+
+
+ +
+
+

You get the names of an object by wrapping the object name inside +names(...). Similarly, you get the data type of the names +by again wrapping the whole code in typeof(...):

+
typeof(names(pizza_price))
+

Alternatively, use a new variable if this is easier for you to read:

+
n <- names(pizza_price)
+typeof(n)
+
+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

Instead of just changing some of the names a vector/list already has, +you can also set all names of an object by writing code like (replace +ALL CAPS text):

+
names( OBJECT ) <-  CHARACTER_VECTOR
+

Create a vector that gives the number for each letter in the +alphabet!

+
  1. Generate a vector called letter_no with the sequence of +numbers from 1 to 26!
  2. +
  3. R has a built-in object called LETTERS. It is a +character vector of the 26 letters from A to Z. Set the names of the number sequence +to these 26 letters.
  4. +
  5. Test yourself by calling letter_no["B"], which should +give you the number 2!
  6. +
+
+
+
+
+ +
+
+
letter_no <- 1:26   # or seq(1,26)
+names(letter_no) <- LETTERS
+letter_no["B"]
+
+
+
+
+
+

Data frames +

+

We met data frames at the very beginning of this lesson; they +represent a table of data. We didn’t go into much detail with +our example cat data frame:

+
+

R +

+
+cats
+
+
+

OUTPUT +

+
    coat weight likes_string
+1 calico    2.1         TRUE
+2  black    5.0        FALSE
+3  tabby    3.2         TRUE
+
+

We can now understand something a bit surprising in our data.frame; +what happens if we run:

+
+

R +

+
+typeof(cats)
+
+
+

OUTPUT +

+
[1] "list"
+
+

We see that data.frames look like lists ‘under the hood’. Think back +to what we learned about what lists can be used for:

+
+

Lists organize data of different types

+
+

Columns of a data frame are vectors of different types, which are +organized by belonging to the same table.

+

A data.frame is really a list of vectors. It is a special list in +which all the vectors must have the same length.

+

How is this “special”-ness written into the object, so that R does +not treat it like any other list, but as a table?

+
+

R +

+
+class(cats)
+
+
+

OUTPUT +

+
[1] "data.frame"
+
+

A class, just like names, is an attribute attached +to the object. It tells us what this object means for humans.

+

You might wonder: why do we need another +what-type-of-object-is-this function when we already have +typeof()? That function tells us how the object is +constructed in the computer. The class is +the meaning of the object for humans. Consequently, +what typeof() returns is fixed in R (mainly the +five data types), whereas the output of class() is +diverse and extendable by R packages.
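A minimal sketch of the difference, assuming a `cats` data frame rebuilt by hand with the same values as above: removing the class attribute with `unclass()` leaves the underlying list untouched, and only the "meaning for humans" is gone. The name `plain` is made up for this illustration.

```r
# Rebuild the example data frame so the sketch is self-contained
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(TRUE, FALSE, TRUE))

typeof(cats)             # how the object is stored
class(cats)              # what the object means
names(attributes(cats))  # the labels attached to the object

# Strip the class attribute: the storage type stays the same
plain <- unclass(cats)
typeof(plain)
class(plain)
```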

+

In our cats example, we have a character, a double and a +logical variable. As we have seen already, each column of a data.frame is +a vector.

+
+

R +

+
+cats$coat
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+
+

R +

+
+cats[,1]
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+
+

R +

+
+typeof(cats[,1])
+
+
+

OUTPUT +

+
[1] "character"
+
+
+

R +

+
+str(cats[,1])
+
+
+

OUTPUT +

+
 chr [1:3] "calico" "black" "tabby"
+
+

Each row is an observation of different variables, itself a +data.frame, and thus can be composed of elements of different types.

+
+

R +

+
+cats[1,]
+
+
+

OUTPUT +

+
    coat weight likes_string
+1 calico    2.1         TRUE
+
+
+

R +

+
+typeof(cats[1,])
+
+
+

OUTPUT +

+
[1] "list"
+
+
+

R +

+
+str(cats[1,])
+
+
+

OUTPUT +

+
'data.frame':	1 obs. of  3 variables:
+ $ coat        : chr "calico"
+ $ weight      : num 2.1
+ $ likes_string: logi TRUE
+
+
+
+ +
+
+

Challenge 5 +

+
+

There are several subtly different ways to call variables, +observations and elements from data.frames:

+
  • cats[1]
  • +
  • cats[[1]]
  • +
  • cats$coat
  • +
  • cats["coat"]
  • +
  • cats[1, 1]
  • +
  • cats[, 1]
  • +
  • cats[1, ]
  • +

Try out these examples and explain what is returned by each one.

+

Hint: Use the function typeof() to examine what +is returned in each case.

+
+
+
+
+
+ +
+
+
+

R +

+
+cats[1]
+
+
+

OUTPUT +

+
    coat
+1 calico
+2  black
+3  tabby
+
+

We can think of a data frame as a list of vectors. The single bracket +[1] returns the first slice of the list, as another list. +In this case it is the first column of the data frame.

+
+

R +

+
+cats[[1]]
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+

The double bracket [[1]] returns the contents of the list +item. In this case it is the contents of the first column, a +vector of type character.

+
+

R +

+
+cats$coat
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+

This example uses the $ character to address items by +name. coat is the first column of the data frame, again a +vector of type character.

+
+

R +

+
+cats["coat"]
+
+
+

OUTPUT +

+
    coat
+1 calico
+2  black
+3  tabby
+
+

Here we are using a single bracket ["coat"], replacing the +index number with the column name. Like example 1, the returned object +is a list.

+
+

R +

+
+cats[1, 1]
+
+
+

OUTPUT +

+
[1] "calico"
+
+

This example uses a single bracket, but this time we provide row and +column coordinates. The returned object is the value in row 1, column 1. +The object is a vector of type character.

+
+

R +

+
+cats[, 1]
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+

Like the previous example, we use single brackets and provide row and +column coordinates. The row coordinate is not specified, so R interprets +this missing value as all the elements in this column and +returns them as a vector.

+
+

R +

+
+cats[1, ]
+
+
+

OUTPUT +

+
    coat weight likes_string
+1 calico    2.1         TRUE
+
+

Again we use the single bracket with row and column coordinates. The +column coordinate is not specified. The return value is a list +containing all the values in the first row.
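The seven cases can be compared side by side. This is only a sketch: it rebuilds `cats` by hand (with `stringsAsFactors = FALSE` so that it behaves the same on older R versions) and collects each result in a list called `pieces`, a name invented for this example.

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(TRUE, FALSE, TRUE),
                   stringsAsFactors = FALSE)

# Store the result of every subsetting form under a descriptive name
pieces <- list(
  "cats[1]"      = cats[1],
  "cats[[1]]"    = cats[[1]],
  "cats$coat"    = cats$coat,
  "cats['coat']" = cats["coat"],
  "cats[1, 1]"   = cats[1, 1],
  "cats[, 1]"    = cats[, 1],
  "cats[1, ]"    = cats[1, ]
)

# Storage type and (first) class of each result
sapply(pieces, typeof)
sapply(pieces, function(p) class(p)[1])
```

The forms that report `"list"` are the ones that return a (smaller) data frame; the others return a plain character vector.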

+
+
+
+
+
+
+ +
+
+

Tip: Renaming data frame columns +

+
+

Data frames have column names, which can be accessed with the +names() function.

+
+

R +

+
+names(cats)
+
+
+

OUTPUT +

+
[1] "coat"         "weight"       "likes_string"
+
+

If you want to rename the second column of cats, you can +assign a new name to the second element of names(cats).

+
+

R +

+
+names(cats)[2] <- "weight_kg"
+cats
+
+
+

OUTPUT +

+
    coat weight_kg likes_string
+1 calico       2.1         TRUE
+2  black       5.0        FALSE
+3  tabby       3.2         TRUE
+
+
+
+
+
+

Matrices

+

Last but not least is the matrix. We can declare a matrix full of +zeros:

+
+

R +

+
+matrix_example <- matrix(0, ncol=6, nrow=3)
+matrix_example
+
+
+

OUTPUT +

+
     [,1] [,2] [,3] [,4] [,5] [,6]
+[1,]    0    0    0    0    0    0
+[2,]    0    0    0    0    0    0
+[3,]    0    0    0    0    0    0
+
+

What makes it special is the dim() attribute:

+
+

R +

+
+dim(matrix_example)
+
+
+

OUTPUT +

+
[1] 3 6
+
+

And similar to other data structures, we can ask things about our +matrix:

+
+

R +

+
+typeof(matrix_example)
+
+
+

OUTPUT +

+
[1] "double"
+
+
+

R +

+
+class(matrix_example)
+
+
+

OUTPUT +

+
[1] "matrix" "array" 
+
+
+

R +

+
+str(matrix_example)
+
+
+

OUTPUT +

+
 num [1:3, 1:6] 0 0 0 0 0 0 0 0 0 0 ...
+
+
+

R +

+
+nrow(matrix_example)
+
+
+

OUTPUT +

+
[1] 3
+
+
+

R +

+
+ncol(matrix_example)
+
+
+

OUTPUT +

+
[1] 6
+
+
+
+ +
+
+

Challenge 6 +

+
+

What do you think will be the result of +length(matrix_example)? Try it. Were you right? Why / why +not?

+
+
+
+
+
+ +
+
+


+
+

R +

+
+matrix_example <- matrix(0, ncol=6, nrow=3)
+length(matrix_example)
+
+
+

OUTPUT +

+
[1] 18
+
+

Because a matrix is a vector with added dimension attributes, +length gives you the total number of elements in the +matrix.
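To illustrate the point, here is a short sketch that builds a matrix by attaching a `dim` attribute to an ordinary vector. The variable names `v` and `m` are arbitrary.

```r
# Start from a plain vector of 18 numbers
v <- 1:18
length(v)

# Adding a dim attribute turns it into a 3 x 6 matrix
dim(v) <- c(3, 6)
is.matrix(v)
length(v)   # still the total number of elements

# A matrix created with matrix() carries the same attribute
m <- matrix(0, ncol = 6, nrow = 3)
attributes(m)
```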

+
+
+
+
+
+
+ +
+
+

Challenge 7 +

+
+

Make another matrix, this time containing the numbers 1:50, with 5 +columns and 10 rows. Did the matrix function fill your +matrix by column, or by row, as its default behaviour? See if you can +figure out how to change this. (hint: read the documentation for +matrix!)

+
+
+
+
+
+ +
+
+


+
+

R +

+
+x <- matrix(1:50, ncol=5, nrow=10)
+x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row
+
+
+
+
+
+
+
+ +
+
+

Challenge 8 +

+
+

Create a list of length two containing a character vector for each of +the sections in this part of the workshop:

+
  • Data types
  • +
  • Data structures
  • +

Populate each character vector with the names of the data types and +data structures we’ve seen so far.

+
+
+
+
+
+ +
+
+
+

R +

+
+dataTypes <- c('double', 'complex', 'integer', 'character', 'logical')
+dataStructures <- c('data.frame', 'vector', 'list', 'matrix')
+answer <- list(dataTypes, dataStructures)
+
+

Note: it’s nice to make a list in big writing on the board or taped +to the wall listing all of these types and structures - leave it up for +the rest of the workshop to remind people of the importance of these +basics.

+
+
+
+
+
+
+ +
+
+

Challenge 9 +

+
+

Consider the R output of the matrix below:

+
+

OUTPUT +

+
     [,1] [,2]
+[1,]    4    1
+[2,]    9    5
+[3,]   10    7
+
+

What was the correct command used to write this matrix? Examine each +command and try to figure out the correct one before typing them. Think +about what matrices the other commands will produce.

+
  1. matrix(c(4, 1, 9, 5, 10, 7), nrow = 3)
  2. +
  3. matrix(c(4, 9, 10, 1, 5, 7), ncol = 2, byrow = TRUE)
  4. +
  5. matrix(c(4, 9, 10, 1, 5, 7), nrow = 2)
  6. +
  7. matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)
  8. +
+
+
+
+
+ +
+
+

Consider the R output of the matrix below:

+
+

OUTPUT +

+
     [,1] [,2]
+[1,]    4    1
+[2,]    9    5
+[3,]   10    7
+
+

What was the correct command used to write this matrix? Examine each +command and try to figure out the correct one before typing them. Think +about what matrices the other commands will produce.

+
+

R +

+
+matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)
+
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
  • Use read.csv to read tabular data in R.
  • +
  • The basic data types in R are double, integer, complex, logical, and +character.
  • +
  • Data structures such as data frames or matrices are built on top of +lists and vectors, with some added attributes.
  • +
+
+
+
+
+
+ + +
+
+ + + diff --git a/instructor/05-data-structures-part2.html b/instructor/05-data-structures-part2.html new file mode 100644 index 000000000..7e77d7ef2 --- /dev/null +++ b/instructor/05-data-structures-part2.html @@ -0,0 +1,1210 @@ + +R for Reproducible Scientific Analysis: Exploring Data Frames +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Exploring Data Frames

+

Last updated on 2023-10-26

+ + + +

Estimated time 30 minutes

+ +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I manipulate a data frame?
  • +
+
+
+
+
+
+

Objectives

+
  • Add and remove rows or columns.
  • +
  • Append two data frames.
  • +
  • Display basic properties of data frames including size and class of +the columns, names, and first few rows.
  • +
+
+
+
+
+

At this point, you’ve seen it all: in the last lesson, we toured all +the basic data types and data structures in R. Everything you do will be +a manipulation of those tools. But most of the time, the star of the +show is the data frame—the table that we created by loading information +from a csv file. In this lesson, we’ll learn a few more things about +working with data frames.

+

Adding columns and rows in data frames +

+

We already learned that the columns of a data frame are vectors, so +that our data are consistent in type throughout the columns. As such, if +we want to add a new column, we can start by making a new vector:

+
+

R +

+
+age <- c(2, 3, 5)
+cats
+
+
+

OUTPUT +

+
    coat weight likes_string
+1 calico    2.1            1
+2  black    5.0            0
+3  tabby    3.2            1
+
+

We can then add this as a column via:

+
+

R +

+
+cbind(cats, age)
+
+
+

OUTPUT +

+
    coat weight likes_string age
+1 calico    2.1            1   2
+2  black    5.0            0   3
+3  tabby    3.2            1   5
+
+

Note that if we tried to add a vector of ages with a different number +of entries than the number of rows in the data frame, it would fail:

+
+

R +

+
+age <- c(2, 3, 5, 12)
+cbind(cats, age)
+
+
+

ERROR +

+
Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 4
+
+
+

R +

+
+age <- c(2, 3)
+cbind(cats, age)
+
+
+

ERROR +

+
Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 2
+
+

Why didn’t this work? Of course, R wants to see one element in our +new column for every row in the table:

+
+

R +

+
+nrow(cats)
+
+
+

OUTPUT +

+
[1] 3
+
+
+

R +

+
+length(age)
+
+
+

OUTPUT +

+
[1] 2
+
+

So for it to work we need to have nrow(cats) = +length(age). Let’s overwrite the content of cats with our +new data frame.

+
+

R +

+
+age <- c(2, 3, 5)
+cats <- cbind(cats, age)
+
+
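One way to avoid the error is to compare the lengths before binding. The following is a sketch under the assumption that `cats` has the three rows shown earlier; it is rebuilt here by hand so the block runs on its own.

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))
age <- c(2, 3, 5)

# Only bind the new column if it has one value per row
if (length(age) == nrow(cats)) {
  cats <- cbind(cats, age)
} else {
  stop("age needs ", nrow(cats), " values, but has ", length(age))
}
cats
```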

Now how about adding rows? We already know that the rows of a data +frame are lists:

+
+

R +

+
+newRow <- list("tortoiseshell", 3.3, TRUE, 9)
+cats <- rbind(cats, newRow)
+
+

Let’s confirm that our new row was added correctly.

+
+

R +

+
+cats
+
+
+

OUTPUT +

+
           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+
+

Removing rows +

+

We now know how to add rows and columns to our data frame in R. Now +let’s learn to remove rows.

+
+

R +

+
+cats
+
+
+

OUTPUT +

+
           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+
+

We can ask for a data frame minus the last row:

+
+

R +

+
+cats[-4, ]
+
+
+

OUTPUT +

+
    coat weight likes_string age
+1 calico    2.1            1   2
+2  black    5.0            0   3
+3  tabby    3.2            1   5
+
+

Notice the comma with nothing after it to indicate that we want to +drop the entire fourth row.

+

Note: we could also remove several rows at once by putting the row +numbers inside of a vector, for example: +cats[c(-3,-4), ]
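A short sketch of both spellings, using a hand-built copy of the four-row `cats` data frame. The last line previews removing rows by a condition instead of by position, which is covered properly in the next episode; the names `by_each` and `by_vector` are illustrative.

```r
cats <- data.frame(coat = c("calico", "black", "tabby", "tortoiseshell"),
                   weight = c(2.1, 5.0, 3.2, 3.3),
                   likes_string = c(1, 0, 1, 1),
                   age = c(2, 3, 5, 9))

# Two equivalent ways to drop rows 3 and 4
by_each   <- cats[c(-3, -4), ]
by_vector <- cats[-c(3, 4), ]
identical(by_each, by_vector)

# Preview: keep only the rows that satisfy a condition
cats[cats$weight < 5, ]
```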

+

Removing columns +

+

We can also remove columns from our data frame. What if we want to +remove the column “age”? We can remove it in two ways: by index +or by name.

+
+

R +

+
+cats[,-4]
+
+
+

OUTPUT +

+
           coat weight likes_string
+1        calico    2.1            1
+2         black    5.0            0
+3         tabby    3.2            1
+4 tortoiseshell    3.3            1
+
+

Notice the comma with nothing before it, indicating we want to keep +all of the rows.

+

Alternatively, we can drop the column by using its name and the +%in% operator. The %in% operator goes through +each element of its left argument, in this case the names of +cats, and asks, “Does this element occur in the second +argument?”

+
+

R +

+
+drop <- names(cats) %in% c("age")
+cats[,!drop]
+
+
+

OUTPUT +

+
           coat weight likes_string
+1        calico    2.1            1
+2         black    5.0            0
+3         tabby    3.2            1
+4 tortoiseshell    3.3            1
+
+

We will cover subsetting with logical operators like +%in% in more detail in the next episode. See the section Subsetting through other logical +operations
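The same pattern extends to several columns at once. This sketch rebuilds `cats` by hand and also shows `setdiff()` as an alternative way to express "all names except these"; the result names `kept_by_mask` and `kept_by_name` are made up for the example.

```r
cats <- data.frame(coat = c("calico", "black", "tabby", "tortoiseshell"),
                   weight = c(2.1, 5.0, 3.2, 3.3),
                   likes_string = c(1, 0, 1, 1),
                   age = c(2, 3, 5, 9))

# Logical mask: TRUE for every column name we want to drop
drop <- names(cats) %in% c("weight", "age")
drop
kept_by_mask <- cats[, !drop]

# Alternative: compute the names to keep and select by name
kept_by_name <- cats[, setdiff(names(cats), c("weight", "age"))]

identical(kept_by_mask, kept_by_name)
```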

+

Appending to a data frame +

+

The key to remember when adding data to a data frame is that +columns are vectors and rows are lists. We can also glue two +data frames together with rbind:

+
+

R +

+
+cats <- rbind(cats, cats)
+cats
+
+
+

OUTPUT +

+
           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+5        calico    2.1            1   2
+6         black    5.0            0   3
+7         tabby    3.2            1   5
+8 tortoiseshell    3.3            1   9
+
+

If the row names are no longer sequential (for example, after removing +rows), we can remove the rownames, and R will automatically re-name them sequentially:

+
+

R +

+
+rownames(cats) <- NULL
+cats
+
+
+

OUTPUT +

+
           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+5        calico    2.1            1   2
+6         black    5.0            0   3
+7         tabby    3.2            1   5
+8 tortoiseshell    3.3            1   9
+
+
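Resetting row names is most useful after rows have been removed, because the remaining rows keep their old labels. A minimal sketch with a hand-built `cats` (the name `fewer` is illustrative):

```r
cats <- data.frame(coat = c("calico", "black", "tabby", "tortoiseshell"),
                   weight = c(2.1, 5.0, 3.2, 3.3))

# Dropping row 2 leaves a gap in the row names
fewer <- cats[-2, ]
rownames(fewer)

# Setting the row names to NULL renumbers them from 1
rownames(fewer) <- NULL
rownames(fewer)
```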
+
+ +
+
+

Challenge 1 +

+
+

You can create a new data frame right from within R with the +following syntax:

+
+

R +

+
+df <- data.frame(id = c("a", "b", "c"),
+                 x = 1:3,
+                 y = c(TRUE, TRUE, FALSE))
+
+

Make a data frame that holds the following information for +yourself:

+
  • first name
  • +
  • last name
  • +
  • lucky number
  • +

Then use rbind to add an entry for the people sitting +beside you. Finally, use cbind to add a column with each +person’s answer to the question, “Is it time for coffee break?”

+
+
+
+
+
+ +
+
+
+

R +

+
+df <- data.frame(first = c("Grace"),
+                 last = c("Hopper"),
+                 lucky_number = c(0))
+df <- rbind(df, list("Marie", "Curie", 238) )
+df <- cbind(df, coffeetime = c(TRUE,TRUE))
+
+
+
+
+
+

Realistic example +

+

So far, you have seen the basics of manipulating data frames with our +cat data; now let’s use those skills to digest a more realistic dataset. +Let’s read in the gapminder dataset that we downloaded +previously:

+
+

R +

+
+gapminder <- read.csv("data/gapminder_data.csv")
+
+
+
+ +
+
+

Miscellaneous Tips +

+
+
  • Another type of file you might encounter is the tab-separated values +file (.tsv). To specify a tab as a separator, use sep = "\t" in read.csv(), or use +read.delim().

  • +
  • Files can also be downloaded directly from the Internet into a +local folder of your choice using the +download.file function. The read.csv function +can then be executed to read the downloaded file from the download +location, for example,

  • +
+

R +

+
+download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
+gapminder <- read.csv("data/gapminder_data.csv")
+
+
  • Alternatively, you can read files directly into R from the +Internet by replacing the file path with a web address in +read.csv. Note that in doing this no local copy +of the csv file is saved onto your computer. For example,
  • +
+

R +

+
+gapminder <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv")
+
+
  • You can read directly from Excel spreadsheets without converting +them to plain text first by using the readxl +package.

  • +
  • The argument “stringsAsFactors” can be used to tell R whether to +read strings as factors or as character strings. From R version +4.0.0 onwards, all strings are read in as characters by default, but in +earlier versions of R, strings are read in as factors by default. For +more information, see the call-out in the +previous episode.

  • +
+
+
+
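As a sketch of the tab-separated case, the block below writes a tiny file with made-up toy values to a temporary location and reads it back in two ways. The file contents and the names `tsv_path`, `via_delim` and `via_csv` are assumptions for illustration only.

```r
# Write a tiny tab-separated file to a temporary location (toy values)
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("name\tvalue", "first\t1.5", "second\t2.5"), tsv_path)

# Two ways to read it: read.delim() uses a tab separator by default,
# read.csv() needs it spelled out
via_delim <- read.delim(tsv_path)
via_csv   <- read.csv(tsv_path, sep = "\t")

identical(via_delim, via_csv)
str(via_delim)
```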

Let’s investigate gapminder a bit; the first thing we should always +do is check out what the data looks like with str:

+
+

R +

+
+str(gapminder)
+
+
+

OUTPUT +

+
'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...
+
+

An additional method for examining the structure of gapminder is to +use the summary function. This function can be used on +various objects in R. For data frames, summary yields a +numeric, tabular, or descriptive summary of each column. Numeric or +integer columns are described by the descriptive statistics (quartiles +and mean), and character columns by their length, class, and mode.

+
+

R +

+
+summary(gapminder)
+
+
+

OUTPUT +

+
   country               year           pop             continent        
+ Length:1704        Min.   :1952   Min.   :6.001e+04   Length:1704       
+ Class :character   1st Qu.:1966   1st Qu.:2.794e+06   Class :character  
+ Mode  :character   Median :1980   Median :7.024e+06   Mode  :character  
+                    Mean   :1980   Mean   :2.960e+07                     
+                    3rd Qu.:1993   3rd Qu.:1.959e+07                     
+                    Max.   :2007   Max.   :1.319e+09                     
+    lifeExp        gdpPercap       
+ Min.   :23.60   Min.   :   241.2  
+ 1st Qu.:48.20   1st Qu.:  1202.1  
+ Median :60.71   Median :  3531.8  
+ Mean   :59.47   Mean   :  7215.3  
+ 3rd Qu.:70.85   3rd Qu.:  9325.5  
+ Max.   :82.60   Max.   :113523.1  
+
+

Along with the str and summary functions, +we can examine individual columns of the data frame with our +typeof function:

+
+

R +

+
+typeof(gapminder$year)
+
+
+

OUTPUT +

+
[1] "integer"
+
+
+

R +

+
+typeof(gapminder$country)
+
+
+

OUTPUT +

+
[1] "character"
+
+
+

R +

+
+str(gapminder$country)
+
+
+

OUTPUT +

+
 chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+
+

We can also interrogate the data frame for information about its +dimensions; remembering that str(gapminder) said there were +1704 observations of 6 variables in gapminder, what do you think the +following will produce, and why?

+
+

R +

+
+length(gapminder)
+
+
+

OUTPUT +

+
[1] 6
+
+

A fair guess would have been to say that the length of a data frame +would be the number of rows it has (1704), but this is not the case; +remember, a data frame is a list of vectors and factors:

+
+

R +

+
+typeof(gapminder)
+
+
+

OUTPUT +

+
[1] "list"
+
+

When length gave us 6, it’s because gapminder is built +out of a list of 6 columns. To get the number of rows and columns in our +dataset, try:

+
+

R +

+
+nrow(gapminder)
+
+
+

OUTPUT +

+
[1] 1704
+
+
+

R +

+
+ncol(gapminder)
+
+
+

OUTPUT +

+
[1] 6
+
+

Or, both at once:

+
+

R +

+
+dim(gapminder)
+
+
+

OUTPUT +

+
[1] 1704    6
+
+
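The relationship between these functions can be checked on any data frame. Here is a sketch with a small hand-made data frame called `toy`, which stands in for gapminder so the block runs without the csv file:

```r
toy <- data.frame(id = c("a", "b", "c", "d"),
                  x = 1:4,
                  y = c(TRUE, TRUE, FALSE, FALSE))

length(toy)   # number of columns, because a data frame is a list of columns
ncol(toy)
nrow(toy)
dim(toy)      # rows first, then columns

# length() always agrees with ncol() for a data frame
length(toy) == ncol(toy)
```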

We’ll also likely want to know what the titles of all the columns +are, so we can ask for them later:

+
+

R +

+
+colnames(gapminder)
+
+
+

OUTPUT +

+
[1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"
+
+

At this stage, it’s important to ask ourselves if the structure R is +reporting matches our intuition or expectations; do the basic data types +reported for each column make sense? If not, we need to sort any +problems out now before they turn into bad surprises down the road, +using what we’ve learned about how R interprets data, and the importance +of strict consistency in how we record our data.

+

Once we’re happy that the data types and structures seem reasonable, +it’s time to start digging into our data proper. Check out the first few +lines:

+
+

R +

+
+head(gapminder)
+
+
+

OUTPUT +

+
      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134
+
+
+
+ +
+
+

Challenge 2 +

+
+

It’s good practice to also check the last few lines of your data and +some in the middle. How would you do this?

+

Picking out rows specifically from the middle isn’t too hard, but we +could also ask for a few lines at random. How would you code this?

+
+
+
+
+
+ +
+
+

Checking the last few lines is relatively simple, as R already has a +function for this:

+
+

R +

+
+tail(gapminder)
+tail(gapminder, n = 15)
+
+

What about a few arbitrary rows just in case something is odd in the +middle?

+
+

Tip: There are several ways to achieve this.

+

The solution here presents one form of using nested functions, i.e. a +function passed as an argument to another function. This might sound +like a new concept, but you are already using it! Remember +my_dataframe[rows, cols] will print to screen your data frame with the +number of rows and columns you asked for (although you might have asked +for a range or named columns for example). How would you get the last +row if you don’t know how many rows your data frame has? R has a +function for this. What about getting a (pseudorandom) sample? R also +has a function for this.

+
+

R +

+
+gapminder[sample(nrow(gapminder), 5), ]
+
+
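Because `sample()` is pseudorandom, each run picks different rows. A sketch of how to make the selection repeatable with `set.seed()`, using a small hand-made data frame `toy` in place of gapminder (the seed value 1 is arbitrary):

```r
toy <- data.frame(id = letters[1:10], value = (1:10)^2)

# Fix the random number generator so the same rows are drawn every time
set.seed(1)
picked <- toy[sample(nrow(toy), 3), ]
picked

# The last row, without knowing in advance how many rows there are
toy[nrow(toy), ]
```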
+
+
+
+
+

To make sure our analysis is reproducible, we should put the code +into a script file so we can come back to it later.

+
+
+ +
+
+

Challenge 3 +

+
+

Go to file -> new file -> R script, and write an R script to +load in the gapminder dataset. Put it in the scripts/ +directory and add it to version control.

+

Run the script using the source function, using the file +path as its argument (or by pressing the “source” button in +RStudio).

+
+
+
+
+
+ +
+
+

The source function can be used to run one script from within +another. Assume you would like to load the same type of file over and +over again, and therefore need to specify the same arguments each time. Instead of writing the necessary arguments again and +again, you could write them once and save them as a script. Then, you +can use source("Your_Script_containing_the_load_function") +in a new script to use the function of that script without writing +everything again. Check out ?source to find out more.

+
+

R +

+
+download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
+gapminder <- read.csv(file = "data/gapminder_data.csv")
+
+

To run the script and load the data into the gapminder +variable:

+
+

R +

+
+source(file = "scripts/load-gapminder.R")
+
+
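The idea of sourcing a script that defines a reusable loading function can be sketched as follows. Everything here is hypothetical: the helper `load_table`, the temporary script, and the toy csv file exist only for this illustration and are not part of the lesson's project layout.

```r
# Hypothetical helper script, written to a temporary file for this sketch
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "load_table <- function(path) {",
  "  read.csv(file = path)",
  "}"
), script_path)

# Toy csv file to load (made-up values)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,x", "a,1", "b,2"), csv_path)

# Running the script defines load_table() in the current session
source(script_path)
toy <- load_table(csv_path)
toy
```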
+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

Read the output of str(gapminder) again; this time, use +what you’ve learned about lists and vectors, as well as the output of +functions like colnames and dim to explain +what everything that str prints out for gapminder means. If +there are any parts you can’t interpret, discuss with your +neighbors!

+
+
+
+
+
+ +
+
+

The object gapminder is a data frame with columns

+
  • +country and continent are character +vectors.
  • +
  • +year is an integer vector.
  • +
  • +pop, lifeExp, and gdpPercap +are numeric vectors.
  • +
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
  • Use cbind() to add a new column to a data frame.
  • +
  • Use rbind() to add a new row to a data frame.
  • +
  • Remove rows from a data frame.
  • +
  • Use str(), summary(), nrow(), +ncol(), dim(), colnames(), +rownames(), head(), and typeof() +to understand the structure of a data frame.
  • +
  • Read in a csv file using read.csv().
  • +
  • Understand what length() of a data frame +represents.
  • +
+
+
+
+
+ + +
+
+ + + diff --git a/instructor/06-data-subsetting.html b/instructor/06-data-subsetting.html new file mode 100644 index 000000000..6496f90b1 --- /dev/null +++ b/instructor/06-data-subsetting.html @@ -0,0 +1,1992 @@ + +R for Reproducible Scientific Analysis: Subsetting Data +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Subsetting Data

+

Last updated on 2023-10-26

+ + + +

Estimated time 50 minutes

+ +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I work with subsets of data in R?
  • +
+
+
+
+
+
+

Objectives

+
  • To be able to subset vectors, factors, matrices, lists, and data +frames
  • +
  • To be able to extract individual and multiple elements: by index, by +name, using comparison operations
  • +
  • To be able to skip and remove elements from various data +structures.
  • +
+
+
+
+
+

R has many powerful subset operators. Mastering them will allow you +to easily perform complex operations on any kind of dataset.

+

There are six different ways we can subset any kind of object, and +three different subsetting operators for the different data +structures.

+

Let’s start with the workhorse of R: a simple numeric vector.

+
+

R +

+
+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')
+x
+
+
+

OUTPUT +

+
  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 
+
+
+
+ +
+
+

Atomic vectors +

+
+

In R, simple vectors containing character strings, numbers, or +logical values are called atomic vectors because they can’t be +further simplified.

+
+
+
+

So now that we’ve created a dummy vector to play with, how do we get +at its contents?

+

Accessing elements using their indices +

+

To extract elements of a vector we can give their corresponding +index, starting from one:

+
+

R +

+
+x[1]
+
+
+

OUTPUT +

+
  a 
+5.4 
+
+
+

R +

+
+x[4]
+
+
+

OUTPUT +

+
  d 
+4.8 
+
+

It may look different, but the square brackets operator is a +function. For vectors (and matrices), it means “get me the nth +element”.

+

We can ask for multiple elements at once:

+
+

R +

+
+x[c(1, 3)]
+
+
+

OUTPUT +

+
  a   c 
+5.4 7.1 
+
+

Or slices of the vector:

+
+

R +

+
+x[1:4]
+
+
+

OUTPUT +

+
  a   b   c   d 
+5.4 6.2 7.1 4.8 
+
+

The : operator creates a sequence of numbers from the +left element to the right.

+
+

R +

+
+1:4
+
+
+

OUTPUT +

+
[1] 1 2 3 4
+
+
+

R +

+
+c(1, 2, 3, 4)
+
+
+

OUTPUT +

+
[1] 1 2 3 4
+
+

We can ask for the same element multiple times:

+
+

R +

+
+x[c(1,1,3)]
+
+
+

OUTPUT +

+
  a   a   c 
+5.4 5.4 7.1 
+
+

If we ask for an index beyond the length of the vector, R will return +a missing value:

+
+

R +

+
+x[6]
+
+
+

OUTPUT +

+
<NA> 
+  NA 
+
+

This is a vector of length one containing an NA, whose +name is also NA.

+

If we ask for the 0th element, we get an empty vector:

+
+

R +

+
+x[0]
+
+
+

OUTPUT +

+
named numeric(0)
+
+
+
+ +
+
+

Vector numbering in R starts at 1 +

+
+

In many programming languages (C and Python, for example), the first +element of a vector has an index of 0. In R, the first element is 1.

+
+
+
+

Skipping and removing elements +

+

If we use a negative number as the index of a vector, R will return +every element except for the one specified:

+
+

R +

+
+x[-2]
+
+
+

OUTPUT +

+
  a   c   d   e 
+5.4 7.1 4.8 7.5 
+
+

We can skip multiple elements:

+
+

R +

+
+x[c(-1, -5)]  # or x[-c(1,5)]
+
+
+

OUTPUT +

+
  b   c   d 
+6.2 7.1 4.8 
+
+
+
+ +
+
+

Tip: Order of operations +

+
+

A common stumbling block for novices occurs when trying to skip slices of a +vector. It’s natural to try to negate a sequence like so:

+
+

R +

+
+x[-1:3]
+
+

This gives a somewhat cryptic error:

+
+

ERROR +

+
Error in x[-1:3]: only 0's may be mixed with negative subscripts
+
+

But remember the order of operations. : is really a +function. It takes its first argument as -1, and its second as 3, so +generates the sequence of numbers: c(-1, 0, 1, 2, 3).

+

The correct solution is to wrap that function call in parentheses, so +that the - operator applies to the result:

+
+

R +

+
+x[-(1:3)]
+
+
+

OUTPUT +

+
  d   e 
+4.8 7.5 
+
+
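Printing the two index vectors on their own makes the difference visible. A short sketch using the same `x` as above:

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# The minus sign binds to 1 before : builds the sequence
-1:3

# With parentheses, the whole sequence is negated
-(1:3)

# Only the second form is a valid "skip these" index
x[-(1:3)]
```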
+
+
+

To remove elements from a vector, we need to assign the result back +into the variable:

+
+

R +

+
+x <- x[-4]
+x
+
+
+

OUTPUT +

+
  a   b   c   e 
+5.4 6.2 7.1 7.5 
+
+
+
+ +
+
+

Challenge 1 +

+
+

Given the following code:

+
+

R +

+
+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')
+print(x)
+
+
+

OUTPUT +

+
  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 
+
+

Come up with at least 2 different commands that will produce the +following output:

+
+

OUTPUT +

+
  b   c   d 
+6.2 7.1 4.8 
+
+

After you find 2 different commands, compare notes with your +neighbour. Did you have different strategies?

+
+
+
+
+
+ +
+
+
+

R +

+
+x[2:4]
+
+
+

OUTPUT +

+
  b   c   d 
+6.2 7.1 4.8 
+
+
+

R +

+
+x[-c(1,5)]
+
+
+

OUTPUT +

+
  b   c   d 
+6.2 7.1 4.8 
+
+
+

R +

+
+x[c(2,3,4)]
+
+
+

OUTPUT +

+
  b   c   d 
+6.2 7.1 4.8 
+
+
+
+
+
+

Subsetting by name +

+

We can extract elements by using their name, instead of extracting by +index:

+
+

R +

+
+x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we can name a vector 'on the fly'
+x[c("a", "c")]
+
+
+

OUTPUT +

+
  a   c 
+5.4 7.1 
+
+

This is usually a much more reliable way to subset objects: the +position of various elements can often change when chaining together +subsetting operations, but the names will always remain the same!

+

Subsetting through other logical operations +

+

We can also use any logical vector to subset:

+
+

R +

+
+x[c(FALSE, FALSE, TRUE, FALSE, TRUE)]
+
+
+

OUTPUT +

+
  c   e 
+7.1 7.5 
+
+

Since comparison operators (e.g. >, +<, ==) evaluate to logical vectors, we can +also use them to succinctly subset vectors: the following statement +gives the same result as the previous one.

+
+

R +

+
+x[x > 7]
+
+
+

OUTPUT +

+
  c   e 
+7.1 7.5 
+
+

Breaking it down, this statement first evaluates x>7, +generating a logical vector +c(FALSE, FALSE, TRUE, FALSE, TRUE), and then selects the +elements of x corresponding to the TRUE +values.

+

We can use == to mimic the previous method of indexing +by name (remember you have to use == rather than += for comparisons):

+
+

R +

+
+x[names(x) == "a"]
+
+
+

OUTPUT +

+
  a 
+5.4 
+
+
+
+ +
+
+

Tip: Combining logical conditions +

+
+

We often want to combine multiple logical criteria. For example, we +might want to find all the countries that are located in Asia +or Europe and have life expectancies +within a certain range. Several operations for combining logical vectors +exist in R:

+
  • +&, the “logical AND” operator: returns +TRUE if both the left and right are TRUE.
  • +
  • +|, the “logical OR” operator: returns +TRUE, if either the left or right (or both) are +TRUE.
  • +

You may sometimes see && and || instead of & and |. These two-character operators expect a single logical value on each side: older versions of R silently used only the first element of each vector and ignored the rest, while R 4.3.0 and later raise an error if either side has length greater than one. In general you should not use the two-character operators in data analysis; save them for programming, i.e. deciding whether to execute a statement.

+
  • !, the “logical NOT” operator: converts TRUE to FALSE and FALSE to TRUE. It can negate a single logical condition (e.g. !TRUE becomes FALSE), or a whole vector of conditions (e.g. !c(TRUE, FALSE) becomes c(FALSE, TRUE)).
  • +

Additionally, you can compare the elements within a single vector +using the all function (which returns TRUE if +every element of the vector is TRUE) and the +any function (which returns TRUE if one or +more elements of the vector are TRUE).
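The operators and functions in this tip can be tried on the named vector used throughout this episode. This is only a sketch; the result names (`both`, `either`, `negated`) are made up for illustration.

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

both    <- x[x > 5 & x < 7]   # elements satisfying both conditions
either  <- x[x < 5 | x > 7]   # elements satisfying at least one condition
negated <- x[!(x > 7)]        # elements NOT greater than 7

any(x > 7)   # is at least one element greater than 7?
all(x > 4)   # is every element greater than 4?
```

Note that `&`, `|` and `!` work element by element and return a logical vector as long as `x`, whereas `any()` and `all()` collapse a logical vector to a single value.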

+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

Given the following code:

+
+

R +

+
+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')
+print(x)
+
+
+

OUTPUT +

+
  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 
+
+

Write a subsetting command to return the values in x that are greater +than 4 and less than 7.

+
+
+
+
+
+ +
+
+
+

R +

+
+x_subset <- x[x<7 & x>4]
+print(x_subset)
+
+
+

OUTPUT +

+
  a   b   d 
+5.4 6.2 4.8 
+
+
+
+
+
+
+
+ +
+
+

Tip: Non-unique names +

+
+

You should be aware that it is possible for multiple elements in a +vector to have the same name. (For a data frame, columns can have the +same name — although R tries to avoid this — but row names must be +unique.) Consider these examples:

+
+

R +

+
+x <- 1:3
+x
+
+
+

OUTPUT +

+
[1] 1 2 3
+
+
+

R +

+
+names(x) <- c('a', 'a', 'a')
+x
+
+
+

OUTPUT +

+
a a a 
+1 2 3 
+
+
+

R +

+
+x['a']  # only returns first value
+
+
+

OUTPUT +

+
a 
+1 
+
+
+

R +

+
+x[names(x) == 'a']  # returns all three values
+
+
+

OUTPUT +

+
a a a 
+1 2 3 
+
+
+
+
+
+
+ +
+
+

Tip: Getting help for operators +

+
+

Remember you can search for help on operators by wrapping them in +quotes: help("%in%") or ?"%in%".

+
+
+
+

Skipping named elements +

+

Skipping or removing named elements is a little harder. If we try to +skip one named element by negating the string, R complains (slightly +obscurely) that it doesn’t know how to take the negative of a +string:

+
+

R +

+
+x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we start again by naming a vector 'on the fly'
+x[-"a"]
+
+
+

ERROR +

+
Error in -"a": invalid argument to unary operator
+
+

However, we can use the != (not-equals) operator to +construct a logical vector that will do what we want:

+
+

R +

+
+x[names(x) != "a"]
+
+
+

OUTPUT +

+
  b   c   d   e 
+6.2 7.1 4.8 7.5 
+
+

Skipping multiple named indices is a little bit harder still. Suppose +we want to drop the "a" and "c" elements, so +we try this:

+
+

R +

+
+x[names(x)!=c("a","c")]
+
+
+

WARNING +

+
Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length
+
+
+

OUTPUT +

+
  b   c   d   e 
+6.2 7.1 4.8 7.5 
+
+

R did something, but it gave us a warning that we ought to +pay attention to - and it apparently gave us the wrong answer +(the "c" element is still included in the vector)!

+

So what does != actually do in this case? That’s an +excellent question.

+
+

Recycling

+

Let’s take a look at the comparison component of this code:

+
+

R +

+
+names(x) != c("a", "c")
+
+
+

WARNING +

+
Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length
+
+
+

OUTPUT +

+
[1] FALSE  TRUE  TRUE  TRUE  TRUE
+
+

Why does R give TRUE as the third element of this +vector, when names(x)[3] != "c" is obviously false? When +you use !=, R tries to compare each element of the left +argument with the corresponding element of its right argument. What +happens when you compare vectors of different lengths?

+
Inequality testing

When one vector is shorter than the other, it gets +recycled:

+
Inequality testing: results of recycling

In this case R repeats c("a", "c") as many times as necessary to match names(x), i.e. we get c("a","c","a","c","a"). Since the recycled "a" doesn’t match the third element of names(x), the value of != is TRUE. Because in this case the longer vector length (5) isn’t a multiple of the shorter vector length (2), R printed a warning message. If we had been unlucky and names(x) had contained six elements, R would silently have done the wrong thing (i.e., not what we intended it to do). This recycling rule can introduce subtle, hard-to-find bugs!

+
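The silent case described above can be sketched with a six-element vector. The vector `y` and the name `mask` are hypothetical and used only for illustration.

```r
# Six elements: 6 is a multiple of 2, so recycling produces no warning
y <- c(a = 1, b = 2, c = 3, d = 4, e = 5, f = 6)

# c("a", "c") is recycled to c("a", "c", "a", "c", "a", "c")
mask <- names(y) != c("a", "c")
mask

# "a" is dropped, but "c" is still there
y[mask]
```

Only the first position happens to line up with a matching name, so only `"a"` is removed, and R gives no hint that anything went wrong.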

The way to get R to do what we really want (match each element of the left argument with all of the elements of the right argument) is to use the %in% operator. The %in% operator goes through each element of its left argument, in this case the names of x, and asks, “Does this element occur in the second argument?”. Here, since we want to exclude values, we also need a ! operator to change “in” to “not in”:

+
+

R +

+
+x[! names(x) %in% c("a","c") ]
+
+
+

OUTPUT +

+
  b   d   e 
+6.2 4.8 7.5 
+
+
+
+ +
+
+

Challenge 3 +

+
+

Selecting elements of a vector that match any of a list of components +is a very common data analysis task. For example, the gapminder data set +contains country and continent variables, but +no information between these two scales. Suppose we want to pull out +information from southeast Asia: how do we set up an operation to +produce a logical vector that is TRUE for all of the +countries in southeast Asia and FALSE otherwise?

+

Suppose you have these data:

+
+

R +

+
+seAsia <- c("Myanmar","Thailand","Cambodia","Vietnam","Laos")
+## read in the gapminder data that we downloaded in episode 2
+gapminder <- read.csv("data/gapminder_data.csv", header=TRUE)
+## extract the `country` column from a data frame (we'll see this later);
+## convert from a factor to a character;
+## and get just the non-repeated elements
+countries <- unique(as.character(gapminder$country))
+
+

There’s a wrong way (using only ==), which will give you +a warning; a clunky way (using the logical operators == and +|); and an elegant way (using %in%). See +whether you can come up with all three and explain how they (don’t) +work.

+
+
+
+
+
+ +
+
+
  • The wrong way to do this problem is countries==seAsia. This gives a warning ("In countries == seAsia : longer object length is not a multiple of shorter object length") and the wrong answer (a vector of all FALSE values), because none of the recycled values of seAsia happen to line up correctly with matching values in countries.
  • +
  • The clunky (but technically correct) way to do this +problem is
  • +
+

R +

+
+ (countries=="Myanmar" | countries=="Thailand" |
+ countries=="Cambodia" | countries == "Vietnam" | countries=="Laos")
+
+

(or countries==seAsia[1] | countries==seAsia[2] | ...). +This gives the correct values, but hopefully you can see how awkward it +is (what if we wanted to select countries from a much longer list?).

+
  • The best way to do this problem is +countries %in% seAsia, which is both correct and easy to +type (and read).
  • +
+
+
+
+
+

Handling special values +

+

At some point you will encounter functions in R that cannot handle +missing, infinite, or undefined data.

+

There are a number of special functions you can use to filter out +this data:

+
  • is.na returns a logical vector (or matrix) that is TRUE at every position in a vector, matrix, or data.frame containing NA (or NaN)
  • +
  • likewise, is.nan and is.infinite do the same for NaN and Inf in a vector or matrix.
  • +
  • is.finite returns TRUE at every position in a vector or matrix that does not contain NA, NaN or Inf.
  • +
  • +na.omit will filter out all missing values from a +vector
  • +
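The functions listed above can be compared side by side on a small vector. This is a minimal sketch; the vector `v` is made up for illustration.

```r
v <- c(1, NA, NaN, Inf, -Inf, 5)

is.na(v)        # TRUE for the NA and the NaN
is.nan(v)       # TRUE for the NaN only
is.infinite(v)  # TRUE for Inf and -Inf
is.finite(v)    # TRUE for the ordinary numbers only

# Logical vectors like these can be used directly for subsetting
finite_only <- v[is.finite(v)]
finite_only

# na.omit() removes NA and NaN but keeps the infinite values
no_missing <- as.numeric(na.omit(v))
no_missing
```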

Factor subsetting +

+

Now that we’ve explored the different ways to subset vectors, how do +we subset the other data structures?

+

Factor subsetting works the same way as vector subsetting.

+
+

R +

+
+f <- factor(c("a", "a", "b", "c", "c", "d"))
+f[f == "a"]
+
+
+

OUTPUT +

+
[1] a a
+Levels: a b c d
+
+
+

R +

+
+f[f %in% c("b", "c")]
+
+
+

OUTPUT +

+
[1] b c c
+Levels: a b c d
+
+
+

R +

+
+f[1:3]
+
+
+

OUTPUT +

+
[1] a a b
+Levels: a b c d
+
+

Skipping elements will not remove the level even if no more of that +category exists in the factor:

+
+

R +

+
+f[-3]
+
+
+

OUTPUT +

+
[1] a a c c d
+Levels: a b c d
+
+
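If the unused level is not wanted, one option is base R's `droplevels()` function. The sketch below assumes that removing the empty level is actually what you want; the names `f_skipped` and `f_dropped` are made up for illustration.

```r
f <- factor(c("a", "a", "b", "c", "c", "d"))

f_skipped <- f[-3]        # the only "b" is gone ...
levels(f_skipped)         # ... but "b" is still listed as a level

f_dropped <- droplevels(f_skipped)
levels(f_dropped)         # the unused level has been removed
```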

Matrix subsetting +

+

Matrices are also subsetted using the [ function. In +this case it takes two arguments: the first applying to the rows, the +second to its columns:

+
+

R +

+
+set.seed(1)
+m <- matrix(rnorm(6*4), ncol=4, nrow=6)
+m[3:4, c(3,1)]
+
+
+

OUTPUT +

+
            [,1]       [,2]
+[1,]  1.12493092 -0.8356286
+[2,] -0.04493361  1.5952808
+
+

You can leave the first or second arguments blank to retrieve all the +rows or columns respectively:

+
+

R +

+
+m[, c(3,4)]
+
+
+

OUTPUT +

+
            [,1]        [,2]
+[1,] -0.62124058  0.82122120
+[2,] -2.21469989  0.59390132
+[3,]  1.12493092  0.91897737
+[4,] -0.04493361  0.78213630
+[5,] -0.01619026  0.07456498
+[6,]  0.94383621 -1.98935170
+
+

If we only access one row or column, R will automatically convert the +result to a vector:

+
+

R +

+
+m[3,]
+
+
+

OUTPUT +

+
[1] -0.8356286  0.5757814  1.1249309  0.9189774
+
+

If you want to keep the output as a matrix, you need to specify a +third argument; drop = FALSE:

+
+

R +

+
+m[3, , drop=FALSE]
+
+
+

OUTPUT +

+
           [,1]      [,2]     [,3]      [,4]
+[1,] -0.8356286 0.5757814 1.124931 0.9189774
+
+

Unlike vectors, if we try to access a row or column outside of the +matrix, R will throw an error:

+
+

R +

+
+m[, c(3,6)]
+
+
+

ERROR +

+
Error in m[, c(3, 6)]: subscript out of bounds
+
+
+
+ +
+
+

Tip: Higher dimensional arrays +

+
+

When dealing with multi-dimensional arrays, each argument to [ corresponds to a dimension. For example, in a 3D array, the first three arguments correspond to the rows, columns, and depth dimension.

+
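As a small illustration of subsetting with three arguments, here is a sketch using a made-up 2 x 3 x 4 array (the name `a` and its contents are assumptions for illustration only):

```r
# 2 rows, 3 columns, 4 "layers", filled column by column with 1..24
a <- array(1:24, dim = c(2, 3, 4))

single_value <- a[1, 2, 3]   # row 1, column 2, layer 3
single_value

first_layer <- a[, , 1]      # all rows and columns of layer 1
dim(first_layer)
```

Leaving an argument blank keeps the whole of that dimension, just as it does for matrices.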
+
+
+

Because matrices are vectors, we can also subset using only one +argument:

+
+

R +

+
+m[5]
+
+
+

OUTPUT +

+
[1] 0.3295078
+
+

This usually isn’t useful, and is often confusing to read. However, it is useful to note that matrices are laid out in column-major format by default. That is, the elements of the vector are arranged column-wise:

+
+

R +

+
+matrix(1:6, nrow=2, ncol=3)
+
+
+

OUTPUT +

+
     [,1] [,2] [,3]
+[1,]    1    3    5
+[2,]    2    4    6
+
+

If you wish to populate the matrix by row, use +byrow=TRUE:

+
+

R +

+
+matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
+
+
+

OUTPUT +

+
     [,1] [,2] [,3]
+[1,]    1    2    3
+[2,]    4    5    6
+
+

Matrices can also be subsetted using their rownames and column names +instead of their row and column indices.
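A minimal sketch of name-based matrix subsetting follows. The matrix `named_m` and its row and column names are hypothetical, chosen only for illustration.

```r
named_m <- matrix(1:6, nrow = 2, ncol = 3,
                  dimnames = list(c("r1", "r2"), c("A", "B", "C")))

# Row "r2", columns "A" and "C"
picked <- named_m["r2", c("A", "C")]
picked

# drop = FALSE keeps the result as a (2 x 1) matrix
one_col <- named_m[, "B", drop = FALSE]
dim(one_col)
```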

+
+
+ +
+
+

Challenge 4 +

+
+

Given the following code:

+
+

R +

+
+m <- matrix(1:18, nrow=3, ncol=6)
+print(m)
+
+
+

OUTPUT +

+
     [,1] [,2] [,3] [,4] [,5] [,6]
+[1,]    1    4    7   10   13   16
+[2,]    2    5    8   11   14   17
+[3,]    3    6    9   12   15   18
+
+
  1. Which of the following commands will extract the values 11 and +14?
  2. +

A. m[2,4,2,5]

+

B. m[2:5]

+

C. m[4:5,2]

+

D. m[2,c(4,5)]

+
+
+
+
+
+ +
+
+

D

+
+
+
+
+

List subsetting +

+

Now we’ll introduce some new subsetting operators. There are three +functions used to subset lists. We’ve already seen these when learning +about atomic vectors and matrices: [, [[, and +$.

+

Using [ will always return a list. If you want to +subset a list, but not extract an element, then you +will likely use [.

+
+

R +

+
+xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))
+xlist[1]
+
+
+

OUTPUT +

+
$a
+[1] "Software Carpentry"
+
+

This returns a list with one element.

+

We can subset elements of a list exactly the same way as atomic vectors using [. Comparison operations, however, won’t work because they are not recursive: they will try to condition on the data structures in each element of the list, not on the individual elements within those data structures.

+
+

R +

+
+xlist[1:2]
+
+
+

OUTPUT +

+
$a
+[1] "Software Carpentry"
+
+$b
+ [1]  1  2  3  4  5  6  7  8  9 10
+
+

To extract individual elements of a list, you need to use the +double-square bracket function: [[.

+
+

R +

+
+xlist[[1]]
+
+
+

OUTPUT +

+
[1] "Software Carpentry"
+
+

Notice that now the result is a vector, not a list.
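The difference between the two operators can be checked with `class()`. This sketch uses a shortened version of the list (without the `data` element) so that it runs on its own.

```r
xlist <- list(a = "Software Carpentry", b = 1:10)

sub_list  <- xlist[1]     # single brackets: a list of length one
extracted <- xlist[[1]]   # double brackets: the element itself

class(sub_list)
class(extracted)
```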

+

You can’t extract more than one element at once:

+
+

R +

+
+xlist[[1:2]]
+
+
+

ERROR +

+
Error in xlist[[1:2]]: subscript out of bounds
+
+

Nor use it to skip elements:

+
+

R +

+
+xlist[[-1]]
+
+
+

ERROR +

+
Error in xlist[[-1]]: invalid negative subscript in get1index <real>
+
+

But you can use names to both subset and extract elements:

+
+

R +

+
+xlist[["a"]]
+
+
+

OUTPUT +

+
[1] "Software Carpentry"
+
+

The $ function is a shorthand way for extracting +elements by name:

+
+

R +

+
+xlist$data
+
+
+

OUTPUT +

+
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
+Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
+Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
+Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
+Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
+Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
+Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
+
+
+
+ +
+
+

Challenge 5 +

+
+

Given the following list:

+
+

R +

+
+xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))
+
+

Using your knowledge of both list and vector subsetting, extract the +number 2 from xlist. Hint: the number 2 is contained within the “b” item +in the list.

+
+
+
+
+
+ +
+
+
+

R +

+
+xlist$b[2]
+
+
+

OUTPUT +

+
[1] 2
+
+
+

R +

+
+xlist[[2]][2]
+
+
+

OUTPUT +

+
[1] 2
+
+
+

R +

+
+xlist[["b"]][2]
+
+
+

OUTPUT +

+
[1] 2
+
+
+
+
+
+
+
+ +
+
+

Challenge 6 +

+
+

Given a linear model:

+
+

R +

+
+mod <- aov(pop ~ lifeExp, data=gapminder)
+
+

Extract the residual degrees of freedom (hint: +attributes() will help you)

+
+
+
+
+
+ +
+
+
+

R +

+
+attributes(mod) ## `df.residual` is one of the names of `mod`
+
+
+

R +

+
+mod$df.residual
+
+
+
+
+
+

Data frames +

+

Remember that data frames are lists under the hood, so similar rules apply. However, they are also two-dimensional objects:

+

[ with one argument will act the same way as for lists, +where each list element corresponds to a column. The resulting object +will be a data frame:

+
+

R +

+
+head(gapminder[3])
+
+
+

OUTPUT +

+
       pop
+1  8425333
+2  9240934
+3 10267083
+4 11537966
+5 13079460
+6 14880372
+
+

Similarly, [[ will act to extract a single +column:

+
+

R +

+
+head(gapminder[["lifeExp"]])
+
+
+

OUTPUT +

+
[1] 28.801 30.332 31.997 34.020 36.088 38.438
+
+

And $ provides a convenient shorthand to extract columns +by name:

+
+

R +

+
+head(gapminder$year)
+
+
+

OUTPUT +

+
[1] 1952 1957 1962 1967 1972 1977
+
+

With two arguments, [ behaves the same way as for +matrices:

+
+

R +

+
+gapminder[1:3,]
+
+
+

OUTPUT +

+
      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+
+

If we subset a single row, the result will be a data frame (because +the elements are mixed types):

+
+

R +

+
+gapminder[3,]
+
+
+

OUTPUT +

+
      country year      pop continent lifeExp gdpPercap
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+
+

But for a single column the result will be a vector (this can be +changed with the third argument, drop = FALSE).
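The effect of `drop = FALSE` on a single column can be sketched with a small made-up data frame (the name `df` and its contents are assumptions, standing in for `gapminder`):

```r
df <- data.frame(country = c("A", "B", "C"),
                 year    = c(1952, 1957, 1962))

col_vec <- df[, "year"]                 # simplified to a vector
col_df  <- df[, "year", drop = FALSE]   # stays a one-column data frame

class(col_vec)
class(col_df)
```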

+
+
+ +
+
+

Challenge 7 +

+
+

Fix each of the following common data frame subsetting errors:

+
  1. Extract observations collected for the year 1957
  2. +
+

R +

+
gapminder[gapminder$year = 1957,]
+
+
  1. Extract all columns except 1 through to 4
  2. +
+

R +

+
+gapminder[,-1:4]
+
+
  1. Extract the rows where the life expectancy is longer than 80 years
  2. +
+

R +

+
+gapminder[gapminder$lifeExp > 80]
+
+
  1. Extract the first row, and the fourth and fifth columns +(continent and lifeExp).
  2. +
+

R +

+
+gapminder[1, 4, 5]
+
+
  1. Advanced: extract rows that contain information for the years 2002 +and 2007
  2. +
+

R +

+
+gapminder[gapminder$year == 2002 | 2007,]
+
+
+
+
+
+
+ +
+
+

Fix each of the following common data frame subsetting errors:

+
  1. Extract observations collected for the year 1957
  2. +
+

R +

+
+# gapminder[gapminder$year = 1957,]
+gapminder[gapminder$year == 1957,]
+
+
  1. Extract all columns except 1 through to 4
  2. +
+

R +

+
+# gapminder[,-1:4]
+gapminder[,-c(1:4)]
+
+
  1. Extract the rows where the life expectancy is longer than 80 +years
  2. +
+

R +

+
+# gapminder[gapminder$lifeExp > 80]
+gapminder[gapminder$lifeExp > 80,]
+
+
  1. Extract the first row, and the fourth and fifth columns +(continent and lifeExp).
  2. +
+

R +

+
+# gapminder[1, 4, 5]
+gapminder[1, c(4, 5)]
+
+
  1. Advanced: extract rows that contain information for the years 2002 +and 2007
  2. +
+

R +

+
+# gapminder[gapminder$year == 2002 | 2007,]
+gapminder[gapminder$year == 2002 | gapminder$year == 2007,]
+gapminder[gapminder$year %in% c(2002, 2007),]
+
+
+
+
+
+
+
+ +
+
+

Challenge 8 +

+
+
  1. Why does gapminder[1:20] return an error? How does +it differ from gapminder[1:20, ]?

  2. +
  3. Create a new data.frame called +gapminder_small that only contains rows 1 through 9 and 19 +through 23. You can do this in one or two steps.

  4. +
+
+
+
+
+ +
+
+
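  1. gapminder is a data.frame, so with a single argument [ selects columns, as it does for a list. gapminder[1:20] therefore asks for 20 columns, but gapminder has only six, hence the error. To select rows we need to subset on two dimensions: gapminder[1:20, ] gives the first 20 rows and all columns.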

  2. +
  3. +
  4. +
+

R +

+
+gapminder_small <- gapminder[c(1:9, 19:23),]
+
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
  • Indexing in R starts at 1, not 0.
  • +
  • Access individual values by location using [].
  • +
  • Access slices of data using [low:high].
  • +
  • Access arbitrary sets of data using [c(...)].
  • +
  • Use logical operations and logical vectors to access subsets of +data.
  • +
+
+
+
+
+ + +
+
+ + + diff --git a/instructor/07-control-flow.html b/instructor/07-control-flow.html new file mode 100644 index 000000000..626f3d683 --- /dev/null +++ b/instructor/07-control-flow.html @@ -0,0 +1,1248 @@ + +R for Reproducible Scientific Analysis: Control Flow +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Control Flow

+

Last updated on 2023-10-26

+ + + +

Estimated time 65 minutes

+ +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I make data-dependent choices in R?
  • +
  • How can I repeat operations in R?
  • +
+
+
+
+
+
+

Objectives

+
  • Write conditional statements with if...else statements +and ifelse().
  • +
  • Write and understand for() loops.
  • +
+
+
+
+
+

Often when we’re coding we want to control the flow of our actions. +This can be done by setting actions to occur only if a condition or a +set of conditions are met. Alternatively, we can also set an action to +occur a particular number of times.

+

There are several ways you can control flow in R. For conditional +statements, the most commonly used approaches are the constructs:

+
+

R +

+
# if
+if (condition is true) {
+  perform action
+}
+
+# if ... else
+if (condition is true) {
+  perform action
+} else {  # that is, if the condition is false,
+  perform alternative action
+}
+
+

Say, for example, that we want R to print a message if a variable +x has a particular value:

+
+

R +

+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+}
+
+x
+
+
+

OUTPUT +

+
[1] 8
+
+

The print statement does not appear in the console because x is not greater than or equal to 10. To print a different message for numbers less than 10, we can add an else statement.

+
+

R +

+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else {
+  print("x is less than 10")
+}
+
+
+

OUTPUT +

+
[1] "x is less than 10"
+
+

You can also test multiple conditions by using +else if.

+
+

R +

+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else if (x > 5) {
+  print("x is greater than 5, but less than 10")
+} else {
+  print("x is less than or equal to 5")
+}
+
+
+

OUTPUT +

+
[1] "x is greater than 5, but less than 10"
+
+

Important: when R evaluates the condition inside +if() statements, it is looking for a logical element, i.e., +TRUE or FALSE. This can cause some headaches +for beginners. For example:

+
+

R +

+
+x  <-  4 == 3
+if (x) {
+  "4 equals 3"
+} else {
+  "4 does not equal 3"
+}
+
+
+

OUTPUT +

+
[1] "4 does not equal 3"
+
+

As we can see, the not-equal message was printed because x is FALSE.

+
+

R +

+
+x <- 4 == 3
+x
+
+
+

OUTPUT +

+
[1] FALSE
+
+
+
+ +
+
+

Challenge 1 +

+
+

Use an if() statement to print a suitable message +reporting whether there are any records from 2002 in the +gapminder dataset. Now do the same for 2012.

+
+
+
+
+
+ +
+
+

We will first see a solution to Challenge 1 which does not use the any() function. We first use a logical vector describing which elements of gapminder$year are equal to 2002 to extract the matching rows:

+
+

R +

+
+gapminder[(gapminder$year == 2002),]
+
+

Then, we count the number of rows of the data.frame gapminder that correspond to the year 2002:

+
+

R +

+
+rows2002_number <- nrow(gapminder[(gapminder$year == 2002),])
+
+

The presence of any record for the year 2002 is equivalent to the condition that rows2002_number is one or more:

+
+

R +

+
+rows2002_number >= 1
+
+

Putting it all together, we obtain:

+
+

R +

+
+if(nrow(gapminder[(gapminder$year == 2002),]) >= 1){
+   print("Record(s) for the year 2002 found.")
+}
+
+

All this can be done more quickly with any(). The +logical condition can be expressed as:

+
+

R +

+
+if(any(gapminder$year == 2002)){
+   print("Record(s) for the year 2002 found.")
+}
+
+
+
+
+
+

Did anyone get an error message like this?

+
+

ERROR +

+
Error in if (gapminder$year == 2012) {: the condition has length > 1
+
+

The if() function only accepts singular (of length 1) inputs. In R 4.2.0 and later it returns an error when you give it a longer vector; older versions of R only gave a warning and evaluated the condition using the first element of the vector. Therefore, to use the if() function, you need to make sure your input is singular (of length 1).

+
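One way to make the condition singular is to collapse the comparison with any() and to handle the "not found" case with else. The sketch below uses a short made-up vector `years` as a stand-in for gapminder$year, and the name `msg` is also just for illustration.

```r
years <- c(1952, 2002, 2007)   # hypothetical stand-in for gapminder$year

# years == 2012 has length 3; any() reduces it to a single TRUE or FALSE
if (any(years == 2012)) {
  msg <- "Record(s) for the year 2012 found."
} else {
  msg <- "No records for the year 2012 found."
}
print(msg)
```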
+
+ +
+
+

Tip: Built in ifelse() +function +

+
+

R accepts both if() and +else if() statements structured as outlined above, but also +statements using R’s built-in ifelse() +function. This function accepts both singular and vector inputs and is +structured as follows:

+
+

R +

+
# ifelse function
+ifelse(condition is true, perform action, perform alternative action)
+
+

where the first argument is the condition or a set of conditions to be met, the second argument is the value returned when the condition is TRUE, and the third argument is the value returned when the condition is FALSE.

+
+

R +

+
+y <- -3
+ifelse(y < 0, "y is a negative number", "y is either positive or zero")
+
+
+

OUTPUT +

+
[1] "y is a negative number"
+
+
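Because ifelse() also accepts vector inputs, the same call can classify several values at once. This is a minimal sketch with a made-up vector `y_vec`:

```r
y_vec <- c(-3, 0, 2)

# The condition is evaluated for every element, and one label is returned per element
labels <- ifelse(y_vec < 0, "negative", "positive or zero")
labels
```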
+
+
+
+
+ +
+
+

Tip: any() and +all() +

+
+

The any() function will return TRUE if at +least one TRUE value is found within a vector, otherwise it +will return FALSE. This can be used in a similar way to the +%in% operator. The function all(), as the name +suggests, will only return TRUE if all values in the vector +are TRUE.

+
+
+
+

Repeating operations +

+

If you want to iterate over a set of values, when the order of +iteration is important, and perform the same operation on each, a +for() loop will do the job. We saw for() loops +in the shell +lessons earlier. This is the most flexible of looping operations, +but therefore also the hardest to use correctly. In general, the advice +of many R users would be to learn about for() +loops, but to avoid using for() loops unless the order of +iteration is important: i.e. the calculation at each iteration depends +on the results of previous iterations. If the order of iteration is not +important, then you should learn about vectorized alternatives, such as +the purrr package, as they pay off in computational +efficiency.

+

The basic structure of a for() loop is:

+
+

R +

+
for (iterator in set of values) {
+  do a thing
+}
+
+

For example:

+
+

R +

+
+for (i in 1:10) {
+  print(i)
+}
+
+
+

OUTPUT +

+
[1] 1
+[1] 2
+[1] 3
+[1] 4
+[1] 5
+[1] 6
+[1] 7
+[1] 8
+[1] 9
+[1] 10
+
+

The 1:10 bit creates a vector on the fly; you can +iterate over any other vector as well.

+

We can use a for() loop nested within another +for() loop to iterate over two things at once.

+
+

R +

+
+for (i in 1:5) {
+  for (j in c('a', 'b', 'c', 'd', 'e')) {
+    print(paste(i,j))
+  }
+}
+
+
+

OUTPUT +

+
[1] "1 a"
+[1] "1 b"
+[1] "1 c"
+[1] "1 d"
+[1] "1 e"
+[1] "2 a"
+[1] "2 b"
+[1] "2 c"
+[1] "2 d"
+[1] "2 e"
+[1] "3 a"
+[1] "3 b"
+[1] "3 c"
+[1] "3 d"
+[1] "3 e"
+[1] "4 a"
+[1] "4 b"
+[1] "4 c"
+[1] "4 d"
+[1] "4 e"
+[1] "5 a"
+[1] "5 b"
+[1] "5 c"
+[1] "5 d"
+[1] "5 e"
+
+

We notice in the output that when the first index (i) is +set to 1, the second index (j) iterates through its full +set of indices. Once the indices of j have been iterated +through, then i is incremented. This process continues +until the last index has been used for each for() loop.

+

Rather than printing the results, we could write the loop output to a +new object.

+
+

R +

+
+output_vector <- c()
+for (i in 1:5) {
+  for (j in c('a', 'b', 'c', 'd', 'e')) {
+    temp_output <- paste(i, j)
+    output_vector <- c(output_vector, temp_output)
+  }
+}
+output_vector
+
+
+

OUTPUT +

+
 [1] "1 a" "1 b" "1 c" "1 d" "1 e" "2 a" "2 b" "2 c" "2 d" "2 e" "3 a" "3 b"
+[13] "3 c" "3 d" "3 e" "4 a" "4 b" "4 c" "4 d" "4 e" "5 a" "5 b" "5 c" "5 d"
+[25] "5 e"
+
+

This approach can be useful, but ‘growing your results’ (building the +result object incrementally) is computationally inefficient, so avoid it +when you are iterating through a lot of values.
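For this particular example, a loop is not strictly needed: paste() is vectorised, so the same 25 strings can be built in a single call. This is a sketch of one possible alternative, not the only way to do it; the name `vectorised_output` is made up for illustration.

```r
# rep(..., each = 5) gives 1 1 1 1 1 2 2 2 2 2 ...;
# the letters are recycled to match its length
vectorised_output <- paste(rep(1:5, each = 5), c('a', 'b', 'c', 'd', 'e'))
vectorised_output
```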

+
+
+ +
+
+

Tip: don’t grow your results +

+
+

One of the biggest things that trips up novices and experienced R users alike is building a results object (vector, list, matrix, data frame) as your for loop progresses. Computers are very bad at handling this, so your calculations can very quickly slow to a crawl. It’s much better to define an empty results object of appropriate dimensions beforehand, rather than initializing an empty object without dimensions. So if you know the end result will be stored in a matrix with 5 rows and 5 columns, create an empty matrix of that size, then at each iteration store the results in the appropriate location.

+
+
+
+

A better way is to define your (empty) output object before filling +in the values. For this example, it looks more involved, but is still +more efficient.

+
+

R +

+
+output_matrix <- matrix(nrow = 5, ncol = 5)
+j_vector <- c('a', 'b', 'c', 'd', 'e')
+for (i in 1:5) {
+  for (j in 1:5) {
+    temp_j_value <- j_vector[j]
+    temp_output <- paste(i, temp_j_value)
+    output_matrix[i, j] <- temp_output
+  }
+}
+output_vector2 <- as.vector(output_matrix)
+output_vector2
+
+
+

OUTPUT +

+
 [1] "1 a" "2 a" "3 a" "4 a" "5 a" "1 b" "2 b" "3 b" "4 b" "5 b" "1 c" "2 c"
+[13] "3 c" "4 c" "5 c" "1 d" "2 d" "3 d" "4 d" "5 d" "1 e" "2 e" "3 e" "4 e"
+[25] "5 e"
+
+
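The same preallocation idea also works with a plain vector. The sketch below assumes the final length (25) is known in advance and uses a hypothetical counter `k` to track the next free position; `output_vector3` is a made-up name.

```r
j_vector <- c('a', 'b', 'c', 'd', 'e')
output_vector3 <- character(25)   # preallocate 25 empty strings
k <- 1                            # position of the next free slot

for (i in 1:5) {
  for (j in j_vector) {
    output_vector3[k] <- paste(i, j)
    k <- k + 1
  }
}
output_vector3
```

Filling existing slots avoids copying the whole vector on every iteration, which is what happens when the result is grown with c().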
+
+ +
+
+

Tip: While loops +

+
+

Sometimes you will find yourself needing to repeat an operation as +long as a certain condition is met. You can do this with a +while() loop.

+
+

R +

+
while(this condition is true){
+  do a thing
+}
+
+

R will interpret a condition being met as “TRUE”.

+

As an example, here’s a while loop that generates random numbers from +a uniform distribution (the runif() function) between 0 and +1 until it gets one that’s less than 0.1.

+
+

R +

+
+z <- 1
+while(z > 0.1){
+  z <- runif(1)
+  cat(z, "\n")
+}
+
+

while() loops will not always be appropriate. You have +to be particularly careful that you don’t end up stuck in an infinite +loop because your condition is always met and hence the while statement +never terminates.
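One common safeguard is to add an iteration limit to the condition. This is a sketch only: `max_iter` and `n_iter` are hypothetical names, and the limit of 1000 is an arbitrary choice.

```r
max_iter <- 1000   # hypothetical safety limit
n_iter <- 0        # number of iterations so far
z <- 1

# Both sides are single values, so && is appropriate here
while (z > 0.1 && n_iter < max_iter) {
  z <- runif(1)
  n_iter <- n_iter + 1
}
n_iter
```

The loop now stops either when a small enough number is drawn or when the limit is reached, whichever comes first.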

+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

Compare the objects output_vector and +output_vector2. Are they the same? If not, why not? How +would you change the last block of code to make +output_vector2 the same as output_vector?

+
+
+
+
+
+ +
+
+

We can check whether the two vectors are identical using the +all() function:

+
+

R +

+
+all(output_vector == output_vector2)
+
+

This returns FALSE, so the two vectors are not identical. However, all the elements of output_vector can be found in output_vector2:

+
+

R +

+
+all(output_vector %in% output_vector2)
+
+

and vice versa:

+
+

R +

+
+all(output_vector2 %in% output_vector)
+
+

Therefore, the elements in output_vector and output_vector2 are just sorted in a different order. This is because as.vector() outputs the elements of an input matrix column by column. Taking a look at output_matrix, we can see that we want its elements by rows. The solution is to transpose output_matrix. We can do it either by calling the transpose function t() or by inputting the elements in the right order. The first solution requires changing the original

+
+

R +

+
+output_vector2 <- as.vector(output_matrix)
+
+

into

+
+

R +

+
+output_vector2 <- as.vector(t(output_matrix))
+
+

The second solution requires changing

+
+

R +

+
+output_matrix[i, j] <- temp_output
+
+

into

+
+

R +

+
+output_matrix[j, i] <- temp_output
+
+
+
+
+
+
+
+ +
+
+

Challenge 3 +

+
+

Write a script that loops through the gapminder data by +continent and prints out whether the mean life expectancy is smaller or +larger than 50 years.

+
+
+
+
+
+ +
+
+

Step 1: We want to make sure we can extract all the +unique values of the continent vector

+
+

R +

+
+gapminder <- read.csv("data/gapminder_data.csv")
+unique(gapminder$continent)
+
+

Step 2: We also need to loop over each of these +continents and calculate the average life expectancy for each +subset of data. We can do that as follows:

+
  1. Loop over each of the unique values of ‘continent’
  2. For each value of continent, create a temporary variable storing that subset
  3. Return the calculated life expectancy to the user by printing the output:
+

R +

+
+for (iContinent in unique(gapminder$continent)) {
+  tmp <- gapminder[gapminder$continent == iContinent, ]
+  cat(iContinent, mean(tmp$lifeExp, na.rm = TRUE), "\n")
+  rm(tmp)
+}
+
+

Step 3: The exercise asks us to report whether the average life expectancy is below or above 50 years. So we need to add an if() condition before printing, which evaluates whether the calculated average life expectancy is above or below a threshold, and prints an output conditional on the result. We need to amend (3) from above:

+

3a. If the calculated life expectancy is less than some threshold (50 +years), return the continent and a statement that life expectancy is +less than threshold, otherwise return the continent and a statement that +life expectancy is greater than threshold:

+
+

R +

+
+thresholdValue <- 50
+
+for (iContinent in unique(gapminder$continent)) {
+   tmp <- mean(gapminder[gapminder$continent == iContinent, "lifeExp"])
+
+   if (tmp < thresholdValue){
+       cat("Average Life Expectancy in", iContinent, "is less than", thresholdValue, "\n")
+   } else {
+       cat("Average Life Expectancy in", iContinent, "is greater than", thresholdValue, "\n")
+   } # end if else condition
+   rm(tmp)
+} # end for loop
+
+
+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

Modify the script from Challenge 3 to loop over each country. This +time print out whether the life expectancy is smaller than 50, between +50 and 70, or greater than 70.

+
+
+
+
+
+ +
+
+

We modify our solution to Challenge 3 by now adding two thresholds, +lowerThreshold and upperThreshold and +extending our if-else statements:

+
+

R +

+
+ lowerThreshold <- 50
+ upperThreshold <- 70
+
+for (iCountry in unique(gapminder$country)) {
+    tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"])
+
+    if(tmp < lowerThreshold) {
+        cat("Average Life Expectancy in", iCountry, "is less than", lowerThreshold, "\n")
+    } else if(tmp < upperThreshold) {  # tmp >= lowerThreshold is already known here
+        cat("Average Life Expectancy in", iCountry, "is between", lowerThreshold, "and", upperThreshold, "\n")
+    } else {
+        cat("Average Life Expectancy in", iCountry, "is greater than", upperThreshold, "\n")
+    }
+    rm(tmp)
+}
+
+
+
+
+
+
+
+ +
+
+

Challenge 5 - Advanced +

+
+

Write a script that loops over each country in the +gapminder dataset, tests whether the country starts with a +‘B’, and graphs life expectancy against time as a line graph if the mean +life expectancy is under 50 years.

+
+
+
+
+
+ +
+
+

We will use the grep() command that was introduced in the Unix Shell lesson to find countries that start with “B”. Let’s understand how to do this first. Following from the Unix shell section we may be tempted to try the following:

+
+

R +

+
+grep("^B", unique(gapminder$country))
+
+

But when we evaluate this command it returns the indices of the elements of unique(gapminder$country) that start with “B”, not the country names themselves. To get the values, we must add the value = TRUE option to the grep() command:

+
+

R +

+
+grep("^B", unique(gapminder$country), value = TRUE)
+
+
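The difference between the two calls is easier to see on a short vector. This is a small sketch; the vector countries is a made-up stand-in for unique(gapminder$country):

```r
# Hypothetical country names, used only for illustration
countries <- c("Belgium", "Brazil", "Canada", "Cuba")

idx   <- grep("^B", countries)                # positions of the matches
names <- grep("^B", countries, value = TRUE)  # the matching strings themselves

idx    # 1 2
names  # "Belgium" "Brazil"
```

Indexing with the positions, countries[idx], gives the same result as value = TRUE.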

We will now store these countries in a variable called candidateCountries, and then loop over each entry in the variable. Inside the loop, we evaluate the average life expectancy for each country, and if the average life expectancy is less than 50 we use base graphics to plot life expectancy over time using with() and subset():

+
+

R +

+
+thresholdValue <- 50
+candidateCountries <- grep("^B", unique(gapminder$country), value = TRUE)
+
+for (iCountry in candidateCountries) {
+    tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"])
+
+    if (tmp < thresholdValue) {
+        cat("Average Life Expectancy in", iCountry, "is less than", thresholdValue, "plotting life expectancy graph... \n")
+
+        with(subset(gapminder, country == iCountry),
+                plot(year, lifeExp,
+                     type = "o",
+                     main = paste("Life Expectancy in", iCountry, "over time"),
+                     ylab = "Life Expectancy",
+                     xlab = "Year"
+                     ) # end plot
+             ) # end with
+    } # end if
+    rm(tmp)
+} # end for loop
+
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
  • Use if and else to make choices.
  • Use for to repeat operations.
+
+
+
+
+ + +
+
+ + + diff --git a/instructor/08-plot-ggplot2.html b/instructor/08-plot-ggplot2.html new file mode 100644 index 000000000..d82021e2e --- /dev/null +++ b/instructor/08-plot-ggplot2.html @@ -0,0 +1,1106 @@ + +R for Reproducible Scientific Analysis: Creating Publication-Quality Graphics with ggplot2 +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Creating Publication-Quality Graphics with ggplot2

+

Last updated on 2023-10-26

+ + + +

Estimated time 80 minutes

+ +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I create publication-quality graphics in R?
+
+
+
+
+
+

Objectives

+
  • To be able to use ggplot2 to generate publication-quality graphics.
  • To apply geometry, aesthetic, and statistics layers to a ggplot plot.
  • To manipulate the aesthetics of a plot using different colors, shapes, and lines.
  • To improve data visualization through transforming scales and paneling by group.
  • To save a plot created with ggplot to disk.
+
+
+
+
+

Plotting our data is one of the best ways to quickly explore it and +the various relationships between variables.

+

There are three main plotting systems in R: the base plotting system, the lattice package, and the ggplot2 package.

+

Today we’ll be learning about the ggplot2 package, because it is the +most effective for creating publication-quality graphics.

+

ggplot2 is built on the grammar of graphics, the idea that any plot +can be built from the same set of components: a data +set, mapping aesthetics, and graphical +layers:

+
  • Data sets are the data that you, the user, provide.
  • Mapping aesthetics are what connect the data to the graphics. They tell ggplot2 how to use your data to affect how the graph looks, such as changing what is plotted on the X or Y axis, or the size or color of different data points.
  • Layers are the actual graphical output from ggplot2. Layers determine what kinds of plot are shown (scatterplot, histogram, etc.), the coordinate system used (rectangular, polar, others), and other important aspects of the plot. The idea of layers of graphics may be familiar to you if you have used image editing programs like Photoshop, Illustrator, or Inkscape.

Let’s start off building an example using the gapminder data from +earlier. The most basic function is ggplot, which lets R +know that we’re creating a new plot. Any of the arguments we give the +ggplot function are the global options for the +plot: they apply to all layers on the plot.

+
+

R +

+
+library("ggplot2")
+ggplot(data = gapminder)
+
+
Blank plot, before adding any mapping aesthetics to ggplot().

Here we called ggplot and told it what data we want to +show on our figure. This is not enough information for +ggplot to actually draw anything. It only creates a blank +slate for other elements to be added to.

+

Now we’re going to add in the mapping aesthetics +using the aes function. aes tells +ggplot how variables in the data map to +aesthetic properties of the figure, such as which columns of +the data should be used for the x and +y locations.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
+
+
Plotting area with axes for a scatter plot of life expectancy vs GDP, with no data points visible.

Here we told ggplot we want to plot the “gdpPercap” column of the gapminder data frame on the x-axis, and the “lifeExp” column on the y-axis. Notice that we didn’t need to explicitly pass aes these columns (e.g. x = gapminder[, "gdpPercap"]). This is because ggplot looks for those columns in the data we supplied.

+

The final part of making our plot is to tell ggplot how +we want to visually represent the data. We do this by adding a new +layer to the plot using one of the +geom functions.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point()
+
+
Scatter plot of life expectancy vs GDP per capita, now showing the data points.

Here we used geom_point, which tells ggplot +we want to visually represent the relationship between +x and y as a scatterplot of +points.

+
+
+ +
+
+

Challenge 1 +

+
+

Modify the example so that the figure shows how life expectancy has +changed over time:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point()
+
+

Hint: the gapminder dataset has a column called “year”, which should +appear on the x-axis.

+
+
+
+
+
+ +
+
+

Here is one possible solution:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) + geom_point()
+
+
Binned scatterplot of life expectancy versus year showing how life expectancy has increased over time
+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

In the previous examples and challenge we’ve used the +aes function to tell the scatterplot geom +about the x and y locations of each +point. Another aesthetic property we can modify is the point +color. Modify the code from the previous challenge to +color the points by the “continent” column. What trends +do you see in the data? Are they what you expected?

+
+
+
+
+
+ +
+
+

The solution presented below adds color=continent to the +call of the aes function. The general trend seems to +indicate an increased life expectancy over the years. On continents with +stronger economies we find a longer life expectancy.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_point()
+
+
Binned scatterplot of life expectancy vs year with color-coded continents showing value of 'aes' function
+
+
+
+

Layers +

+

Using a scatterplot probably isn’t the best for visualizing change +over time. Instead, let’s tell ggplot to visualize the data +as a line plot:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, color=continent)) +
+  geom_line()
+
+

Instead of adding a geom_point layer, we’ve added a +geom_line layer.

+

However, the result doesn’t look quite as we might have expected: it +seems to be jumping around a lot in each continent. Let’s try to +separate the data by country, plotting one line for each country:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
+  geom_line()
+
+

We’ve added the group aesthetic, which +tells ggplot to draw a line for each country.

+

But what if we want to visualize both lines and points on the plot? +We can add another layer to the plot:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
+  geom_line() + geom_point()
+
+

It’s important to note that each layer is drawn on top of the +previous layer. In this example, the points have been drawn on top +of the lines. Here’s a demonstration:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
+  geom_line(mapping = aes(color=continent)) + geom_point()
+
+

In this example, the aesthetic mapping of +color has been moved from the global plot options in +ggplot to the geom_line layer so it no longer +applies to the points. Now we can clearly see that the points are drawn +on top of the lines.

+
+
+ +
+
+

Tip: Setting an aesthetic to a value instead +of a mapping +

+
+

So far, we’ve seen how to use an aesthetic (such as +color) as a mapping to a variable in the data. +For example, when we use +geom_line(mapping = aes(color=continent)), ggplot will give +a different color to each continent. But what if we want to change the +color of all lines to blue? You may think that +geom_line(mapping = aes(color="blue")) should work, but it +doesn’t. Since we don’t want to create a mapping to a specific variable, +we can move the color specification outside of the aes() +function, like this: geom_line(color="blue").
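The contrast between mapping and setting can be sketched on a small made-up data frame. The data frame toy and the object names p_mapped and p_fixed are hypothetical and only stand in for the gapminder examples; the sketch assumes ggplot2 is installed:

```r
library(ggplot2)

# Hypothetical toy data standing in for gapminder
toy <- data.frame(
  year      = rep(c(1952, 1957, 1962), times = 2),
  lifeExp   = c(30, 32, 34, 60, 62, 64),
  continent = rep(c("Asia", "Europe"), each = 3)
)

# Mapping: color depends on a variable, so it goes inside aes()
p_mapped <- ggplot(toy, aes(x = year, y = lifeExp, group = continent)) +
  geom_line(mapping = aes(color = continent))

# Setting: one fixed color for every line, so it goes outside aes()
p_fixed <- ggplot(toy, aes(x = year, y = lifeExp, group = continent)) +
  geom_line(color = "blue")
```

Printing p_mapped draws one color per continent with a legend, while p_fixed draws every line in blue with no legend.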

+
+
+
+
+
+ +
+
+

Challenge 3 +

+
+

Switch the order of the point and line layers from the previous +example. What happened?

+
+
+
+
+
+ +
+
+

The lines now get drawn over the points!

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
+ geom_point() + geom_line(mapping = aes(color=continent))
+
+
Plot of life expectancy vs year in which the lines, colored by continent, are drawn on top of the black points.
+
+
+
+

Transformations and statistics +

+

ggplot2 also makes it easy to overlay statistical models over the +data. To demonstrate we’ll go back to our first example:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point()
+
+

Currently it’s hard to see the relationship between the points due to some strong outliers in GDP per capita. We can change the scale of units on the x axis using the scale functions. These control the mapping between the data values and visual values of an aesthetic. We can also modify the transparency of the points, using the alpha argument, which is especially helpful when you have a large amount of data which is very clustered.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10()
+
+
Scatterplot of GDP vs life expectancy showing logarithmic x-axis data spread

The scale_x_log10 function applied a log10 transformation to the x-axis scale of the plot, so that each power of 10 is evenly spaced from left to right. For example, a GDP per capita of 1,000 is the same horizontal distance away from a value of 10,000 as the 10,000 value is from 100,000. This helps to visualize the spread of the data along the x-axis.
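The equal spacing can be checked with a little arithmetic in base R. This is only a sketch of the idea; the vector gdp holds the three example values from the paragraph above:

```r
# The three example values of GDP per capita
gdp <- c(1000, 10000, 100000)

# On a log10 scale each value sits at its exponent: 3, 4, 5
positions <- log10(gdp)

# The gaps between neighbouring values are therefore equal
gaps <- diff(positions)
gaps  # 1 1
```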

+
+
+ +
+
+

Tip Reminder: Setting an aesthetic to a value +instead of a mapping +

+
+

Notice that we used geom_point(alpha = 0.5). As the +previous tip mentioned, using a setting outside of the +aes() function will cause this value to be used for all +points, which is what we want in this case. But just like any other +aesthetic setting, alpha can also be mapped to a variable in +the data. For example, we can give a different transparency to each +continent with +geom_point(mapping = aes(alpha = continent)).

+
+
+
+

We can fit a simple relationship to the data by adding another layer, +geom_smooth:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm")
+
+
+

OUTPUT +

+
`geom_smooth()` using formula = 'y ~ x'
+
+
Scatter plot of life expectancy vs GDP per capita with a blue trend line summarising the relationship between variables, and gray shaded area indicating 95% confidence intervals for that trend line.

We can make the line thicker by setting the +size aesthetic in the geom_smooth +layer:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm", size=1.5)
+
+
+

WARNING +

+
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
+ℹ Please use `linewidth` instead.
+This warning is displayed once every 8 hours.
+Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
+generated.
+
+
+

OUTPUT +

+
`geom_smooth()` using formula = 'y ~ x'
+
+
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The blue trend line is slightly thicker than in the previous figure.

There are two ways an aesthetic can be specified. Here we +set the size aesthetic by passing it as an +argument to geom_smooth. Previously in the lesson we’ve +used the aes function to define a mapping between +data variables and their visual representation.
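The deprecation warning shown above suggests linewidth in place of size for lines. A minimal sketch of that change follows, assuming ggplot2 3.4.0 or later is installed; the data frame toy is made up and only stands in for gapminder:

```r
library(ggplot2)

# Hypothetical toy data standing in for gapminder
toy <- data.frame(
  gdpPercap = c(500, 1000, 5000, 10000, 40000),
  lifeExp   = c(40, 48, 62, 70, 79)
)

# Same layers as in the lesson, with linewidth replacing size on the line
p <- ggplot(data = toy, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  geom_smooth(method = "lm", linewidth = 1.5)
```

The size argument still controls the size of points, so geom_point(size = 3) is unaffected by this change.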

+
+
+ +
+
+

Challenge 4a +

+
+

Modify the color and size of the points on the point layer in the +previous example.

+

Hint: do not use the aes function.

+
+
+
+
+
+ +
+
+

Here is a possible solution. Notice that the color argument is supplied outside of the aes() function. This means that it applies to all data points on the graph and is not related to a specific variable.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+ geom_point(size=3, color="orange") + scale_x_log10() +
+ geom_smooth(method="lm", size=1.5)
+
+
+

OUTPUT +

+
`geom_smooth()` using formula = 'y ~ x'
+
+
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency.
+
+
+
+
+
+ +
+
+

Challenge 4b +

+
+

Modify your solution to Challenge 4a so that the points are now a +different shape and are colored by continent with new trendlines. Hint: +The color argument can be used inside the aesthetic.

+
+
+
+
+
+ +
+
+

Here is a possible solution. Notice that supplying the color argument inside the aes() function enables you to connect it to a certain variable. The shape argument, as you can see, modifies all data points the same way (it is outside the aes() call), while the color argument, which is placed inside the aes() call, modifies a point’s color based on its continent value.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
+ geom_point(size=3, shape=17) + scale_x_log10() +
+ geom_smooth(method="lm", size=1.5)
+
+
+

OUTPUT +

+
`geom_smooth()` using formula = 'y ~ x'
+
+
+
+
+
+

Multi-panel figures +

+

Earlier we visualized the change in life expectancy over time across +all countries in one plot. Alternatively, we can split this out over +multiple panels by adding a layer of facet panels.

+
+
+ +
+
+

Tip +

+
+

We start by making a subset of data including only countries located +in the Americas. This includes 25 countries, which will begin to clutter +the figure. Note that we apply a “theme” definition to rotate the x-axis +labels to maintain readability. Nearly everything in ggplot2 is +customizable.

+
+
+
+
+

R +

+
+americas <- gapminder[gapminder$continent == "Americas",]
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))
+
+

The facet_wrap layer took a “formula” as its argument, +denoted by the tilde (~). This tells R to draw a panel for each unique +value in the country column of the gapminder dataset.

+

Modifying text +

+

To clean this figure up for a publication we need to change some of +the text elements. The x-axis is too cluttered, and the y axis should +read “Life expectancy”, rather than the column name in the data +frame.

+

We can do this by adding a couple of different layers. The +theme layer controls the axis text, and overall text +size. Labels for the axes, plot title and any legend can be set using +the labs function. Legend titles are set using the same +names we used in the aes specification. Thus below the +color legend title is set using color = "Continent", while +the title of a fill legend would be set using +fill = "MyTitle".

+
+

R +

+
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_line() + facet_wrap( ~ country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+
+

Exporting the plot +

+

The ggsave() function allows you to export a plot +created with ggplot. You can specify the dimension and resolution of +your plot by adjusting the appropriate arguments (width, +height and dpi) to create high quality +graphics for publication. In order to save the plot from above, we first +assign it to a variable lifeExp_plot, then tell +ggsave to save that plot in png format to a +directory called results. (Make sure you have a +results/ folder in your working directory.)

+
+

R +

+
+lifeExp_plot <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_line() + facet_wrap( ~ country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+
+ggsave(filename = "results/lifeExp.png", plot = lifeExp_plot, width = 12, height = 10, dpi = 300, units = "cm")
+
+

There are two nice things about ggsave. First, it +defaults to the last plot, so if you omit the plot argument +it will automatically save the last plot you created with +ggplot. Secondly, it tries to determine the format you want +to save your plot in from the file extension you provide for the +filename (for example .png or .pdf). If you +need to, you can specify the format explicitly in the +device argument.

+

This is a taste of what you can do with ggplot2. RStudio provides a really useful cheat sheet of the different layers available, and more extensive documentation is available on the ggplot2 website. The full collection of RStudio cheat sheets is also available online. Finally, if you have no idea how to change something, a quick Google search will usually send you to a relevant question and answer on Stack Overflow with reusable code to modify!

+
+
+ +
+
+

Challenge 5 +

+
+

Generate boxplots to compare life expectancy between the different +continents during the available years.

+

Advanced:

+
  • Rename y axis as Life Expectancy.
  • Remove x axis labels.
+
+
+
+
+ +
+
+

Here is a possible solution. xlab() and ylab() set labels for the x and y axes, respectively. The axis title, text and ticks are attributes of the theme and must be modified within a theme() call.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = continent, y = lifeExp, fill = continent)) +
+ geom_boxplot() + facet_wrap(~year) +
+ ylab("Life Expectancy") +
+ theme(axis.title.x=element_blank(),
+       axis.text.x = element_blank(),
+       axis.ticks.x = element_blank())
+
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
  • Use ggplot2 to create plots.
  • Think about graphics in layers: aesthetics, geometry, statistics, scale transformation, and grouping.
+
+
+
+
+ + +
+
+ + + diff --git a/instructor/09-vectorization.html b/instructor/09-vectorization.html new file mode 100644 index 000000000..d8750ac6f --- /dev/null +++ b/instructor/09-vectorization.html @@ -0,0 +1,1021 @@ + +R for Reproducible Scientific Analysis: Vectorization +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Vectorization

+

Last updated on 2023-10-26

+ + + +

Estimated time 25 minutes

+ +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I operate on all the elements of a vector at once?
+
+
+
+
+
+

Objectives

+
  • To understand vectorized operations in R.
+
+
+
+
+

Most of R’s functions are vectorized, meaning that the function will operate on all elements of a vector without needing to loop through and act on each element one at a time. This makes code more concise, easier to read, and less error-prone.

+
+

R +

+
+x <- 1:4
+x * 2
+
+
+

OUTPUT +

+
[1] 2 4 6 8
+
+

The multiplication happened to each element of the vector.

+

We can also add two vectors together:

+
+

R +

+
+y <- 6:9
+x + y
+
+
+

OUTPUT +

+
[1]  7  9 11 13
+
+

Each element of x was added to its corresponding element +of y:

+
+

R +

+
x:  1  2  3  4
+    +  +  +  +
+y:  6  7  8  9
+---------------
+    7  9 11 13
+
+

Here is how we would add two vectors together using a for loop:

+
+

R +

+
+output_vector <- c()
+for (i in 1:4) {
+  output_vector[i] <- x[i] + y[i]
+}
+output_vector
+
+
+

OUTPUT +

+
[1]  7  9 11 13
+
+

Compare this to the output using vectorised operations.

+
+

R +

+
+sum_xy <- x + y
+sum_xy
+
+
+

OUTPUT +

+
[1]  7  9 11 13
+
+
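The two approaches can also be compared directly in code. This is a small sketch that reuses x and y from above; preallocating output_vector with numeric() is a variation on the loop shown earlier, not the lesson's own version:

```r
x <- 1:4
y <- 6:9

# Loop version: fill a preallocated vector one element at a time
output_vector <- numeric(length(x))
for (i in seq_along(x)) {
  output_vector[i] <- x[i] + y[i]
}

# Vectorized version: one expression, no explicit loop
sum_xy <- x + y

# Both approaches give the same values
all(output_vector == sum_xy)  # TRUE
```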
+
+ +
+
+

Challenge 1 +

+
+

Let’s try this on the pop column of the +gapminder dataset.

+

Make a new column in the gapminder data frame that +contains population in units of millions of people. Check the head or +tail of the data frame to make sure it worked.

+
+
+
+
+
+ +
+
+


+
+

R +

+
+gapminder$pop_millions <- gapminder$pop / 1e6
+head(gapminder)
+
+
+

OUTPUT +

+
      country year      pop continent lifeExp gdpPercap pop_millions
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453     8.425333
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530     9.240934
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007    10.267083
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971    11.537966
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811    13.079460
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134    14.880372
+
+
+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

On a single graph, plot population, in millions, against year, for +all countries. Do not worry about identifying which country is +which.

+

Repeat the exercise, graphing only for China, India, and Indonesia. +Again, do not worry about which is which.

+
+
+
+
+
+ +
+
+

Refresh your plotting skills by plotting population in millions +against year.

+
+

R +

+
+ggplot(gapminder, aes(x = year, y = pop_millions)) +
+ geom_point()
+
+
Scatter plot showing populations in the millions against the year for all countries, countries are not labeled.
+

R +

+
+countryset <- c("China","India","Indonesia")
+ggplot(gapminder[gapminder$country %in% countryset,],
+       aes(x = year, y = pop_millions)) +
+  geom_point()
+
+
Scatter plot showing populations in the millions against the year for China, India, and Indonesia, countries are not labeled.
+
+
+
+

Comparison operators, logical operators, and many functions are also +vectorized:

+

Comparison operators

+
+

R +

+
+x > 2
+
+
+

OUTPUT +

+
[1] FALSE FALSE  TRUE  TRUE
+
+

Logical operators

+
+

R +

+
+a <- x > 3  # or, for clarity, a <- (x > 3)
+a
+
+
+

OUTPUT +

+
[1] FALSE FALSE FALSE  TRUE
+
+
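The logical operators &, | and ! are vectorized in the same way and combine comparisons element by element. This is a short sketch using the same x; the object names both, either and negated are made up for illustration:

```r
x <- 1:4

both    <- x > 1 & x < 4   # TRUE where both comparisons hold
either  <- x < 2 | x > 3   # TRUE where at least one comparison holds
negated <- !(x > 3)        # flips each element

both     # FALSE  TRUE  TRUE FALSE
either   #  TRUE FALSE FALSE  TRUE
negated  #  TRUE  TRUE  TRUE FALSE
```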
+
+ +
+
+

Tip: some useful functions for logical +vectors +

+
+

any() will return TRUE if any +element of a vector is TRUE.
all() will return TRUE if all +elements of a vector are TRUE.
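A quick sketch of both functions on the logical vector a from the example above, rebuilt here so the block runs on its own:

```r
x <- 1:4
a <- x > 3      # FALSE FALSE FALSE TRUE

any_true <- any(a)  # at least one element is TRUE
all_true <- all(a)  # not every element is TRUE

any_true  # TRUE
all_true  # FALSE
```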

+
+
+
+

Most functions also operate element-wise on vectors:

+

Functions

+
+

R +

+
+x <- 1:4
+log(x)
+
+
+

OUTPUT +

+
[1] 0.0000000 0.6931472 1.0986123 1.3862944
+
+

Vectorized operations work element-wise on matrices:

+
+

R +

+
+m <- matrix(1:12, nrow=3, ncol=4)
+m * -1
+
+
+

OUTPUT +

+
     [,1] [,2] [,3] [,4]
+[1,]   -1   -4   -7  -10
+[2,]   -2   -5   -8  -11
+[3,]   -3   -6   -9  -12
+
+
+
+ +
+
+

Tip: element-wise vs. matrix +multiplication +

+
+

Very important: the operator * gives you element-wise +multiplication! To do matrix multiplication, we need to use the +%*% operator:

+
+

R +

+
+m %*% matrix(1, nrow=4, ncol=1)
+
+
+

OUTPUT +

+
     [,1]
+[1,]   22
+[2,]   26
+[3,]   30
+
+
+

R +

+
+matrix(1:4, nrow=1) %*% matrix(1:4, ncol=1)
+
+
+

OUTPUT +

+
     [,1]
+[1,]   30
+
+

For more on matrix algebra, see the Quick-R reference guide.
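The difference between the two operators is easiest to see on a small square matrix. This is a minimal sketch; the matrix a is a made-up example:

```r
# Hypothetical 2 x 2 matrix, filled column by column:
#      [,1] [,2]
# [1,]    1    3
# [2,]    2    4
a <- matrix(1:4, nrow = 2)

elementwise <- a * a    # multiplies matching entries
matprod     <- a %*% a  # rows of the first times columns of the second

elementwise
#      [,1] [,2]
# [1,]    1    9
# [2,]    4   16

matprod
#      [,1] [,2]
# [1,]    7   15
# [2,]   10   22
```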

+
+
+
+
+
+ +
+
+

Challenge 3 +

+
+

Given the following matrix:

+
+

R +

+
+m <- matrix(1:12, nrow=3, ncol=4)
+m
+
+
+

OUTPUT +

+
     [,1] [,2] [,3] [,4]
+[1,]    1    4    7   10
+[2,]    2    5    8   11
+[3,]    3    6    9   12
+
+

Write down what you think will happen when you run:

+
  1. m ^ -1
  2. m * c(1, 0, -1)
  3. m > c(0, 20)
  4. m * c(1, 0, -1, 2)

Did you get the output you expected? If not, ask a helper!

+
+
+
+
+
+ +
+
+

Given the following matrix:

+
+

R +

+
+m <- matrix(1:12, nrow=3, ncol=4)
+m
+
+
+

OUTPUT +

+
     [,1] [,2] [,3] [,4]
+[1,]    1    4    7   10
+[2,]    2    5    8   11
+[3,]    3    6    9   12
+
+

Write down what you think will happen when you run:

+
  1. m ^ -1
+

OUTPUT +

+
          [,1]      [,2]      [,3]       [,4]
+[1,] 1.0000000 0.2500000 0.1428571 0.10000000
+[2,] 0.5000000 0.2000000 0.1250000 0.09090909
+[3,] 0.3333333 0.1666667 0.1111111 0.08333333
+
+
  2. m * c(1, 0, -1)
+

OUTPUT +

+
     [,1] [,2] [,3] [,4]
+[1,]    1    4    7   10
+[2,]    0    0    0    0
+[3,]   -3   -6   -9  -12
+
+
  3. m > c(0, 20)
+

OUTPUT +

+
      [,1]  [,2]  [,3]  [,4]
+[1,]  TRUE FALSE  TRUE FALSE
+[2,] FALSE  TRUE FALSE  TRUE
+[3,]  TRUE FALSE  TRUE FALSE
+
+
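The fourth expression from the challenge, m * c(1, 0, -1, 2), is not shown in the outputs above. As a sketch of what to expect: the vector has length 4 and the matrix has 12 elements, so the vector is recycled three times down the columns without a warning. The name result is made up for illustration:

```r
m <- matrix(1:12, nrow = 3, ncol = 4)

# The length-4 vector is recycled along the columns of m (12 elements)
result <- m * c(1, 0, -1, 2)
result
#      [,1] [,2] [,3] [,4]
# [1,]    1    8   -7    0
# [2,]    0    5   16  -11
# [3,]   -3    0    9   24
```

Because recycling follows column order, the multipliers do not line up with rows or columns here, which is why the pattern looks irregular.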
+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

We’re interested in looking at the sum of the following sequence of +fractions:

+
+

R +

+
+ x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2)
+
+

This would be tedious to type out, and impossible for high values of +n. Use vectorisation to compute x when n=100. What is the sum when +n=10,000?

+
+
+
+
+
+ +
+
+

We’re interested in looking at the sum of the following sequence of +fractions:

+
+

R +

+
+ x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2)
+
+

This would be tedious to type out, and impossible for high values of +n. Can you use vectorisation to compute x, when n=100? How about when +n=10,000?

+
+

R +

+
+sum(1/(1:100)^2)
+
+
+

OUTPUT +

+
[1] 1.634984
+
+
+

R +

+
+sum(1/(1:1e04)^2)
+
+
+

OUTPUT +

+
[1] 1.644834
+
+
+

R +

+
+n <- 10000
+sum(1/(1:n)^2)
+
+
+

OUTPUT +

+
[1] 1.644834
+
+

We can also obtain the same results using a function:

+
+

R +

+
+inverse_sum_of_squares <- function(n) {
+  sum(1/(1:n)^2)
+}
+inverse_sum_of_squares(100)
+
+
+

OUTPUT +

+
[1] 1.634984
+
+
+

R +

+
+inverse_sum_of_squares(10000)
+
+
+

OUTPUT +

+
[1] 1.644834
+
+
+

R +

+
+n <- 10000
+inverse_sum_of_squares(n)
+
+
+

OUTPUT +

+
[1] 1.644834
+
+
+
+
+
+
+
+ +
+
+

Tip: Operations on vectors of unequal +length +

+
+

Operations can also be performed on vectors of unequal length, through a process known as recycling. This process automatically repeats the shorter vector until it matches the length of the longer vector. R will provide a warning if the length of the longer vector is not a multiple of the length of the shorter vector.

+
+

R +

+
+x <- c(1, 2, 3)
+y <- c(1, 2, 3, 4, 5, 6, 7)
+x + y
+
+
+

WARNING +

+
Warning in x + y: longer object length is not a multiple of shorter object
+length
+
+
+

OUTPUT +

+
[1] 2 4 6 5 7 9 8
+
+

Vector x was recycled to match the length of vector y:

+
+

R +

+
x:  1  2  3  1  2  3  1
+    +  +  +  +  +  +  +
+y:  1  2  3  4  5  6  7
+-----------------------
+    2  4  6  5  7  9  8
+
+
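For contrast, here is a sketch of recycling when the longer length is an exact multiple of the shorter one, in which case R gives no warning. The names short, long and recycled_sum are made up for illustration:

```r
short <- c(1, 2)
long  <- c(10, 20, 30, 40)

# short is repeated twice (1 2 1 2) to match the length of long
recycled_sum <- short + long
recycled_sum  # 11 22 31 42
```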
+
+
+
+
+ +
+
+

Keypoints +

+
+
  • Use vectorized operations instead of loops.
+
+
+ + + +
+
+ + +
+
+ + + diff --git a/instructor/10-functions.html b/instructor/10-functions.html new file mode 100644 index 000000000..c723aee66 --- /dev/null +++ b/instructor/10-functions.html @@ -0,0 +1,1222 @@ + +R for Reproducible Scientific Analysis: Functions Explained +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Functions Explained

+

Last updated on 2023-10-26

+ + + +

Estimated time 60 minutes

+ +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I write a new function in R?
  • +
+
+
+
+
+
+

Objectives

+
  • Define a function that takes arguments.
  • +
  • Return a value from a function.
  • +
  • Check argument conditions with stopifnot() in +functions.
  • +
  • Test a function.
  • +
  • Set default values for function arguments.
  • +
  • Explain why we should divide programs into small, single-purpose +functions.
  • +
+
+
+
+
+

If we only had one data set to analyze, it would probably be faster +to load the file into a spreadsheet and use that to plot simple +statistics. However, the gapminder data is updated periodically, and we +may want to pull in that new information later and re-run our analysis +again. We may also obtain similar data from a different source in the +future.

+

In this lesson, we’ll learn how to write a function so that we can +repeat several operations with a single command.

+
+
+ +
+
+

What is a function? +

+
+

Functions gather a sequence of operations into a whole, preserving it +for ongoing use. Functions provide:

+
  • a name we can remember and invoke it by
  • +
  • relief from the need to remember the individual operations
  • +
  • a defined set of inputs and expected outputs
  • +
  • rich connections to the larger programming environment
  • +

As the basic building block of most programming languages, +user-defined functions constitute “programming” as much as any single +abstraction can. If you have written a function, you are a computer +programmer.

+
+
+
+

Defining a function +

+

Let’s open a new R script file in the functions/ +directory and call it functions-lesson.R.

+

The general structure of a function is:

+
+

R +

+
+my_function <- function(parameters) {
+  # perform action
+  # return value
+}
+
+

Let’s define a function fahr_to_kelvin() that converts +temperatures from Fahrenheit to Kelvin:

+
+

R +

+
+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+

We define fahr_to_kelvin() by assigning it to the output of function. The list of argument names is contained within parentheses. Next, the body of the function (the statements that are executed when it runs) is contained within curly braces ({}). The statements in the body are indented by two spaces. This makes the code easier to read but does not affect how the code operates.

+

It is useful to think of creating functions like writing a recipe in a cookbook. First you define the “ingredients” that your function needs. In this case, we only need one ingredient to use our function: “temp”. After we list our ingredients, we say what we will do with them. In this case, we take our ingredient and apply a set of mathematical operators to it.

+

When we call the function, the values we pass to it as arguments are +assigned to those variables so that we can use them inside the function. +Inside the function, we use a return statement to send a +result back to whoever asked for it.

+
+
+ +
+
+

Tip +

+
+

Unlike many other languages, R does not require a return statement. A function automatically returns the value of the last expression evaluated in its body. But for clarity, we will explicitly define the return statement.
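As a small illustration of the point about return values, the sketch below defines the same conversion twice. The names `fahr_to_kelvin_explicit` and `fahr_to_kelvin_implicit` are hypothetical and exist only for this comparison:

```r
# Explicit return(), the style used throughout this lesson
fahr_to_kelvin_explicit <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

# No return(): the value of the last evaluated expression is returned
fahr_to_kelvin_implicit <- function(temp) {
  ((temp - 32) * (5 / 9)) + 273.15
}

# Both versions give the same result for the same input
print(fahr_to_kelvin_explicit(32))
print(fahr_to_kelvin_implicit(32))
```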

+
+
+
+

Let’s try running our function. Calling our own function is no +different from calling any other function:

+
+

R +

+
+# freezing point of water
+fahr_to_kelvin(32)
+
+
+

OUTPUT +

+
[1] 273.15
+
+
+

R +

+
+# boiling point of water
+fahr_to_kelvin(212)
+
+
+

OUTPUT +

+
[1] 373.15
+
+
+
+ +
+
+

Challenge 1 +

+
+

Write a function called kelvin_to_celsius() that takes a +temperature in Kelvin and returns that temperature in Celsius.

+

Hint: To convert from Kelvin to Celsius you subtract 273.15

+
+
+
+
+
+ +
+
+

Write a function called kelvin_to_celsius that takes a +temperature in Kelvin and returns that temperature in Celsius

+
+

R +

+
+kelvin_to_celsius <- function(temp) {
+  celsius <- temp - 273.15
+  return(celsius)
+}
+
+
+
+
+
+

Combining functions +

+

The real power of functions comes from mixing, matching and combining +them into ever-larger chunks to get the effect we want.

+

Let’s define two functions that will convert temperature from +Fahrenheit to Kelvin, and Kelvin to Celsius:

+
+

R +

+
+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+kelvin_to_celsius <- function(temp) {
+  celsius <- temp - 273.15
+  return(celsius)
+}
+
+
+
+ +
+
+

Challenge 2 +

+
+

Define the function to convert directly from Fahrenheit to Celsius, +by reusing the two functions above (or using your own functions if you +prefer).

+
+
+
+
+
+ +
+
+

Define the function to convert directly from Fahrenheit to Celsius, +by reusing these two functions above

+
+

R +

+
+fahr_to_celsius <- function(temp) {
+  temp_k <- fahr_to_kelvin(temp)
+  result <- kelvin_to_celsius(temp_k)
+  return(result)
+}
+
+
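The same composition can also be written as a single nested call, which some readers may find more compact. This is only a sketch of an alternative style; the name `fahr_to_celsius_nested` is hypothetical, and the two helper functions are repeated so the block runs on its own:

```r
fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

kelvin_to_celsius <- function(temp) {
  celsius <- temp - 273.15
  return(celsius)
}

# The inner call runs first and its result is passed to the outer call
fahr_to_celsius_nested <- function(temp) {
  return(kelvin_to_celsius(fahr_to_kelvin(temp)))
}

# The arithmetic is vectorised, so several temperatures can be converted at once
celsius_values <- fahr_to_celsius_nested(c(32, 212))
print(celsius_values)
```

The intermediate variable in the solution above is easier to inspect while debugging, whereas the nested form saves a line. Either is reasonable.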
+
+
+
+

Interlude: Defensive Programming +

+

Now that we’ve begun to appreciate how writing functions provides an +efficient way to make R code re-usable and modular, we should note that +it is important to ensure that functions only work in their intended +use-cases. Checking function parameters is related to the concept of +defensive programming. Defensive programming encourages us to +frequently check conditions and throw an error if something is wrong. +These checks are referred to as assertion statements because we want to +assert some condition is TRUE before proceeding. They make +it easier to debug because they give us a better idea of where the +errors originate.

+
+

Checking conditions with stopifnot() +

+

Let’s start by re-examining fahr_to_kelvin(), our +function for converting temperatures from Fahrenheit to Kelvin. It was +defined like so:

+
+

R +

+
+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+

For this function to work as intended, the argument temp +must be a numeric value; otherwise, the mathematical +procedure for converting between the two temperature scales will not +work. To create an error, we can use the function stop(). +For example, since the argument temp must be a +numeric vector, we could check for this condition with an +if statement and throw an error if the condition was +violated. We could augment our function above like so:

+
+

R +

+
+fahr_to_kelvin <- function(temp) {
+  if (!is.numeric(temp)) {
+    stop("temp must be a numeric vector.")
+  }
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+

If we had multiple conditions or arguments to check, it would take many lines of code to check all of them. Luckily R provides the convenience function stopifnot(). We can list as many requirements as we like, each of which should evaluate to TRUE; stopifnot() throws an error if it finds one that is FALSE. Listing these conditions also serves a secondary purpose as extra documentation for the function.
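To make the idea of listing several requirements concrete, here is a hedged sketch. The function name `fahr_to_kelvin_checked` and the particular set of checks are assumptions chosen for illustration, not part of the lesson:

```r
# Hypothetical variant of fahr_to_kelvin() with several assertions
fahr_to_kelvin_checked <- function(temp) {
  stopifnot(
    is.numeric(temp),  # must be a numeric vector
    length(temp) > 0,  # must not be empty
    !anyNA(temp)       # must not contain missing values
  )
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

print(fahr_to_kelvin_checked(32))

# tryCatch() captures the error message instead of halting the script
msg <- tryCatch(
  fahr_to_kelvin_checked("32"),
  error = function(e) conditionMessage(e)
)
print(msg)
```

The message names the first condition that failed, which is why each requirement is easiest to debug when it is listed as its own argument.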

+

Let’s try out defensive programming with stopifnot() by +adding assertions to check the input to our function +fahr_to_kelvin().

+

We want to assert the following: temp is a numeric +vector. We may do that like so:

+
+

R +

+
+fahr_to_kelvin <- function(temp) {
+  stopifnot(is.numeric(temp))
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+

It still works when given proper input.

+
+

R +

+
+# freezing point of water
+fahr_to_kelvin(temp = 32)
+
+
+

OUTPUT +

+
[1] 273.15
+
+

But fails instantly if given improper input.

+
+

R +

+
+# temp is a factor instead of numeric
+fahr_to_kelvin(temp = as.factor(32))
+
+
+

ERROR +

+
Error in fahr_to_kelvin(temp = as.factor(32)): is.numeric(temp) is not TRUE
+
+
+
+ +
+
+

Challenge 3 +

+
+

Use defensive programming to ensure that our +fahr_to_celsius() function throws an error immediately if +the argument temp is specified inappropriately.

+
+
+
+
+
+ +
+
+

Extend our previous definition of the function by adding in an +explicit call to stopifnot(). Since +fahr_to_celsius() is a composition of two other functions, +checking inside here makes adding checks to the two component functions +redundant.

+
+

R +

+
+fahr_to_celsius <- function(temp) {
+  stopifnot(is.numeric(temp))
+  temp_k <- fahr_to_kelvin(temp)
+  result <- kelvin_to_celsius(temp_k)
+  return(result)
+}
+
+
+
+
+
+
+

More on combining functions +

+

Now, we’re going to define a function that calculates the Gross +Domestic Product of a nation from the data available in our dataset:

+
+

R +

+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat) {
+  gdp <- dat$pop * dat$gdpPercap
+  return(gdp)
+}
+
+

We define calcGDP() by assigning it to the output of function. The list of argument names is contained within parentheses. Next, the body of the function (the statements executed when you call the function) is contained within curly braces ({}).

+

We’ve indented the statements in the body by two spaces. This makes +the code easier to read but does not affect how it operates.

+

When we call the function, the values we pass to it are assigned to +the arguments, which become variables inside the body of the +function.

+

Inside the function, we use the return() function to +send back the result. This return() function is optional: R +will automatically return the results of whatever command is executed on +the last line of the function.

+
+

R +

+
+calcGDP(head(gapminder))
+
+
+

OUTPUT +

+
[1]  6567086330  7585448670  8758855797  9648014150  9678553274 11697659231
+
+

That’s not very informative. Let’s add some more arguments so we can +extract that per year and country.

+
+

R +

+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year=NULL, country=NULL) {
+  if(!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country,]
+  }
+  gdp <- dat$pop * dat$gdpPercap
+
+  new <- cbind(dat, gdp=gdp)
+  return(new)
+}
+
+

If you’ve been writing these functions down into a separate R script (a good idea!), you can load the functions into your R session by using the source() function:

+
+

R +

+
+source("functions/functions-lesson.R")
+
+

Ok, so there’s a lot going on in this function now. In plain English, +the function now subsets the provided data by year if the year argument +isn’t empty, then subsets the result by country if the country argument +isn’t empty. Then it calculates the GDP for whatever subset emerges from +the previous two steps. The function then adds the GDP as a new column +to the subsetted data and returns this as the final result. You can see +that the output is much more informative than a vector of numbers.

+

Let’s take a look at what happens when we specify the year:

+
+

R +

+
+head(calcGDP(gapminder, year=2007))
+
+
+

OUTPUT +

+
       country year      pop continent lifeExp  gdpPercap          gdp
+12 Afghanistan 2007 31889923      Asia  43.828   974.5803  31079291949
+24     Albania 2007  3600523    Europe  76.423  5937.0295  21376411360
+36     Algeria 2007 33333216    Africa  72.301  6223.3675 207444851958
+48      Angola 2007 12420476    Africa  42.731  4797.2313  59583895818
+60   Argentina 2007 40301927  Americas  75.320 12779.3796 515033625357
+72   Australia 2007 20434176   Oceania  81.235 34435.3674 703658358894
+
+

Or for a specific country:

+
+

R +

+
+calcGDP(gapminder, country="Australia")
+
+
+

OUTPUT +

+
     country year      pop continent lifeExp gdpPercap          gdp
+61 Australia 1952  8691212   Oceania  69.120  10039.60  87256254102
+62 Australia 1957  9712569   Oceania  70.330  10949.65 106349227169
+63 Australia 1962 10794968   Oceania  70.930  12217.23 131884573002
+64 Australia 1967 11872264   Oceania  71.100  14526.12 172457986742
+65 Australia 1972 13177000   Oceania  71.930  16788.63 221223770658
+66 Australia 1977 14074100   Oceania  73.490  18334.20 258037329175
+67 Australia 1982 15184200   Oceania  74.740  19477.01 295742804309
+68 Australia 1987 16257249   Oceania  76.320  21888.89 355853119294
+69 Australia 1992 17481977   Oceania  77.560  23424.77 409511234952
+70 Australia 1997 18565243   Oceania  78.830  26997.94 501223252921
+71 Australia 2002 19546792   Oceania  80.370  30687.75 599847158654
+72 Australia 2007 20434176   Oceania  81.235  34435.37 703658358894
+
+

Or both:

+
+

R +

+
+calcGDP(gapminder, year=2007, country="Australia")
+
+
+

OUTPUT +

+
     country year      pop continent lifeExp gdpPercap          gdp
+72 Australia 2007 20434176   Oceania  81.235  34435.37 703658358894
+
+

Let’s walk through the body of the function:

+
+

R +

+
calcGDP <- function(dat, year=NULL, country=NULL) {
+
+

Here we’ve added two arguments, year, and +country. We’ve set default arguments for both as +NULL using the = operator in the function +definition. This means that those arguments will take on those values +unless the user specifies otherwise.
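Default arguments are easier to see in a smaller function. The sketch below is separate from calcGDP(); the names `round_temp` and `describe_year` are hypothetical and used only to show a default being used and then overridden:

```r
# `digits` takes the value 1 unless the caller says otherwise
round_temp <- function(temp, digits = 1) {
  return(round(temp, digits = digits))
}

default_result <- round_temp(273.1549)               # uses digits = 1
override_result <- round_temp(273.1549, digits = 3)  # caller overrides the default
print(default_result)
print(override_result)

# A NULL default signals "nothing was supplied", as in calcGDP()
describe_year <- function(year = NULL) {
  if (is.null(year)) {
    return("all years")
  }
  return(paste("year", year))
}

print(describe_year())
print(describe_year(2007))
```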

+
+

R +

+
+  if(!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country,]
+  }
+
+

Here, we check whether each additional argument is NULL and, whenever one is not NULL, overwrite the dataset stored in dat with the subset selected by that argument.

+

Building these conditionals into the function makes it more flexible +for later. Now, we can use it to calculate the GDP for:

+
  • The whole dataset;
  • +
  • A single year;
  • +
  • A single country;
  • +
  • A single combination of year and country.
  • +

Because we subset with %in% rather than ==, we can also give multiple years or countries to those arguments.
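The difference between %in% and == when several values are supplied can be seen on a tiny example. The data frame `dat` below is made up to stand in for gapminder:

```r
# Small made-up data frame standing in for gapminder
dat <- data.frame(
  country = c("Australia", "Australia", "Kenya", "Kenya"),
  year = c(2002, 2007, 2002, 2007),
  stringsAsFactors = FALSE
)

wanted <- c("Australia", "Kenya")

# %in% tests each row against the whole set of requested values
with_in <- dat[dat$country %in% wanted, ]
print(with_in)

# == compares element by element, recycling `wanted`,
# so some matching rows are silently dropped
with_equals <- dat[dat$country == wanted, ]
print(with_equals)
```

Here the == version keeps only the rows where the recycled `wanted` happens to line up with `country`, and R gives no warning because 4 is a multiple of 2.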

+
+
+ +
+
+

Tip: Pass by value +

+
+

Functions in R almost always make copies of the data to operate on +inside of a function body. When we modify dat inside the +function we are modifying the copy of the gapminder dataset stored in +dat, not the original variable we gave as the first +argument.

+

This is called “pass-by-value” and it makes writing code much safer: +you can always be sure that whatever changes you make within the body of +the function, stay inside the body of the function.
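A short sketch of this behaviour, using a hypothetical function `add_gdp` and a made-up two-row data frame:

```r
# Hypothetical function that modifies its argument
add_gdp <- function(dat) {
  dat$gdp <- dat$pop * dat$gdpPercap  # changes only the copy inside the function
  return(dat)
}

original <- data.frame(pop = c(10, 20), gdpPercap = c(2, 3))
modified <- add_gdp(original)

# The object we passed in is untouched; only the returned copy has the new column
print(names(original))
print(names(modified))
```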

+
+
+
+
+
+ +
+
+

Tip: Function scope +

+
+

Another important concept is scoping: any variables (or functions!) +you create or modify inside the body of a function only exist for the +lifetime of the function’s execution. When we call +calcGDP(), the variables dat, gdp +and new only exist inside the body of the function. Even if +we have variables of the same name in our interactive R session, they +are not modified in any way when executing a function.
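Scoping can be checked directly. In this sketch the function `calc_total` is hypothetical, and the session-level variable `gdp` is deliberately given the same name as a variable inside the function:

```r
# A variable in the interactive session
gdp <- "defined in the interactive session"

calc_total <- function(pop, gdpPercap) {
  gdp <- pop * gdpPercap  # a new, local `gdp` that exists only during this call
  return(gdp)
}

result <- calc_total(10, 2)
print(result)

# The session-level `gdp` was not modified by the call
print(gdp)
```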

+
+
+
+
+

R +

+
  gdp <- dat$pop * dat$gdpPercap
+  new <- cbind(dat, gdp=gdp)
+  return(new)
+}
+
+

Finally, we calculated the GDP on our new subset, and created a new +data frame with that column added. This means when we call the function +later we can see the context for the returned GDP values, which is much +better than in our first attempt where we got a vector of numbers.

+
+
+ +
+
+

Challenge 4 +

+
+

Test out your GDP function by calculating the GDP for New Zealand in +1987. How does this differ from New Zealand’s GDP in 1952?

+
+
+
+
+
+ +
+
+
+

R +

+
+  calcGDP(gapminder, year = c(1952, 1987), country = "New Zealand")
+
+

GDP for New Zealand in 1987: 65050008703

+

GDP for New Zealand in 1952: 21058193787

+
+
+
+
+
+
+ +
+
+

Challenge 5 +

+
+

The paste() function can be used to combine text together, e.g.:

+
+

R +

+
+best_practice <- c("Write", "programs", "for", "people", "not", "computers")
+paste(best_practice, collapse=" ")
+
+
+

OUTPUT +

+
[1] "Write programs for people not computers"
+
+

Write a function called fence() that takes two vectors +as arguments, called text and wrapper, and +prints out the text wrapped with the wrapper:

+
+

R +

+
+fence(text=best_practice, wrapper="***")
+
+

Note: the paste() function has an argument called sep, which specifies the separator between text. The default is a space: " ". The default for paste0() is no space: "".

+
+
+
+
+
+ +
+
+

Write a function called fence() that takes two vectors +as arguments, called text and wrapper, and +prints out the text wrapped with the wrapper:

+
+

R +

+
+fence <- function(text, wrapper){
+  text <- c(wrapper, text, wrapper)
+  result <- paste(text, collapse = " ")
+  return(result)
+}
+best_practice <- c("Write", "programs", "for", "people", "not", "computers")
+fence(text=best_practice, wrapper="***")
+
+
+

OUTPUT +

+
[1] "*** Write programs for people not computers ***"
+
+
+
+
+
+
+
+ +
+
+

Tip +

+
+

R has some unique aspects that can be exploited when performing more +complicated operations. We will not be writing anything that requires +knowledge of these more advanced concepts. In the future when you are +comfortable writing functions in R, you can learn more by reading the R +Language Manual or this chapter from Advanced R Programming by Hadley +Wickham.

+
+
+
+
+
+ +
+
+

Tip: Testing and documenting +

+
+

It’s important to both test functions and document them: documentation helps you, and others, understand what the purpose of your function is and how to use it, and it’s important to make sure that your function actually does what you think.

+

When you first start out, your workflow will probably look a lot like +this:

+
  1. Write a function
  2. +
  3. Comment parts of the function to document its behaviour
  4. +
  5. Load in the source file
  6. +
  7. Experiment with it in the console to make sure it behaves as you +expect
  8. +
  9. Make any necessary bug fixes
  10. +
  11. Rinse and repeat.
  12. +

Formal documentation for functions, written in separate .Rd files, gets turned into the documentation you see in help files. The roxygen2 package allows R coders to write documentation alongside the function code and then process it into the appropriate .Rd files. You will want to switch to this more formal method of writing documentation when you start writing more complicated R projects. In fact, packages are, in essence, bundles of functions with this formal documentation. Loading your own functions through source("functions.R") is similar to loading someone else’s functions (or your own one day!) through library("package").
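As a rough sketch of what documentation written alongside the code looks like, the block below uses roxygen2-style comments (lines starting with `#'`). Outside a package these are ordinary comments, so the code runs in plain R; the wording of the documentation is an assumption made for illustration:

```r
#' Convert Fahrenheit to Kelvin
#'
#' @param temp A numeric vector of temperatures in degrees Fahrenheit.
#' @return A numeric vector of temperatures in Kelvin.
#' @examples
#' fahr_to_kelvin(32)
fahr_to_kelvin <- function(temp) {
  stopifnot(is.numeric(temp))
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

# The comments above do not change how the function behaves
print(fahr_to_kelvin(32))
```

roxygen2 only turns these comments into .Rd files when it is run on a package, so in a standalone script they serve as structured notes for readers.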

+

Formal automated tests can be written using the testthat package.

+
+
+
+
+
+ +
+
+

Keypoints +

+
+
  • Use function to define a new function in R.
  • +
  • Use parameters to pass values into functions.
  • +
  • Use stopifnot() to flexibly check function arguments in +R.
  • +
  • Load functions into programs using source().
  • +
+
+
+
+
+ + +
+
+ + + diff --git a/instructor/11-writing-data.html b/instructor/11-writing-data.html new file mode 100644 index 000000000..c536390e7 --- /dev/null +++ b/instructor/11-writing-data.html @@ -0,0 +1,688 @@ + +R for Reproducible Scientific Analysis: Writing Data +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Writing Data

+

Last updated on 2023-10-26

+ + + +

Estimated time 20 minutes

+ +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I save plots and data created in R?
  • +
+
+
+
+
+
+

Objectives

+
  • To be able to write out plots and data from R.
  • +
+
+
+
+
+

Saving plots +

+

You have already seen how to save the most recent plot you create in +ggplot2, using the command ggsave. As a +refresher:

+
+

R +

+
+ggsave("My_most_recent_plot.pdf")
+
+

You can save a plot from within RStudio using the ‘Export’ button in +the ‘Plot’ window. This will give you the option of saving as a .pdf or +as .png, .jpg or other image formats.

+

Sometimes you will want to save plots without creating them in the +‘Plot’ window first. Perhaps you want to make a pdf document with +multiple pages: each one a different plot, for example. Or perhaps +you’re looping through multiple subsets of a file, plotting data from +each subset, and you want to save each plot, but obviously can’t stop +the loop to click ‘Export’ for each one.

+

In this case you can use a more flexible approach. The function +pdf creates a new pdf device. You can control the size and +resolution using the arguments to this function.

+
+

R +

+
+pdf("Life_Exp_vs_time.pdf", width=12, height=4)
+ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=country)) +
+  geom_line() +
+  theme(legend.position = "none")
+
+# You then have to make sure to turn off the pdf device!
+
+dev.off()
+
+

Open up this document and have a look.
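The looping case mentioned earlier can be sketched as follows. To keep the block self-contained it uses base graphics and a made-up data frame `dat` rather than ggplot2 and gapminder, and it writes to a temporary file; all of these are assumptions for illustration:

```r
# Made-up data standing in for gapminder
dat <- data.frame(
  group = rep(c("A", "B"), each = 3),
  year = rep(c(2000, 2005, 2010), times = 2),
  value = c(1, 2, 3, 2, 4, 8),
  stringsAsFactors = FALSE
)

# tempfile() keeps the sketch from writing into your project directory
out_file <- tempfile(fileext = ".pdf")

pdf(out_file, width = 6, height = 4)
for (g in unique(dat$group)) {
  subset_g <- dat[dat$group == g, ]
  # Each new base-graphics plot starts a new page in the pdf device
  plot(subset_g$year, subset_g$value, type = "l", main = g)
}
# Close the device so the file is written completely
invisible(dev.off())

print(file.exists(out_file))
```

If you adapt this to ggplot2, note that a ggplot object created inside a for loop is not drawn automatically; wrap it in print() so that it is sent to the open device.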

+
+
+ +
+
+

Challenge 1 +

+
+

Rewrite your ‘pdf’ command to print a second page in the pdf, showing +a facet plot (hint: use facet_grid) of the same data with +one panel per continent.

+
+
+
+
+
+ +
+
+
+

R +

+
+pdf("Life_Exp_vs_time.pdf", width = 12, height = 4)
+p <- ggplot(data = gapminder, aes(x = year, y = lifeExp, colour = country)) +
+  geom_line() +
+  theme(legend.position = "none")
+p
+p + facet_grid(~continent)
+dev.off()
+
+
+
+
+
+

The commands jpeg, png etc. are used +similarly to produce documents in different formats.

+

Writing data +

+

At some point, you’ll also want to write out data from R.

+

We can use the write.table function for this, which is +very similar to read.table from before.

+

Let’s create a data-cleaning script. For this analysis, we only want to focus on the gapminder data for Australia:

+
+

R +

+
+aust_subset <- gapminder[gapminder$country == "Australia",]
+
+write.table(aust_subset,
+  file="cleaned-data/gapminder-aus.csv",
+  sep=","
+)
+
+

Let’s switch back to the shell to take a look at the data to make +sure it looks OK:

+
+

BASH +

+
head cleaned-data/gapminder-aus.csv
+
+
+

OUTPUT +

+
"country","year","pop","continent","lifeExp","gdpPercap"
+"61","Australia",1952,8691212,"Oceania",69.12,10039.59564
+"62","Australia",1957,9712569,"Oceania",70.33,10949.64959
+"63","Australia",1962,10794968,"Oceania",70.93,12217.22686
+"64","Australia",1967,11872264,"Oceania",71.1,14526.12465
+"65","Australia",1972,13177000,"Oceania",71.93,16788.62948
+"66","Australia",1977,14074100,"Oceania",73.49,18334.19751
+"67","Australia",1982,15184200,"Oceania",74.74,19477.00928
+"68","Australia",1987,16257249,"Oceania",76.32,21888.88903
+"69","Australia",1992,17481977,"Oceania",77.56,23424.76683
+
+

Hmm, that’s not quite what we wanted. Where did all these quotation +marks come from? Also the row numbers are meaningless.

+

Let’s look at the help file to work out how to change this +behaviour.

+
+

R +

+
+?write.table
+
+

By default R will wrap character vectors with quotation marks when +writing out to file. It will also write out the row and column +names.

+

Let’s fix this:

+
+

R +

+
+write.table(
+  gapminder[gapminder$country == "Australia",],
+  file="cleaned-data/gapminder-aus.csv",
+  sep=",", quote=FALSE, row.names=FALSE
+)
+
+

Now let’s look at the data again using our shell skills:

+
+

BASH +

+
head cleaned-data/gapminder-aus.csv
+
+
+

OUTPUT +

+
country,year,pop,continent,lifeExp,gdpPercap
+Australia,1952,8691212,Oceania,69.12,10039.59564
+Australia,1957,9712569,Oceania,70.33,10949.64959
+Australia,1962,10794968,Oceania,70.93,12217.22686
+Australia,1967,11872264,Oceania,71.1,14526.12465
+Australia,1972,13177000,Oceania,71.93,16788.62948
+Australia,1977,14074100,Oceania,73.49,18334.19751
+Australia,1982,15184200,Oceania,74.74,19477.00928
+Australia,1987,16257249,Oceania,76.32,21888.88903
+Australia,1992,17481977,Oceania,77.56,23424.76683
+
+

That looks better!
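The written file can also be checked from within R rather than the shell. This is a minimal sketch using a made-up data frame `aust_like` and a temporary file, both of which are assumptions for illustration:

```r
# Made-up data standing in for the Australian subset
aust_like <- data.frame(
  country = c("Australia", "Australia"),
  year = c(1952, 1957),
  stringsAsFactors = FALSE
)

out_file <- tempfile(fileext = ".csv")

write.table(
  aust_like,
  file = out_file,
  sep = ",", quote = FALSE, row.names = FALSE
)

# readLines() shows the raw text of the file, much like `head` in the shell
written <- readLines(out_file)
print(written)

# Reading the file back confirms it parses as a table
round_trip <- read.csv(out_file)
print(nrow(round_trip))
```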

+
+
+ +
+
+

Challenge 2 +

+
+

Write a data-cleaning script file that subsets the gapminder data to +include only data points collected since 1990.

+

Use this script to write out the new subset to a file in the +cleaned-data/ directory.

+
+
+
+
+
+ +
+
+
+

R +

+
+write.table(
+  gapminder[gapminder$year > 1990, ],
+  file = "cleaned-data/gapminder-after1990.csv",
+  sep = ",", quote = FALSE, row.names = FALSE
+)
+
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
  • Save plots from RStudio using the ‘Export’ button.
  • +
  • Use write.table to save tabular data.
  • +
+
+
+
+
+ + +
+
+ + + diff --git a/instructor/12-plyr.html b/instructor/12-plyr.html new file mode 100644 index 000000000..77fa8c1cf --- /dev/null +++ b/instructor/12-plyr.html @@ -0,0 +1,1012 @@ + +R for Reproducible Scientific Analysis: Splitting and Combining Data Frames with plyr +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Splitting and Combining Data Frames with plyr

+

Last updated on 2023-10-26

+ + + +

Estimated time 60 minutes

+ +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I do different calculations on different sets of data?
  • +
+
+
+
+
+
+

Objectives

+
  • To be able to use the split-apply-combine strategy for data +analysis.
  • +
+
+
+
+
+

Previously we looked at how you can use functions to simplify your code. We defined the calcGDP function, which takes the gapminder dataset, and multiplies the population and GDP per capita columns. We also defined additional arguments so we could filter by year and country:

+
+

R +

+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year=NULL, country=NULL) {
+  if(!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country,]
+  }
+  gdp <- dat$pop * dat$gdpPercap
+
+  new <- cbind(dat, gdp=gdp)
+  return(new)
+}
+
+

A common task you’ll encounter when working with data is running calculations on different groups within the data. In the above, we were calculating the GDP by multiplying two columns together. But what if we wanted to calculate the mean GDP per continent?

+

We could run calcGDP and then take the mean of each +continent:

+
+

R +

+
+withGDP <- calcGDP(gapminder)
+mean(withGDP[withGDP$continent == "Africa", "gdp"])
+
+
+

OUTPUT +

+
[1] 20904782844
+
+
+

R +

+
+mean(withGDP[withGDP$continent == "Americas", "gdp"])
+
+
+

OUTPUT +

+
[1] 379262350210
+
+
+

R +

+
+mean(withGDP[withGDP$continent == "Asia", "gdp"])
+
+
+

OUTPUT +

+
[1] 227233738153
+
+

But this isn’t very nice. Yes, by using a function, you have +reduced a substantial amount of repetition. That is +nice. But there is still repetition. Repeating yourself will cost you +time, both now and later, and potentially introduce some nasty bugs.

+

We could write a new function that is flexible like +calcGDP, but this also takes a substantial amount of effort +and testing to get right.

+

The abstract problem we’re encountering here is known as “split-apply-combine”:

+
Split apply combine

We want to split our data into groups, in this case +continents, apply some calculations on that group, then +optionally combine the results together afterwards.
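The three steps can be written out by hand in base R, which may help make the pattern concrete. This sketch uses a made-up data frame `dat` rather than gapminder, and the output column name `mean_gdp` is an arbitrary choice:

```r
# Made-up data standing in for gapminder with a gdp column
dat <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  gdp = c(10, 30, 100, 300),
  stringsAsFactors = FALSE
)

# Split: one vector of gdp values per continent
pieces <- split(dat$gdp, dat$continent)

# Apply: run the same calculation on every piece
means <- lapply(pieces, mean)

# Combine: gather the per-group results into one data frame
combined <- data.frame(
  continent = names(means),
  mean_gdp = unlist(means, use.names = FALSE),
  stringsAsFactors = FALSE
)
print(combined)
```

Each step is simple on its own, but keeping track of the intermediate objects quickly becomes tedious as the number of grouping variables grows.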

+

The plyr package +

+

For those of you who have used R before, you might be familiar with the apply family of functions. While R’s built-in functions do work, we’re going to introduce you to another method for solving the “split-apply-combine” problem. The plyr package provides a set of functions that we find more user friendly for solving this problem.

+

We installed this package in an earlier challenge. Let us load it +now:

+
+

R +

+
+library("plyr")
+
+

Plyr has functions for operating on lists, +data.frames and arrays (matrices, or +n-dimensional vectors). Each function performs:

+
  1. A splitting operation
  2. +
  3. +Apply a function on each split in turn.
  4. +
  5. Recombine output data as a single data object.
  6. +

The functions are named based on the data structure they expect as +input, and the data structure you want returned as output: [a]rray, +[l]ist, or [d]ata.frame. The first letter corresponds to the input data +structure, the second letter to the output data structure, and then the +rest of the function is named “ply”.

+

This gives us 9 core functions **ply. There are an additional three +functions which will only perform the split and apply steps, and not any +combine step. They’re named by their input data type and represent null +output by a _ (see table)

+

Note here that plyr’s use of “array” is different to R’s: an array in plyr can include a vector or matrix.

+
Full apply suite

Each of the xxply functions (daply, ddply, llply, laply, …) has the same structure, with four key features:

+
+

R +

+
+xxply(.data, .variables, .fun)
+
+
  • The first letter of the function name gives the input type and the +second gives the output type.
  • +
  • .data - gives the data object to be processed
  • +
  • .variables - identifies the splitting variables
  • +
  • .fun - gives the function to be called on each piece
  • +

Now we can quickly calculate the mean GDP per continent:

+
+

R +

+
+ddply(
+ .data = calcGDP(gapminder),
+ .variables = "continent",
+ .fun = function(x) mean(x$gdp)
+)
+
+
+

OUTPUT +

+
  continent           V1
+1    Africa  20904782844
+2  Americas 379262350210
+3      Asia 227233738153
+4    Europe 269442085301
+5   Oceania 188187105354
+
+

Let us walk through the previous code:

+
  • The ddply function feeds in a data.frame +(function starts with d) and returns another +data.frame (2nd letter is a d)
  • +
The first argument we gave was the data.frame we wanted to operate on: in this case the gapminder data. We called calcGDP on it first so that it would have the additional gdp column added to it.
  • +
  • The second argument indicated our split criteria: in this case the +“continent” column. Note that we gave the name of the column, not the +values of the column like we had done previously with subsetting. Plyr +takes care of these implementation details for you.
  • +
  • The third argument is the function we want to apply to each grouping +of the data. We had to define our own short function here: each subset +of the data gets stored in x, the first argument of our +function. This is an anonymous function: we haven’t defined it +elsewhere, and it has no name. It only exists in the scope of our call +to ddply.
  • +
+
+ +
+
+

Challenge 1 +

+
+

Calculate the average life expectancy per continent. Which has the +longest? Which has the shortest?

+
+
+
+
+
+ +
+
+
+

R +

+
+ddply(
+ .data = gapminder,
+ .variables = "continent",
+ .fun = function(x) mean(x$lifeExp)
+)
+
+

Oceania has the longest and Africa the shortest.

+
+
+
+
+

What if we want a different type of output data structure?

+
+

R +

+
+dlply(
+ .data = calcGDP(gapminder),
+ .variables = "continent",
+ .fun = function(x) mean(x$gdp)
+)
+
+
+

OUTPUT +

+
$Africa
+[1] 20904782844
+
+$Americas
+[1] 379262350210
+
+$Asia
+[1] 227233738153
+
+$Europe
+[1] 269442085301
+
+$Oceania
+[1] 188187105354
+
+attr(,"split_type")
+[1] "data.frame"
+attr(,"split_labels")
+  continent
+1    Africa
+2  Americas
+3      Asia
+4    Europe
+5   Oceania
+
+

We called the same function again, but changed the second letter to +an l, so the output was returned as a list.

+

We can specify multiple columns to group by:

+
+

R +

+
+ddply(
+ .data = calcGDP(gapminder),
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$gdp)
+)
+
+
+

OUTPUT +

+
   continent year           V1
+1     Africa 1952   5992294608
+2     Africa 1957   7359188796
+3     Africa 1962   8784876958
+4     Africa 1967  11443994101
+5     Africa 1972  15072241974
+6     Africa 1977  18694898732
+7     Africa 1982  22040401045
+8     Africa 1987  24107264108
+9     Africa 1992  26256977719
+10    Africa 1997  30023173824
+11    Africa 2002  35303511424
+12    Africa 2007  45778570846
+13  Americas 1952 117738997171
+14  Americas 1957 140817061264
+15  Americas 1962 169153069442
+16  Americas 1967 217867530844
+17  Americas 1972 268159178814
+18  Americas 1977 324085389022
+19  Americas 1982 363314008350
+20  Americas 1987 439447790357
+21  Americas 1992 489899820623
+22  Americas 1997 582693307146
+23  Americas 2002 661248623419
+24  Americas 2007 776723426068
+25      Asia 1952  34095762661
+26      Asia 1957  47267432088
+27      Asia 1962  60136869012
+28      Asia 1967  84648519224
+29      Asia 1972 124385747313
+30      Asia 1977 159802590186
+31      Asia 1982 194429049919
+32      Asia 1987 241784763369
+33      Asia 1992 307100497486
+34      Asia 1997 387597655323
+35      Asia 2002 458042336179
+36      Asia 2007 627513635079
+37    Europe 1952  84971341466
+38    Europe 1957 109989505140
+39    Europe 1962 138984693095
+40    Europe 1967 173366641137
+41    Europe 1972 218691462733
+42    Europe 1977 255367522034
+43    Europe 1982 279484077072
+44    Europe 1987 316507473546
+45    Europe 1992 342703247405
+46    Europe 1997 383606933833
+47    Europe 2002 436448815097
+48    Europe 2007 493183311052
+49   Oceania 1952  54157223944
+50   Oceania 1957  66826828013
+51   Oceania 1962  82336453245
+52   Oceania 1967 105958863585
+53   Oceania 1972 134112109227
+54   Oceania 1977 154707711162
+55   Oceania 1982 176177151380
+56   Oceania 1987 209451563998
+57   Oceania 1992 236319179826
+58   Oceania 1997 289304255183
+59   Oceania 2002 345236880176
+60   Oceania 2007 403657044512
+
+
+

R +

+
+daply(
+ .data = calcGDP(gapminder),
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$gdp)
+)
+
+
+

OUTPUT +

+
          year
+continent          1952         1957         1962         1967         1972
+  Africa     5992294608   7359188796   8784876958  11443994101  15072241974
+  Americas 117738997171 140817061264 169153069442 217867530844 268159178814
+  Asia      34095762661  47267432088  60136869012  84648519224 124385747313
+  Europe    84971341466 109989505140 138984693095 173366641137 218691462733
+  Oceania   54157223944  66826828013  82336453245 105958863585 134112109227
+          year
+continent          1977         1982         1987         1992         1997
+  Africa    18694898732  22040401045  24107264108  26256977719  30023173824
+  Americas 324085389022 363314008350 439447790357 489899820623 582693307146
+  Asia     159802590186 194429049919 241784763369 307100497486 387597655323
+  Europe   255367522034 279484077072 316507473546 342703247405 383606933833
+  Oceania  154707711162 176177151380 209451563998 236319179826 289304255183
+          year
+continent          2002         2007
+  Africa    35303511424  45778570846
+  Americas 661248623419 776723426068
+  Asia     458042336179 627513635079
+  Europe   436448815097 493183311052
+  Oceania  345236880176 403657044512
+
+

You can use these functions in place of for loops (and +it is usually faster to do so). To replace a for loop, put the code that +was in the body of the for loop inside an anonymous +function.

+
+

R +

+
+d_ply(
+  .data=gapminder,
+  .variables = "continent",
+  .fun = function(x) {
+    meanGDPperCap <- mean(x$gdpPercap)
+    print(paste(
+      "The mean GDP per capita for", unique(x$continent),
+      "is", format(meanGDPperCap, big.mark=",")
+   ))
+  }
+)
+
+
+

OUTPUT +

+
[1] "The mean GDP per capita for Africa is 2,193.755"
+[1] "The mean GDP per capita for Americas is 7,136.11"
+[1] "The mean GDP per capita for Asia is 7,902.15"
+[1] "The mean GDP per capita for Europe is 14,469.48"
+[1] "The mean GDP per capita for Oceania is 18,621.61"
+
+
+
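As a rough illustration of moving a loop body into an anonymous function, here is a minimal sketch. The data frame `toy` and its values are made up for this example (they are not gapminder data), and base R's `split()` and `sapply()` stand in for the plyr call so that the sketch runs without extra packages.

```r
# Hypothetical toy data, invented for this sketch
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia", "Asia"),
  lifeExp   = c(50, 54, 60, 66, 72)
)

# Version 1: an explicit for loop over the groups
loop_means <- numeric(0)
for (cont in unique(toy$continent)) {
  subset_rows <- toy[toy$continent == cont, ]
  loop_means[cont] <- mean(subset_rows$lifeExp)
}

# Version 2: the loop body moved into an anonymous function,
# applied to each piece produced by split()
apply_means <- sapply(
  split(toy, toy$continent),
  function(x) mean(x$lifeExp)
)

loop_means
apply_means
```

Both versions give one mean per continent; the second leaves the bookkeeping (subsetting, collecting results) to the split-apply step.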
+ +
+
+

Tip: printing numbers +

+
+

The format function can be used to make numeric values +“pretty” for printing out in messages.
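A few calls to `format()` may help show what "pretty" means here. This is a small sketch using base R only; `mean_gdp` reuses a value printed above, while `population` is an invented number.

```r
# mean_gdp is taken from the output above; population is invented
mean_gdp <- 2193.755
population <- 1234567

# big.mark inserts a thousands separator
gdp_text <- format(mean_gdp, big.mark = ",")
pop_text <- format(population, big.mark = ",")

# nsmall sets the minimum number of decimal places
padded <- format(3.5, nsmall = 2)

# round() before format() to control how many decimals are shown
rounded_text <- format(round(mean_gdp, 1), big.mark = ",")

gdp_text
pop_text
padded
rounded_text
```

Note that `format()` returns character strings, so the results are for display in messages and should not be used in further arithmetic.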

+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

Calculate the average life expectancy per continent and year. Which +had the longest and shortest in 2007? Which had the greatest change in +between 1952 and 2007?

+
+
+
+
+
+ +
+
+
+

R +

+
+solution <- ddply(
+ .data = gapminder,
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$lifeExp)
+)
+solution_2007 <- solution[solution$year == 2007, ]
+solution_2007
+
+

Oceania had the longest average life expectancy in 2007 and Africa +the lowest.

+
+

R +

+
+solution_1952_2007 <- cbind(solution[solution$year == 1952, ], solution_2007)
+difference_1952_2007 <- data.frame(continent = solution_1952_2007$continent,
+                                   year_1952 = solution_1952_2007[[3]],
+                                   year_2007 = solution_1952_2007[[6]],
+                                   difference = solution_1952_2007[[6]] - solution_1952_2007[[3]])
+difference_1952_2007
+
+

Asia had the greatest difference, and Oceania the least.

+
+
+
+
+
+
+ +
+
+

Alternate Challenge +

+
+

Without running them, which of the following will calculate the +average life expectancy per continent:

+
  1. +
+

R +

+
+ddply(
+  .data = gapminder,
+  .variables = gapminder$continent,
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)
+
+
  2. +
+

R +

+
+ddply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = mean(dataGroup$lifeExp)
+)
+
+
  3. +
+

R +

+
+ddply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)
+
+
  4. +
+

R +

+
+adply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)
+
+
+
+
+
+
+ +
+
+

Answer 3 will calculate the average life expectancy per +continent.

+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
  • Use the plyr package to split data, apply functions to +subsets, and combine the results.
  • +
+
+
+
+
+ + +
+
+ + + diff --git a/instructor/13-dplyr.html b/instructor/13-dplyr.html new file mode 100644 index 000000000..048694649 --- /dev/null +++ b/instructor/13-dplyr.html @@ -0,0 +1,1240 @@ + +R for Reproducible Scientific Analysis: Data Frame Manipulation with dplyr +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Data Frame Manipulation with dplyr

+

Last updated on 2023-10-26

+ + + +

Estimated time 55 minutes

+ +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I manipulate data frames without repeating myself?
  • +
+
+
+
+
+
+

Objectives

+
  • To be able to use the six main data frame manipulation ‘verbs’ with +pipes in dplyr.
  • +
  • To understand how group_by() and +summarize() can be combined to summarize datasets.
  • +
  • Be able to analyze a subset of data using logical filtering.
  • +
+
+
+
+
+

Manipulation of data frames means many things to many researchers: we +often select certain observations (rows) or variables (columns), we +often group the data by a certain variable(s), or we even calculate +summary statistics. We can do these operations using the normal base R +operations:

+
+

R +

+
+mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
+
+
+

OUTPUT +

+
[1] 2193.755
+
+
+

R +

+
+mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])
+
+
+

OUTPUT +

+
[1] 7136.11
+
+
+

R +

+
+mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])
+
+
+

OUTPUT +

+
[1] 7902.15
+
+

But this isn’t very nice because there is a fair bit of +repetition. Repeating yourself will cost you time, both now and later, +and potentially introduce some nasty bugs.

+

The dplyr package +

+

Luckily, the dplyr +package provides a number of very useful functions for manipulating data +frames in a way that will reduce the above repetition, reduce the +probability of making errors, and probably even save you some typing. As +an added bonus, you might even find the dplyr grammar +easier to read.

+
+
+ +
+
+

Tip: Tidyverse +

+
+

The dplyr package belongs to a broader family of opinionated +R packages designed for data science called the “Tidyverse”. These +packages are specifically designed to work harmoniously together. Some +of these packages will be covered in this course, but you can find +more complete information here: https://www.tidyverse.org/.

+
+
+
+

Here we’re going to cover 5 of the most commonly used functions as +well as using pipes (%>%) to combine them.

+
  1. select()
  2. filter()
  3. group_by()
  4. summarize()
  5. mutate()

If you have not installed this package earlier, please do +so:

+
+

R +

+
+install.packages('dplyr')
+
+

Now let’s load the package:

+
+

R +

+
+library("dplyr")
+
+

Using select() +

+

If, for example, we wanted to move forward with only a few of the +variables in our data frame we could use the select() +function. This will keep only the variables you select.

+
+

R +

+
+year_country_gdp <- select(gapminder, year, country, gdpPercap)
+
+

Diagram illustrating use of select function to select two columns of a data frame +We can also remove a single column from the gapminder +data, for example the continent column, by prefixing its name with a minus sign:

+
+

R +

+
+smaller_gapminder_data <- select(gapminder, -continent)
+
+

If we open up year_country_gdp we’ll see that it only +contains the year, country and gdpPercap. Above we used ‘normal’ +grammar, but the strengths of dplyr lie in combining +several functions using pipes. Since the pipes grammar is unlike +anything we’ve seen in R before, let’s repeat what we’ve done above +using pipes.

+
+

R +

+
+year_country_gdp <- gapminder %>% select(year, country, gdpPercap)
+
+

To help you understand why we wrote that in that way, let’s walk +through it step by step. First we summon the gapminder data frame and +pass it on, using the pipe symbol %>%, to the next step, +which is the select() function. In this case we don’t +specify which data object we use in the select() function +since it gets that from the previous pipe. Fun Fact: +There is a good chance you have encountered pipes before in the shell. +In R, a pipe symbol is %>% while in the shell it is +| but the concept is the same!
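To make the pipe step concrete, the following minimal sketch compares a direct call with a piped call. The data frame `toy` and its values are invented for illustration; only `select()` and `%>%` from dplyr are used.

```r
library(dplyr)

# Hypothetical toy data, invented for this sketch
toy <- data.frame(
  country   = c("A", "B", "C"),
  year      = c(2002, 2002, 2007),
  gdpPercap = c(1000, 2500, 4000)
)

# Without a pipe: the data frame is written as the first argument
nested <- select(toy, year, gdpPercap)

# With a pipe: the left-hand side is passed in as that first argument
piped <- toy %>% select(year, gdpPercap)

# Both calls produce the same data frame
identical(nested, piped)
```

The two forms differ only in how they read: the piped version lists the steps in the order they happen.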

+
+
+ +
+
+

Tip: Renaming data frame columns in dplyr +

+
+

In Chapter 4 we covered how you can rename columns with base R by +assigning a value to the output of the names() function. +As with selecting columns in base R, this is a bit cumbersome, but thankfully dplyr has a +rename() function.

+

Within a pipeline, the syntax is +rename(new_name = old_name). For example, we may want to +rename the gdpPercap column name from our select() +statement above.

+
+

R +

+
+tidy_gdp <- year_country_gdp %>% rename(gdp_per_capita = gdpPercap)
+
+head(tidy_gdp)
+
+
+

OUTPUT +

+
  year     country gdp_per_capita
+1 1952 Afghanistan       779.4453
+2 1957 Afghanistan       820.8530
+3 1962 Afghanistan       853.1007
+4 1967 Afghanistan       836.1971
+5 1972 Afghanistan       739.9811
+6 1977 Afghanistan       786.1134
+
+
+
+
+

Using filter() +

+

If we now want to move forward with the above, but only with European +countries, we can combine select and +filter:

+
+

R +

+
+year_country_gdp_euro <- gapminder %>%
+    filter(continent == "Europe") %>%
+    select(year, country, gdpPercap)
+
+

If we now want to show life expectancy of European countries but only +for a specific year (e.g., 2007), we can do as below.

+
+

R +

+
+europe_lifeExp_2007 <- gapminder %>%
+  filter(continent == "Europe", year == 2007) %>%
+  select(country, lifeExp)
+
+
+
+ +
+
+

Challenge 1 +

+
+

Write a single command (which can span multiple lines and includes +pipes) that will produce a data frame that has the African values for +lifeExp, country and year, but +not for other continents. How many rows does your data frame have and +why?

+
+
+
+
+
+ +
+
+
+

R +

+
+year_country_lifeExp_Africa <- gapminder %>%
+                           filter(continent == "Africa") %>%
+                           select(year, country, lifeExp)
+
+
+
+
+
+

As with last time, first we pass the gapminder data frame to the +filter() function, then we pass the filtered version of the +gapminder data frame to the select() function. +Note: The order of operations is very important in this +case. If we used ‘select’ first, filter would not be able to find the +variable continent since we would have removed it in the previous +step.
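The note about ordering can be checked with a small sketch. The data frame `toy` is invented for illustration, and `tryCatch()` is used here only so that the failing pipeline does not stop the script.

```r
library(dplyr)

# Hypothetical toy data, invented for this sketch
toy <- data.frame(
  country   = c("Kenya", "Chad", "Japan"),
  continent = c("Africa", "Africa", "Asia"),
  lifeExp   = c(54, 50, 82)
)

# filter() first: the continent column is still available
ok <- toy %>%
  filter(continent == "Africa") %>%
  select(country, lifeExp)

# select() first drops continent, so the later filter() fails;
# tryCatch() captures the error instead of halting
wrong_order <- tryCatch(
  toy %>%
    select(country, lifeExp) %>%
    filter(continent == "Africa"),
  error = function(e) "error: continent is no longer in the data"
)

ok
wrong_order
```

This assumes there is no separate object called `continent` in your workspace; if there were, `filter()` could find that object instead and the second pipeline would behave differently.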

+

Using group_by() +

+

Now, we were supposed to be reducing the error-prone repetitiveness +of what can be done with base R, but up to now we haven’t done that +since we would have to repeat the above for each continent. Instead of +filter(), which will only pass observations that meet your +criteria (in the above: continent=="Europe"), we can use +group_by(), which will essentially use every unique +criterion that you could have used in filter.

+
+

R +

+
+str(gapminder)
+
+
+

OUTPUT +

+
'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...
+
+
+

R +

+
+str(gapminder %>% group_by(continent))
+
+
+

OUTPUT +

+
gropd_df [1,704 × 6] (S3: grouped_df/tbl_df/tbl/data.frame)
+ $ country  : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num [1:1704] 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr [1:1704] "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
+ - attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
+  ..$ continent: chr [1:5] "Africa" "Americas" "Asia" "Europe" ...
+  ..$ .rows    : list<int> [1:5] 
+  .. ..$ : int [1:624] 25 26 27 28 29 30 31 32 33 34 ...
+  .. ..$ : int [1:300] 49 50 51 52 53 54 55 56 57 58 ...
+  .. ..$ : int [1:396] 1 2 3 4 5 6 7 8 9 10 ...
+  .. ..$ : int [1:360] 13 14 15 16 17 18 19 20 21 22 ...
+  .. ..$ : int [1:24] 61 62 63 64 65 66 67 68 69 70 ...
+  .. ..@ ptype: int(0) 
+  ..- attr(*, ".drop")= logi TRUE
+
+

You will notice that the structure of the data frame where we used +group_by() (grouped_df) is not the same as the +original gapminder (data.frame). A +grouped_df can be thought of as a list where +each item in the list is a data.frame which +contains only the rows that correspond to a particular value of +continent (at least in the example above).

+
Diagram illustrating how the group by function organizes a data frame into groups
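One way to look inside a grouped data frame is with dplyr's helper functions `n_groups()`, `group_keys()` and `group_size()`. The sketch below uses an invented data frame `toy`; as the `str()` output above suggests, the grouping is recorded as extra information attached to the data rather than as physically separate tables.

```r
library(dplyr)

# Hypothetical toy data, invented for this sketch
toy <- data.frame(
  country   = c("Kenya", "Chad", "Japan", "India", "Peru"),
  continent = c("Africa", "Africa", "Asia", "Asia", "Americas"),
  lifeExp   = c(54, 50, 82, 64, 71)
)

grouped <- toy %>% group_by(continent)

n_groups(grouped)     # how many groups there are
group_keys(grouped)   # one row per group, showing the grouping values
group_size(grouped)   # number of rows in each group

# Grouping does not add or remove rows
nrow(grouped) == nrow(toy)
```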

Using summarize() +

+

The above was a bit on the uneventful side but +group_by() is much more exciting in conjunction with +summarize(). This will allow us to create new variable(s) +by using functions that repeat for each of the continent-specific data +frames. That is to say, using the group_by() function, we +split our original data frame into multiple pieces, then we can run +functions (e.g. mean() or sd()) within +summarize().

+
+

R +

+
+gdp_bycontinents <- gapminder %>%
+    group_by(continent) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap))
+
+
Diagram illustrating the use of group by and summarize together to create a new variable
+

OUTPUT +

+
continent mean_gdpPercap
+     <fctr>          <dbl>
+1    Africa       2193.755
+2  Americas       7136.110
+3      Asia       7902.150
+4    Europe      14469.476
+5   Oceania      18621.609
+
+

That allowed us to calculate the mean gdpPercap for each continent, +but it gets even better.

+
+
+ +
+
+

Challenge 2 +

+
+

Calculate the average life expectancy per country. Which has the +longest average life expectancy and which has the shortest average life +expectancy?

+
+
+
+
+
+ +
+
+
+

R +

+
+lifeExp_bycountry <- gapminder %>%
+   group_by(country) %>%
+   summarize(mean_lifeExp = mean(lifeExp))
+lifeExp_bycountry %>%
+   filter(mean_lifeExp == min(mean_lifeExp) | mean_lifeExp == max(mean_lifeExp))
+
+
+

OUTPUT +

+
# A tibble: 2 × 2
+  country      mean_lifeExp
+  <chr>               <dbl>
+1 Iceland              76.5
+2 Sierra Leone         36.8
+
+

Another way to do this is to use the dplyr function +arrange(), which arranges the rows in a data frame +according to the order of one or more variables from the data frame. It +has similar syntax to other functions from the dplyr +package. You can use desc() inside arrange() +to sort in descending order.

+
+

R +

+
+lifeExp_bycountry %>%
+   arrange(mean_lifeExp) %>%
+   head(1)
+
+
+

OUTPUT +

+
# A tibble: 1 × 2
+  country      mean_lifeExp
+  <chr>               <dbl>
+1 Sierra Leone         36.8
+
+
+

R +

+
+lifeExp_bycountry %>%
+   arrange(desc(mean_lifeExp)) %>%
+   head(1)
+
+
+

OUTPUT +

+
# A tibble: 1 × 2
+  country mean_lifeExp
+  <chr>          <dbl>
+1 Iceland         76.5
+
+

Alphabetical order works too:

+
+

R +

+
+lifeExp_bycountry %>%
+   arrange(desc(country)) %>%
+   head(1)
+
+
+

OUTPUT +

+
# A tibble: 1 × 2
+  country  mean_lifeExp
+  <chr>           <dbl>
+1 Zimbabwe         52.7
+
+
+
+
+
+

The function group_by() allows us to group by multiple +variables. Let’s group by year and +continent.

+
+

R +

+
+gdp_bycontinents_byyear <- gapminder %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+

That is already quite powerful, but it gets even better! You’re not +limited to defining 1 new variable in summarize().

+
+

R +

+
+gdp_pop_bycontinents_byyear <- gapminder %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
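The message shown above mentions the `.groups` argument. As a minimal sketch of what it controls, the example below uses an invented data frame `toy` and assumes dplyr version 1.0.0 or later, where `summarize()` accepts `.groups`.

```r
library(dplyr)

# Hypothetical toy data, invented for this sketch
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  year      = c(2002, 2007, 2002, 2007),
  gdpPercap = c(1000, 1200, 3000, 3600)
)

# Default: the result stays grouped by continent (and the message is printed)
still_grouped <- toy %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap))

# .groups = "drop" returns an ungrouped result and silences the message
ungrouped <- toy %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap), .groups = "drop")

group_vars(still_grouped)
group_vars(ungrouped)
```

Dropping the grouping is often the safer choice when later steps should treat the summary as an ordinary table.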

count() and n() +

+

A very common operation is to count the number of observations for +each group. The dplyr package comes with two related +functions that help with this.

+

For instance, if we wanted to check the number of countries included +in the dataset for the year 2002, we can use the count() +function. It takes the name of one or more columns that contain the +groups we are interested in, and we can optionally sort the results in +descending order by adding sort=TRUE:

+
+

R +

+
+gapminder %>%
+    filter(year == 2002) %>%
+    count(continent, sort = TRUE)
+
+
+

OUTPUT +

+
  continent  n
+1    Africa 52
+2      Asia 33
+3    Europe 30
+4  Americas 25
+5   Oceania  2
+
+

If we need to use the number of observations in calculations, the +n() function is useful. It will return the total number of +observations in the current group rather than counting the number of +observations in each group within a specific column. For instance, if we +wanted to get the standard error of the life expectancy per +continent:

+
+

R +

+
+gapminder %>%
+    group_by(continent) %>%
+    summarize(se_le = sd(lifeExp)/sqrt(n()))
+
+
+

OUTPUT +

+
# A tibble: 5 × 2
+  continent se_le
+  <chr>     <dbl>
+1 Africa    0.366
+2 Americas  0.540
+3 Asia      0.596
+4 Europe    0.286
+5 Oceania   0.775
+
+
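The relationship between `count()` and `n()` can be sketched on a small invented data frame (`toy` below is not gapminder data): counting rows per group with `count()` gives roughly the same result as grouping and then calling `n()` inside `summarize()`.

```r
library(dplyr)

# Hypothetical toy data, invented for this sketch
toy <- data.frame(
  country   = c("Kenya", "Chad", "Japan", "India", "Peru"),
  continent = c("Africa", "Africa", "Asia", "Asia", "Americas")
)

# Shortcut: count rows per continent
with_count <- toy %>% count(continent)

# Longer form: group, then ask for the group size with n()
with_n <- toy %>%
  group_by(continent) %>%
  summarize(n = n())

with_count
with_n
```

The longer form is the one to reach for when the group size is only one ingredient of a larger calculation, as in the standard error example above.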

You can also chain together several summary operations; in this case +calculating the minimum, maximum, +mean and standard error of each continent’s +life expectancy:

+
+

R +

+
+gapminder %>%
+    group_by(continent) %>%
+    summarize(
+      mean_le = mean(lifeExp),
+      min_le = min(lifeExp),
+      max_le = max(lifeExp),
+      se_le = sd(lifeExp)/sqrt(n()))
+
+
+

OUTPUT +

+
# A tibble: 5 × 5
+  continent mean_le min_le max_le se_le
+  <chr>       <dbl>  <dbl>  <dbl> <dbl>
+1 Africa       48.9   23.6   76.4 0.366
+2 Americas     64.7   37.6   80.7 0.540
+3 Asia         60.1   28.8   82.6 0.596
+4 Europe       71.9   43.6   81.8 0.286
+5 Oceania      74.3   69.1   81.2 0.775
+
+

Using mutate() +

+

We can also create new variables prior to (or even after) summarizing +information using mutate().

+
+

R +

+
+gdp_pop_bycontinents_byyear <- gapminder %>%
+    mutate(gdp_billion = gdpPercap*pop/10^9) %>%
+    group_by(continent,year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop),
+              mean_gdp_billion = mean(gdp_billion),
+              sd_gdp_billion = sd(gdp_billion))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+

Connect mutate with logical filtering: ifelse +

+

When creating new variables, we can combine this with a logical +condition. A simple combination of mutate() and +ifelse() facilitates filtering right where it is needed: at +the moment of creating something new. This easy-to-read statement is a +fast and powerful way of discarding certain data (even though the +overall dimension of the data frame will not change) or of updating +values depending on the given condition.

+
+

R +

+
+## keeping all rows but "filtering" on a certain condition
+# calculate GDP only for rows with a life expectancy above 25
+gdp_pop_bycontinents_byyear_above25 <- gapminder %>%
+    mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop),
+              mean_gdp_billion = mean(gdp_billion),
+              sd_gdp_billion = sd(gdp_billion))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
+

R +

+
+## updating only if a certain condition is fulfilled
+# for life expectancies above 40 years, the GDP expected in the future is scaled
+gdp_future_bycontinents_byyear_high_lifeExp <- gapminder %>%
+    mutate(gdp_futureExpectation = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              mean_gdpPercap_expected = mean(gdp_futureExpectation))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
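One consequence of the first `ifelse()` example above is worth spelling out: rows that fail the condition get `NA`, and `mean()` or `sd()` return `NA` for any group that contains one unless `na.rm = TRUE` is supplied. The sketch below shows this on an invented data frame `toy` with made-up column names.

```r
library(dplyr)

# Hypothetical toy data, invented for this sketch
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  lifeExp   = c(24, 50, 60, 70),
  gdp       = c(10, 20, 30, 50)
)

result <- toy %>%
  # rows with lifeExp <= 25 get NA instead of their gdp value
  mutate(gdp_kept = ifelse(lifeExp > 25, gdp, NA)) %>%
  group_by(continent) %>%
  summarize(
    mean_with_na    = mean(gdp_kept),               # NA if the group has any NA
    mean_without_na = mean(gdp_kept, na.rm = TRUE)  # computed from non-NA rows only
  )

result
```

Whether to use `na.rm = TRUE` depends on the question being asked, so it is left as an explicit choice here rather than a recommendation.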

Combining dplyr and ggplot2 +

+

First install and load ggplot2:

+
+

R +

+
+install.packages('ggplot2')
+
+
+

R +

+
+library("ggplot2")
+
+

In the plotting lesson we looked at how to make a multi-panel figure +by adding a layer of facet panels using ggplot2. Here is +the code we used (with some extra comments):

+
+

R +

+
+# Filter countries located in the Americas
+americas <- gapminder[gapminder$continent == "Americas", ]
+# Make the plot
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))
+
+

This code makes the right plot but it also creates an intermediate +variable (americas) that we might not have any other uses +for. Just as we used %>% to pipe data along a chain of +dplyr functions we can use it to pass data to +ggplot(). Because %>% supplies the first +argument of a function we don’t need to specify the data = +argument in the ggplot() function. By combining +dplyr and ggplot2 functions we can make the +same figure without creating any new variables or modifying the +data.

+
+

R +

+
+gapminder %>%
+  # Filter countries located in the Americas
+  filter(continent == "Americas") %>%
+  # Make the plot
+  ggplot(mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))
+
+

More examples of using the function mutate() and the +ggplot2 package.

+
+

R +

+
+gapminder %>%
+  # extract first letter of country name into new column
+  mutate(startsWith = substr(country, 1, 1)) %>%
+  # only keep countries starting with A or Z
+  filter(startsWith %in% c("A", "Z")) %>%
+  # plot lifeExp into facets
+  ggplot(aes(x = year, y = lifeExp, colour = continent)) +
+  geom_line() +
+  facet_wrap(vars(country)) +
+  theme_minimal()
+
+
+
+ +
+
+

Advanced Challenge +

+
+

Calculate the average life expectancy in 2002 of 2 randomly selected +countries for each continent. Then arrange the continent names in +reverse order. Hint: Use the dplyr +functions arrange() and sample_n(), they have +similar syntax to other dplyr functions.

+
+
+
+
+
+ +
+
+
+

R +

+
+lifeExp_2countries_bycontinents <- gapminder %>%
+   filter(year==2002) %>%
+   group_by(continent) %>%
+   sample_n(2) %>%
+   summarize(mean_lifeExp=mean(lifeExp)) %>%
+   arrange(desc(continent))
+
+
+
+
+
+

Other great resources +

+
+
+ +
+
+

Keypoints +

+
+
  • Use the dplyr package to manipulate data frames.
  • +
  • Use select() to choose variables from a data +frame.
  • +
  • Use filter() to choose data based on values.
  • +
  • Use group_by() and summarize() to work +with subsets of data.
  • +
  • Use mutate() to create new variables.
  • +
+
+
+
+
+ + +
+
+ + + diff --git a/instructor/14-tidyr.html b/instructor/14-tidyr.html new file mode 100644 index 000000000..1b636a826 --- /dev/null +++ b/instructor/14-tidyr.html @@ -0,0 +1,1161 @@ + +R for Reproducible Scientific Analysis: Data Frame Manipulation with tidyr +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Data Frame Manipulation with tidyr

+

Last updated on 2023-10-26

+ + + +

Estimated time 45 minutes

+ +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I change the layout of a data frame?
  • +
+
+
+
+
+
+

Objectives

+
  • To understand the concepts of ‘longer’ and ‘wider’ data frame +formats and be able to convert between them with +tidyr.
  • +
+
+
+
+
+

Researchers often want to reshape their data frames from ‘wide’ to +‘longer’ layouts, or vice-versa. The ‘long’ layout or format is +where:

+
  • each column is a variable
  • +
  • each row is an observation
  • +

In the purely ‘long’ (or ‘longest’) format, you usually have 1 column +for the observed variable and the other columns are ID variables.

+

For the ‘wide’ format each row is often a site/subject/patient and +you have multiple observation variables containing the same type of +data. These can be either repeated observations over time, or +observation of multiple variables (or a mix of both). You may find data +input may be simpler or some other applications may prefer the ‘wide’ +format. However, many of R’s functions have been designed +assuming you have ‘longer’ formatted data. This tutorial will help you +efficiently transform your data shape regardless of original format.

+
Diagram illustrating the difference between a wide versus long layout of a data frame

Long and wide data frame layouts mainly affect readability. For +humans, the wide format is often more intuitive since we can often see +more of the data on the screen due to its shape. However, the long +format is more machine readable and is closer to the formatting of +databases. The ID variables in our data frames are similar to the fields +in a database and observed variables are like the database values.
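To make the two layouts concrete, here is a minimal sketch that writes the same invented measurements both ways. The data frames `wide` and `long`, their column names and their values are all made up for illustration; only base R is used.

```r
# Hypothetical toy values, invented for this sketch

# Wide: one row per country, one column per year
wide <- data.frame(
  country  = c("A", "B"),
  pop_2002 = c(10, 20),
  pop_2007 = c(12, 25)
)

# Long: one row per country-year observation,
# with year as an ID variable and pop as the observed variable
long <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2002, 2007, 2002, 2007),
  pop     = c(10, 12, 20, 25)
)

dim(wide)
dim(long)
```

Both data frames hold the same four population values; adding another year would add a column to `wide` but rows to `long`.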

+

Getting started +

+

First install the packages if you haven’t already done so (you +probably installed dplyr in the previous lesson):

+
+

R +

+
+#install.packages("tidyr")
+#install.packages("dplyr")
+
+

Load the packages

+
+

R +

+
+library("tidyr")
+library("dplyr")
+
+

First, let’s look at the structure of our original gapminder data +frame:

+
+

R +

+
+str(gapminder)
+
+
+

OUTPUT +

+
'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...
+
+
+
+ +
+
+

Challenge 1 +

+
+

Is gapminder a purely long, purely wide, or some intermediate +format?

+
+
+
+
+
+ +
+
+

The original gapminder data.frame is in an intermediate format. It is +not purely long since it has multiple observation variables +(pop, lifeExp, gdpPercap).

+
+
+
+
+

Sometimes, as with the gapminder dataset, we have multiple types of +observed data. It is somewhere in between the purely ‘long’ and ‘wide’ +data formats. We have 3 “ID variables” (continent, +country, year) and 3 “Observation variables” +(pop,lifeExp,gdpPercap). This +intermediate format can be preferred despite not having ALL observations +in 1 column given that all 3 observation variables have different units. +There are few operations that would need us to make this data frame any +longer (i.e. 4 ID variables and 1 Observation variable).

+

While using many of the functions in R, which are often vector based, +you usually do not want to do mathematical operations on values with +different units. For example, using the purely long format, a single +mean for all of the values of population, life expectancy, and GDP would +not be meaningful since it would return the mean of values with 3 +incompatible units. The solution is that we first manipulate the data +either by grouping (see the lesson on dplyr), or we change +the structure of the data frame. Note: Some plotting +functions in R actually work better with wide-format data.
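The point about incompatible units can be illustrated with a small sketch. The data frame `long` below is invented, with hypothetical column names `obs_type` and `obs_values`; it mixes population counts and life expectancies in one column, as a purely long layout would.

```r
library(dplyr)

# Hypothetical purely long toy data, invented for this sketch
long <- data.frame(
  country    = c("A", "A", "B", "B"),
  obs_type   = c("pop", "lifeExp", "pop", "lifeExp"),
  obs_values = c(1000000, 60, 3000000, 70)
)

# Mixes people and years: a number, but not a meaningful one
overall_mean <- mean(long$obs_values)

# Grouping first keeps each unit separate
by_type <- long %>%
  group_by(obs_type) %>%
  summarize(mean_value = mean(obs_values))

overall_mean
by_type
```

Grouping by the observation type gives one mean per unit, which is the kind of manipulation the paragraph above refers to.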

+

From wide to long format with pivot_longer() +

+

Until now, we’ve been using the nicely formatted original gapminder +dataset, but ‘real’ data (i.e. our own research data) will never be so +well organized. Here let’s start with the wide formatted version of the +gapminder dataset.

+
+

Download the wide version of the gapminder data from here and save it in your data +folder.

+
+

We’ll load the data file and look at it. Note: we don’t want our +continent and country columns to be factors, so we use the +stringsAsFactors argument for read.csv() to disable +that.

+
+

R +

+
+gap_wide <- read.csv("data/gapminder_wide.csv", stringsAsFactors = FALSE)
+str(gap_wide)
+
+
+

OUTPUT +

+
'data.frame':	142 obs. of  38 variables:
+ $ continent     : chr  "Africa" "Africa" "Africa" "Africa" ...
+ $ country       : chr  "Algeria" "Angola" "Benin" "Botswana" ...
+ $ gdpPercap_1952: num  2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num  3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num  2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num  3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num  4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num  4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num  5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num  5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num  5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num  4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num  5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num  6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num  43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num  45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num  48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num  51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num  54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num  58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num  61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num  65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num  67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num  69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num  71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num  72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num  9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num  10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num  11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num  12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num  14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num  17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num  20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num  23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num  26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num  29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : int  31287142 10866106 7026113 1630347 12251209 7021078 15929988 4048013 8835739 614382 ...
+ $ pop_2007      : int  33333216 12420476 8078314 1639131 14326203 8390505 17696293 4369038 10238807 710960 ...
+
+
Diagram illustrating the wide format of the gapminder data frame

To change this very wide data frame layout back to our nice, +intermediate (or longer) layout, we will use one of the two available +pivot functions from the tidyr package. To +convert from wide to a longer format, we will use the +pivot_longer() function. pivot_longer() makes +datasets longer by increasing the number of rows and decreasing the +number of columns, or ‘lengthening’ your observation variables into a +single variable.

+
Diagram illustrating how pivot longer reorganizes a data frame from a wide to long format
+

R +

+
+gap_long <- gap_wide %>%
+  pivot_longer(
+    cols = c(starts_with('pop'), starts_with('lifeExp'), starts_with('gdpPercap')),
+    names_to = "obstype_year", values_to = "obs_values"
+  )
+str(gap_long)
+
+
+

OUTPUT +

+
tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ continent   : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
+ $ country     : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
+ $ obstype_year: chr [1:5112] "pop_1952" "pop_1957" "pop_1962" "pop_1967" ...
+ $ obs_values  : num [1:5112] 9279525 10270856 11000948 12760499 14760787 ...
+
+

Here we have used piping syntax which is similar to what we were +doing in the previous lesson with dplyr. In fact, these are compatible +and you can use a mix of tidyr and dplyr functions by piping them +together.

+

We first provide to pivot_longer() a vector of column names that will be pivoted into longer format. We could type out all the observation variables, but as in the select() function (see dplyr lesson), we can use the starts_with() helper to select all variables that start with the desired character string. pivot_longer() also allows the alternative syntax of using the - symbol to identify which variables are not to be pivoted (i.e. ID variables).

+

The next arguments to pivot_longer() are names_to for naming the column that will contain the new ID variable (obstype_year) and values_to for naming the new amalgamated observation variable (obs_values). We supply these new column names as strings.

+
Diagram illustrating the long format of the gapminder data
+

R +

+
+gap_long <- gap_wide %>%
+  pivot_longer(
+    cols = c(-continent, -country),
+    names_to = "obstype_year", values_to = "obs_values"
+  )
+str(gap_long)
+
+
+

OUTPUT +

+
tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ continent   : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
+ $ country     : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
+ $ obstype_year: chr [1:5112] "gdpPercap_1952" "gdpPercap_1957" "gdpPercap_1962" "gdpPercap_1967" ...
+ $ obs_values  : num [1:5112] 2449 3014 2551 3247 4183 ...
+
+

That may seem trivial with this particular data frame, but sometimes +you have 1 ID variable and 40 observation variables with irregular +variable names. The flexibility is a huge time saver!
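As a minimal sketch of that situation, the toy data frame below (hypothetical; `field_data`, `plot_id` and the measurement names are not part of the gapminder data) has one ID variable and irregularly named observation variables. Excluding the ID column with `-` selects everything else without listing it:

```r
library(tidyr)

# Hypothetical toy data: one ID column plus irregularly named measurements.
field_data <- data.frame(
  plot_id    = c("p1", "p2"),
  height.cm  = c(12.5, 14.1),
  Leaf_Count = c(8, 11),
  soilPH     = c(6.4, 7.0),
  stringsAsFactors = FALSE
)

# Pivot everything except the ID column; no need to name each measurement.
field_long <- field_data %>%
  pivot_longer(cols = -plot_id, names_to = "measurement", values_to = "value")

dim(field_long)
```

With 2 rows and 3 measurement columns, the long version should have 6 rows and 3 columns.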

+

Now obstype_year actually contains 2 pieces of information, the observation type (pop, lifeExp, or gdpPercap) and the year. We can use the separate() function to split the character strings into multiple variables.

+
+

R +

+
+gap_long <- gap_long %>% separate(obstype_year, into = c('obs_type', 'year'), sep = "_")
+gap_long$year <- as.integer(gap_long$year)
+
+
+
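As an aside, recent versions of tidyr can do the pivot and the split in one call. The sketch below assumes tidyr 1.1.0 or later (for `names_transform`) and uses `mini_wide`, a hypothetical two-country miniature of `gap_wide` built from values shown in the output above:

```r
library(tidyr)

# Hypothetical miniature of gap_wide: two countries, two years, two metrics.
mini_wide <- data.frame(
  country      = c("Algeria", "Angola"),
  pop_1952     = c(9279525, 4232095),
  pop_1957     = c(10270856, 4561361),
  lifeExp_1952 = c(43.1, 30.0),
  lifeExp_1957 = c(45.7, 32.0),
  stringsAsFactors = FALSE
)

# names_sep splits each column name at "_" into the two new columns,
# and names_transform converts the year piece to integer.
mini_long <- mini_wide %>%
  pivot_longer(
    cols = -country,
    names_to = c("obs_type", "year"),
    names_sep = "_",
    names_transform = list(year = as.integer),
    values_to = "obs_values"
  )

str(mini_long)
```

This only works when every pivoted column name contains the separator exactly once, which is the case for names like pop_1952.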
+ +
+
+

Challenge 2 +

+
+

Using gap_long, calculate the mean life expectancy, population, and gdpPercap for each continent. Hint: use the group_by() and summarize() functions we learned in the dplyr lesson.

+
+
+
+
+
+ +
+
+
+

R +

+
+gap_long %>% group_by(continent, obs_type) %>%
+   summarize(means=mean(obs_values))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
+

OUTPUT +

+
# A tibble: 15 × 3
+# Groups:   continent [5]
+   continent obs_type       means
+   <chr>     <chr>          <dbl>
+ 1 Africa    gdpPercap     2194. 
+ 2 Africa    lifeExp         48.9
+ 3 Africa    pop        9916003. 
+ 4 Americas  gdpPercap     7136. 
+ 5 Americas  lifeExp         64.7
+ 6 Americas  pop       24504795. 
+ 7 Asia      gdpPercap     7902. 
+ 8 Asia      lifeExp         60.1
+ 9 Asia      pop       77038722. 
+10 Europe    gdpPercap    14469. 
+11 Europe    lifeExp         71.9
+12 Europe    pop       17169765. 
+13 Oceania   gdpPercap    18622. 
+14 Oceania   lifeExp         74.3
+15 Oceania   pop        8874672. 
+
+
+
+
+
+

From long to intermediate format with pivot_wider() +

+

It is always good to check your work. So, let’s use the second pivot function, pivot_wider(), to ‘widen’ our observation variables back out. pivot_wider() is the opposite of pivot_longer(), making a dataset wider by increasing the number of columns and decreasing the number of rows. We can use pivot_wider() to pivot or reshape our gap_long to the original intermediate format or the widest format. Let’s start with the intermediate format.

+

The pivot_wider() function takes names_from +and values_from arguments.

+

To names_from we supply the column name whose contents +will be pivoted into new output columns in the widened data frame. The +corresponding values will be added from the column named in the +values_from argument.

+
+

R +

+
+gap_normal <- gap_long %>%
+  pivot_wider(names_from = obs_type, values_from = obs_values)
+dim(gap_normal)
+
+
+

OUTPUT +

+
[1] 1704    6
+
+
+

R +

+
+dim(gapminder)
+
+
+

OUTPUT +

+
[1] 1704    6
+
+
+

R +

+
+names(gap_normal)
+
+
+

OUTPUT +

+
[1] "continent" "country"   "year"      "gdpPercap" "lifeExp"   "pop"      
+
+
+

R +

+
+names(gapminder)
+
+
+

OUTPUT +

+
[1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"
+
+

Now we’ve got an intermediate data frame gap_normal with +the same dimensions as the original gapminder, but the +order of the variables is different. Let’s fix that before checking if +they are all.equal().

+
+

R +

+
+gap_normal <- gap_normal[, names(gapminder)]
+all.equal(gap_normal, gapminder)
+
+
+

OUTPUT +

+
[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                
+[3] "Component \"country\": 1704 string mismatches"                                         
+[4] "Component \"pop\": Mean relative difference: 1.634504"                                 
+[5] "Component \"continent\": 1212 string mismatches"                                       
+[6] "Component \"lifeExp\": Mean relative difference: 0.203822"                             
+[7] "Component \"gdpPercap\": Mean relative difference: 1.162302"                           
+
+
+

R +

+
+head(gap_normal)
+
+
+

OUTPUT +

+
# A tibble: 6 × 6
+  country  year      pop continent lifeExp gdpPercap
+  <chr>   <int>    <dbl> <chr>       <dbl>     <dbl>
+1 Algeria  1952  9279525 Africa       43.1     2449.
+2 Algeria  1957 10270856 Africa       45.7     3014.
+3 Algeria  1962 11000948 Africa       48.3     2551.
+4 Algeria  1967 12760499 Africa       51.4     3247.
+5 Algeria  1972 14760787 Africa       54.5     4183.
+6 Algeria  1977 17152804 Africa       58.0     4910.
+
+
+

R +

+
+head(gapminder)
+
+
+

OUTPUT +

+
      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134
+
+

We’re almost there. The original was sorted by country, then year, so we sort gap_normal the same way.

+
+

R +

+
+gap_normal <- gap_normal %>% arrange(country, year)
+all.equal(gap_normal, gapminder)
+
+
+

OUTPUT +

+
[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                
+
+

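That’s great! We’ve gone from the longest format back to the intermediate and we didn’t introduce any errors in our code. The two remaining messages concern only the class attribute: gap_normal is a tibble, while gapminder is a plain data frame. The contents now match.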
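The class messages in that output can be illustrated with a small sketch. The two objects below are hypothetical stand-ins (`base_df` and `tidy_tbl` are not part of the lesson): one plain data frame and one tibble with the same contents. Converting the tibble with `as.data.frame()` before comparing leaves only the contents to check:

```r
library(tibble)

# Hypothetical stand-ins: identical contents, different classes.
base_df <- data.frame(
  country = c("Algeria", "Angola"),
  year    = c(1952L, 1952L),
  stringsAsFactors = FALSE
)
tidy_tbl <- tibble(
  country = c("Algeria", "Angola"),
  year    = c(1952L, 1952L)
)

class(base_df)   # a single class
class(tidy_tbl)  # three classes, which is what "Lengths (3, 1) differ" refers to

# Dropping the tibble classes leaves only the contents to compare.
all.equal(as.data.frame(tidy_tbl), base_df)
```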

+

Now let’s convert the long format all the way back to the wide. In the wide format, we will keep country and continent as ID variables and pivot the observations across the 3 metrics (pop, lifeExp, gdpPercap) and time (year). First we need to create appropriate labels for all our new variables (time × metric combinations) and we also need to unify our ID variables to simplify the process of defining gap_wide.

+
+

R +

+
+gap_temp <- gap_long %>% unite(var_ID, continent, country, sep = "_")
+str(gap_temp)
+
+
+

OUTPUT +

+
tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ var_ID    : chr [1:5112] "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" ...
+ $ obs_type  : chr [1:5112] "gdpPercap" "gdpPercap" "gdpPercap" "gdpPercap" ...
+ $ year      : int [1:5112] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ obs_values: num [1:5112] 2449 3014 2551 3247 4183 ...
+
+
+

R +

+
+gap_temp <- gap_long %>%
+    unite(ID_var, continent, country, sep = "_") %>%
+    unite(var_names, obs_type, year, sep = "_")
+str(gap_temp)
+
+
+

OUTPUT +

+
tibble [5,112 × 3] (S3: tbl_df/tbl/data.frame)
+ $ ID_var    : chr [1:5112] "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" ...
+ $ var_names : chr [1:5112] "gdpPercap_1952" "gdpPercap_1957" "gdpPercap_1962" "gdpPercap_1967" ...
+ $ obs_values: num [1:5112] 2449 3014 2551 3247 4183 ...
+
+

Using unite() we now have a single ID variable, which is a combination of continent and country, and we have defined variable names. We’re now ready to pipe in pivot_wider().

+
+

R +

+
+gap_wide_new <- gap_long %>%
+  unite(ID_var, continent, country, sep = "_") %>%
+  unite(var_names, obs_type, year, sep = "_") %>%
+  pivot_wider(names_from = var_names, values_from = obs_values)
+str(gap_wide_new)
+
+
+

OUTPUT +

+
tibble [142 × 37] (S3: tbl_df/tbl/data.frame)
+ $ ID_var        : chr [1:142] "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
+ $ gdpPercap_1952: num [1:142] 2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num [1:142] 3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num [1:142] 2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num [1:142] 3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num [1:142] 4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num [1:142] 4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num [1:142] 5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num [1:142] 5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num [1:142] 5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num [1:142] 4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num [1:142] 5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num [1:142] 6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num [1:142] 43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num [1:142] 45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num [1:142] 48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num [1:142] 51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num [1:142] 54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num [1:142] 58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num [1:142] 61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num [1:142] 65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num [1:142] 67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num [1:142] 69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num [1:142] 71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num [1:142] 72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num [1:142] 9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num [1:142] 10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num [1:142] 11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num [1:142] 12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num [1:142] 14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num [1:142] 17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num [1:142] 20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num [1:142] 23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num [1:142] 26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num [1:142] 29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : num [1:142] 31287142 10866106 7026113 1630347 12251209 ...
+ $ pop_2007      : num [1:142] 33333216 12420476 8078314 1639131 14326203 ...
+
+
+
+ +
+
+

Challenge 3 +

+
+

Take this 1 step further and create a gap_ludicrously_wide format data frame by pivoting over countries, year and the 3 metrics. Hint: this new data frame should only have 5 rows.

+
+
+
+
+
+ +
+
+
+

R +

+
+gap_ludicrously_wide <- gap_long %>%
+   unite(var_names, obs_type, year, country, sep = "_") %>%
+   pivot_wider(names_from = var_names, values_from = obs_values)
+
+
+
+
+
+

Now we have a great ‘wide’ format data frame, but the ID_var could be more usable. Let’s separate it into 2 variables with separate().

+
+

R +

+
+gap_wide_betterID <- separate(gap_wide_new, ID_var, c("continent", "country"), sep="_")
+gap_wide_betterID <- gap_long %>%
+    unite(ID_var, continent, country, sep = "_") %>%
+    unite(var_names, obs_type, year, sep = "_") %>%
+    pivot_wider(names_from = var_names, values_from = obs_values) %>%
+    separate(ID_var, c("continent","country"), sep = "_")
+str(gap_wide_betterID)
+
+
+

OUTPUT +

+
tibble [142 × 38] (S3: tbl_df/tbl/data.frame)
+ $ continent     : chr [1:142] "Africa" "Africa" "Africa" "Africa" ...
+ $ country       : chr [1:142] "Algeria" "Angola" "Benin" "Botswana" ...
+ $ gdpPercap_1952: num [1:142] 2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num [1:142] 3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num [1:142] 2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num [1:142] 3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num [1:142] 4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num [1:142] 4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num [1:142] 5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num [1:142] 5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num [1:142] 5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num [1:142] 4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num [1:142] 5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num [1:142] 6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num [1:142] 43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num [1:142] 45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num [1:142] 48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num [1:142] 51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num [1:142] 54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num [1:142] 58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num [1:142] 61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num [1:142] 65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num [1:142] 67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num [1:142] 69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num [1:142] 71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num [1:142] 72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num [1:142] 9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num [1:142] 10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num [1:142] 11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num [1:142] 12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num [1:142] 14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num [1:142] 17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num [1:142] 20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num [1:142] 23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num [1:142] 26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num [1:142] 29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : num [1:142] 31287142 10866106 7026113 1630347 12251209 ...
+ $ pop_2007      : num [1:142] 33333216 12420476 8078314 1639131 14326203 ...
+
+
+

R +

+
+all.equal(gap_wide, gap_wide_betterID)
+
+
+

OUTPUT +

+
[1] "Attributes: < Component \"class\": Lengths (1, 3) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                
+
+

There and back again!

+

Other great resources +

+
+
+ +
+
+

Keypoints +

+
+
  • Use the tidyr package to change the layout of data +frames.
  • +
  • Use pivot_longer() to go from wide to longer +layout.
  • +
  • Use pivot_wider() to go from long to wider layout.
  • +
+
+
+
+
+ + +
+
+ + + diff --git a/instructor/15-knitr-markdown.html b/instructor/15-knitr-markdown.html new file mode 100644 index 000000000..a7c9df326 --- /dev/null +++ b/instructor/15-knitr-markdown.html @@ -0,0 +1,940 @@ + +R for Reproducible Scientific Analysis: Producing Reports With knitr +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Producing Reports With knitr

+

Last updated on 2023-10-26

+ + + +

Estimated time 75 minutes

+ +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I integrate software and reports?
  • +
+
+
+
+
+
+

Objectives

+
  • Understand the value of writing reproducible reports
  • +
  • Learn how to recognise and compile the basic components of an R +Markdown file
  • +
  • Become familiar with R code chunks, and understand their purpose, +structure and options
  • +
  • Demonstrate the use of inline chunks for weaving R outputs into text +blocks, for example when discussing the results of some +calculations
  • +
  • Be aware of alternative output formats to which an R Markdown file +can be exported
  • +
+
+
+
+
+

Data analysis reports +

+

Data analysts tend to write a lot of reports, describing their +analyses and results, for their collaborators or to document their work +for future reference.

+

Many new users begin by first writing a single R script containing +all of their work, and then share the analysis by emailing the script +and various graphs as attachments. But this can be cumbersome, requiring +a lengthy discussion to explain which attachment was which result.

+

Writing formal reports with Word or LaTeX can simplify this +process by incorporating both the analysis report and output graphs into +a single document. But tweaking formatting to make figures look correct +and fixing obnoxious page breaks can be tedious and lead to a lengthy +“whack-a-mole” game of fixing new mistakes resulting from a single +formatting change.

+

Creating a report as a web page (which is an html file) using R Markdown makes things easier. The report can be one long stream, so tall figures that wouldn’t ordinarily fit on one page can be kept at full size and are easier to read, since the reader can simply keep scrolling. Additionally, the formatting of an R Markdown document is simple and easy to modify, allowing you to spend more time on your analyses instead of writing reports.

+

Literate programming +

+

Ideally, such analysis reports are reproducible documents: +If an error is discovered, or if some additional subjects are added to +the data, you can just re-compile the report and get the new or +corrected results rather than having to reconstruct figures, paste them +into a Word document, and hand-edit various detailed results.

+

The key R package here is knitr. It allows you +to create a document that is a mixture of text and chunks of code. When +the document is processed by knitr, chunks of code will be +executed, and graphs or other results will be inserted into the final +document.

+

This sort of idea has been called “literate programming”.

+

knitr allows you to mix basically any type of text with +code from different programming languages, but we recommend that you use +R Markdown, which mixes Markdown with R. Markdown is a light-weight +mark-up language for creating web pages.

+

Creating an R Markdown file +

+

Within RStudio, click File → New File → R Markdown and you’ll get a +dialog box like this:

+
Screenshot of the New R Markdown file dialogue box in RStudio

You can stick with the default (HTML output), but give it a +title.

+

Basic components of R Markdown +

+

The initial chunk of text (header) contains instructions for R to +specify what kind of document will be created, and the options chosen. +You can use the header to give your document a title, author, date, and +tell it what type of output you want to produce. In this case, we’re +creating an html document.

+
---
+title: "Initial R Markdown document"
+author: "Karl Broman"
+date: "April 23, 2015"
+output: html_document
+---
+

You can delete any of those fields if you don’t want them included. +The double-quotes aren’t strictly necessary in this case. +They’re mostly needed if you want to include a colon in the title.

+

RStudio creates the document with some example text to get you +started. Note below that there are chunks like

+
+```{r}
+summary(cars)
+```
+
+

These are chunks of R code that will be executed by +knitr and replaced by their results. More on this +later.

+

Markdown +

+

Markdown is a system for writing web pages by marking up the text +much as you would in an email rather than writing html code. The +marked-up text gets converted to html, replacing the marks with +the proper html code.

+

For now, let’s delete all of the stuff that’s there and write a bit +of markdown.

+

You make things bold using two asterisks, like this: **bold**, and you make things italic by using underscores, like this: _italics_.

+

You can make a bulleted list by starting each item with a hyphen or an asterisk, leaving a blank line between the list and the preceding text, like this:

+
A list:
+
+* bold with double-asterisks
+* italics with underscores
+* code-type font with backticks
+

or like this:

+
A second list:
+
+- bold with double-asterisks
+- italics with underscores
+- code-type font with backticks
+

Each will appear as:

+
  • bold with double-asterisks
  • +
  • italics with underscores
  • +
  • code-type font with backticks
  • +

You can use whatever method you prefer, but be consistent. +This maintains the readability of your code.

+

You can make a numbered list by just using numbers. You can even use +the same number over and over if you want:

+
1. bold with double-asterisks
+1. italics with underscores
+1. code-type font with backticks
+

This will appear as:

+
  1. bold with double-asterisks
  2. italics with underscores
  3. code-type font with backticks

You can make section headers of different sizes by initiating a line +with some number of # symbols:

+
# Title
+## Main section
+### Sub-section
+#### Sub-sub section
+

You compile the R Markdown document to an html webpage by +clicking the “Knit” button in the upper-left.

+
+
+ +
+
+

Challenge 1 +

+
+

Create a new R Markdown document. Delete all of the R code chunks and +write a bit of Markdown (some sections, some italicized text, and an +itemized list).

+

Convert the document to a webpage.

+
+
+
+
+
+ +
+
+

In RStudio, select File > New file > R Markdown…

+

Delete the placeholder text and add the following:

+
# Introduction
+
+## Background on Data
+
+This report uses the *gapminder* dataset, which has columns that include:
+
+* country
+* continent
+* year
+* lifeExp
+* pop
+* gdpPercap
+
+## Background on Methods
+
+

Then click the ‘Knit’ button on the toolbar to generate an html +document (webpage).

+
+
+
+
+

A bit more Markdown +

+

You can make a hyperlink like this: +[Carpentries Home Page](https://carpentries.org/).

+

You can include an image file like this: +![The Carpentries Logo](https://carpentries.org/assets/img/TheCarpentries.svg)

+

You can do subscripts (e.g., F₂) with F~2~ and superscripts (e.g., F²) with F^2^.

+

If you know how to write equations in LaTeX, you can use +$ $ and $$ $$ to insert math equations, like +$E = mc^2$ and

+
$$y = \mu + \sum_{i=1}^p \beta_i x_i + \epsilon$$
+

You can review Markdown syntax by navigating to the “Markdown Quick +Reference” under the “Help” field in the toolbar at the top of +RStudio.

+

R code chunks +

+

The real power of Markdown comes from mixing markdown with chunks of code. This is R Markdown. When processed, the R code will be executed; if it produces figures, the figures will be inserted in the final document.

+

The main code chunks look like this:

+
+```{r load_data}
+gapminder 
+```
+

That is, you place a chunk of R code between ```{r chunk_name} and ```. You should give each chunk a unique name: names help you to fix errors and, if any graphs are produced, the file names are based on the name of the code chunk that produced them. You can create code chunks quickly in RStudio using the shortcuts Ctrl+Alt+I on Windows and Linux, or Cmd+Option+I on Mac.

+
+
+ +
+
+

Challenge 2 +

+
+

Add code chunks to:

+
  • Load the ggplot2 package
  • +
  • Read the gapminder data
  • +
  • Create a plot
  • +
+
+
+
+
+ +
+
+
+```{r load-ggplot2}
+library("ggplot2")
+```
+
+
+```{r read-gapminder-data}
+gapminder 
+```
+
+```{r make-plot}
+plot(lifeExp ~ year, data = gapminder)
+```
+
+
+
+
+
+
+

How things get compiled +

+

When you press the “Knit” button, the R Markdown document is +processed by knitr +and a plain Markdown document is produced (as well as, potentially, a +set of figure files): the R code is executed and replaced by both the +input and the output; if figures are produced, links to those figures +are included.

+

The Markdown and figure documents are then processed by the tool pandoc, which converts the +Markdown file into an html file, with the figures embedded.
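The first of these two stages can be tried directly from the R console. The sketch below is an illustration only: it writes a hypothetical one-chunk report to a temporary file and runs `knitr::knit()` on it, which produces the plain Markdown file that pandoc would then convert.

```r
library(knitr)

# Three backticks, built here so the example itself stays easy to read.
fence <- strrep("`", 3)

# A hypothetical tiny report with a single R chunk.
rmd_lines <- c(
  "# Tiny report",
  "",
  paste0(fence, "{r add}"),
  "1 + 1",
  fence
)

rmd_file <- tempfile(fileext = ".Rmd")
md_file  <- tempfile(fileext = ".md")
writeLines(rmd_lines, rmd_file)

# Stage 1: knitr runs the chunk and writes plain Markdown.
knit(rmd_file, output = md_file, quiet = TRUE)
md_lines <- readLines(md_file)
print(md_lines)

# Stage 2 needs pandoc; rmarkdown::render() runs both stages in one call.
# rmarkdown::render(rmd_file, output_format = "html_document")
```

The printed Markdown should contain both the code (`1 + 1`) and its result, which is what "replaced by both the input and the output" means in practice.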

+

Chunk options +

+

There are a variety of options to affect how the code chunks are +treated. Here are some examples:

+
  • Use echo=FALSE to avoid having the code itself +shown.
  • +
  • Use results="hide" to avoid having any results +printed.
  • +
  • Use eval=FALSE to have the code shown but not +evaluated.
  • +
  • Use warning=FALSE and message=FALSE to +hide any warnings or messages produced.
  • +
  • Use fig.height and fig.width to control +the size of the figures produced (in inches).
  • +

So you might write:

+
+```{r load_libraries, echo=FALSE, message=FALSE}
+library("dplyr")
+library("ggplot2")
+```
+
+

Often there will be particular options that you’ll want to use +repeatedly; for this, you can set global chunk options, like +so:

+
+```{r global_options, echo=FALSE}
+knitr::opts_chunk$set(fig.path="Figs/", message=FALSE, warning=FALSE,
+                      echo=FALSE, results="hide", fig.width=11)
+```
+
+

The fig.path option defines where the figures will be +saved. The / here is really important; without it, the +figures would be saved in the standard place but just with names that +begin with Figs.

+

If you have multiple R Markdown files in a common directory, you +might want to use fig.path to define separate prefixes for +the figure file names, like fig.path="Figs/cleaning-" and +fig.path="Figs/analysis-".
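To see why the trailing slash matters, here is a rough sketch of how a figure file name is assembled. It assumes knitr's usual pattern of prefix, chunk label, and figure index; the chunk label below is hypothetical, and the code only pastes strings together rather than calling knitr.

```r
# Hypothetical chunk label, for illustration only.
chunk_label <- "make-plot"

with_slash    <- paste0("Figs/", chunk_label, "-1.png")
without_slash <- paste0("Figs",  chunk_label, "-1.png")

with_slash     # "Figs/make-plot-1.png": a file inside the Figs directory
without_slash  # "Figsmake-plot-1.png": a file whose name merely starts with Figs

dirname(with_slash)     # "Figs"
dirname(without_slash)  # "."
```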

+
+
+ +
+
+

Challenge 3 +

+
+

Use chunk options to control the size of a figure and to hide the +code.

+
+
+
+
+
+ +
+
+
+```{r echo = FALSE, fig.width = 3}
+plot(faithful)
+```
+
+
+
+
+
+

You can review all of the R chunk options by navigating +to the “R Markdown Cheat Sheet” under the “Cheatsheets” section of the +“Help” field in the toolbar at the top of RStudio.

+

Inline R code +

+

You can make every number in your report reproducible. Use +`r and ` for an in-line code chunk, like so: +`r round(some_value, 2)`. The code will be executed and +replaced with the value of the result.

+

Don’t let these in-line chunks get split across lines.

+

Perhaps precede the paragraph with a larger code chunk that does +calculations and defines variables, with include=FALSE for +that larger chunk (which is the same as echo=FALSE and +results="hide").

+

Rounding can produce differences in output in such situations. You +may want 2.0, but round(2.03, 1) will give +just 2.

+

The myround +function in the R/broman +package handles this.
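If you would rather not add a package, base R's formatting functions offer an alternative. This is a sketch of that alternative, not a description of how `myround` works internally; note that the results are character strings rather than numbers.

```r
x <- 2.03

round(x, 1)                           # numeric 2; the trailing zero is not printed
formatC(x, format = "f", digits = 1)  # character "2.0"
sprintf("%.1f", x)                    # character "2.0"
```

In an in-line chunk you could then write something like `sprintf("%.1f", some_value)` in place of `round(some_value, 1)`.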

+
+
+ +
+
+

Challenge 4 +

+
+

Try out a bit of in-line R code.

+
+
+
+
+
+ +
+
+

Here’s some inline code to determine that 2 + 2 = 4.

+
+
+
+
+

Other output options +

+

You can also convert R Markdown to a PDF or a Word document. Click +the little triangle next to the “Knit” button to get a drop-down menu. +Or you could put pdf_document or word_document +in the initial header of the file.

+
+
+ +
+
+

Tip: Creating PDF documents +

+
+

Creating .pdf documents may require installation of some extra +software. The R package tinytex provides some tools to help +make this process easier for R users. With tinytex +installed, run tinytex::install_tinytex() to install the +required software (you’ll only need to do this once) and then when you +knit to pdf tinytex will automatically detect and install +any additional LaTeX packages that are needed to produce the pdf +document. Visit the tinytex +website for more information.

+
+
+
+
+
+ +
+
+

Tip: Visual markdown editing in RStudio +

+
+

RStudio versions 1.4 and later include visual markdown editing mode. +In visual editing mode, markdown expressions (like +**bold words**) are transformed to the formatted appearance +(bold words) as you type. This mode also includes a +toolbar at the top with basic formatting buttons, similar to what you +might see in common word processing software programs. You can turn +visual editing on and off by pressing the button in the top right corner of your +R Markdown document.

+
+
+
+

Resources +

+
+
+ +
+
+

Keypoints +

+
+
  • Mix reporting written in R Markdown with software written in R.
  • +
  • Specify chunk options to control formatting.
  • +
  • Use knitr to convert these documents into PDF and other +formats.
  • +
+
+
+
+
+ + +
+
+ + + diff --git a/instructor/16-wrap-up.html b/instructor/16-wrap-up.html new file mode 100644 index 000000000..8313c0829 --- /dev/null +++ b/instructor/16-wrap-up.html @@ -0,0 +1,588 @@ + +R for Reproducible Scientific Analysis: Writing Good Software +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Writing Good Software

+

Last updated on 2023-10-26

+ + + +

Estimated time 15 minutes

+ +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I write software that other people can use?
  • +
+
+
+
+
+
+

Objectives

+
  • Describe best practices for writing R and explain the justification +for each.
  • +
+
+
+
+
+

Structure your project folder +

+

Keep your project folder structured, organized and tidy, by creating +subfolders for your code files, manuals, data, binaries, output plots, +etc. It can be done completely manually, or with the help of RStudio’s +New Project functionality, or a designated package, such as +ProjectTemplate.

+
+
+ +
+
+

Tip: ProjectTemplate - a possible +solution +

+
+

One way to automate the management of projects is to install the +third-party package, ProjectTemplate. This package will set +up an ideal directory structure for project management. This is very +useful as it enables you to have your analysis pipeline/workflow +organised and structured. Together with the default RStudio project +functionality and Git you will be able to keep track of your work as +well as be able to share your work with collaborators.

+
  1. Install ProjectTemplate.
  2. Load the library
  3. Initialise the project:
+

R +

+
+install.packages("ProjectTemplate")
+library("ProjectTemplate")
+create.project("../my_project_2", merge.strategy = "allow.non.conflict")
+
+

For more information on ProjectTemplate and its functionality, visit the ProjectTemplate home page.

+
+
+
+

Make code readable +

+

The most important part of writing code is making it readable and understandable. You want someone else to be able to pick up your code and understand what it does. More often than not, this someone will be you six months down the line, who will otherwise be cursing your past self.

+

Documentation: tell us what and why, not how +

+

When you first start out, your comments will often describe what a +command does, since you’re still learning yourself and it can help to +clarify concepts and remind you later. However, these comments aren’t +particularly useful later on when you don’t remember what problem your +code is trying to solve. Try to also include comments that tell you +why you’re solving a problem, and what problem that +is. The how can come after that: it’s an implementation detail +you ideally shouldn’t have to worry about.
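One way this could look in practice is sketched below. The function name `fahr_to_kelvin` and the scenario in the comments are assumptions made up for illustration; the point is only that the comments state the problem and the reason before the implementation.

```r
# Why: temperature readings arrive in Fahrenheit, but the rest of the
# analysis assumes Kelvin, so every reading is converted on the way in.
# What: convert a numeric vector of Fahrenheit values to Kelvin.
fahr_to_kelvin <- function(temp_f) {
  ((temp_f - 32) * (5 / 9)) + 273.15
}

fahr_to_kelvin(32)  # freezing point of water
```

The arithmetic itself (the how) needs no comment, because the code already says it.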

+

Keep your code modular +

+

Our recommendation is that you should separate your functions from +your analysis scripts, and store them in a separate file that you +source when you open the R session in your project. This +approach is nice because it leaves you with an uncluttered analysis +script, and a repository of useful functions that can be loaded into any +analysis script in your project. It also lets you group related +functions together easily.
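A minimal sketch of this layout follows. The file name `functions.R` and the function `center` are hypothetical, and the file is written to a temporary directory only so that the example is self-contained; in a real project the file would already exist in the project folder.

```r
# Hypothetical functions file, written to a temporary location so the
# example is self-contained; in a real project it would live in the project
functions_file <- file.path(tempdir(), "functions.R")
writeLines(c(
  "# Centre a numeric vector on a desired mean",
  "center <- function(values, desired_mean = 0) {",
  "  values - mean(values) + desired_mean",
  "}"
), functions_file)

# In the analysis script: load the shared functions, then use them
source(functions_file)
centered <- center(c(1, 2, 3), desired_mean = 10)
centered
```

The analysis script stays short because it only contains the call to `source()` and the analysis steps themselves.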

+

Break down problems into bite-size pieces +

+

When you first start out, problem solving and function writing can be daunting tasks, and hard to separate from inexperience with coding. Try to break down your problem into digestible chunks and worry about the implementation details later: keep breaking down the problem into smaller and smaller functions until you reach a point where you can code a solution, and build back up from there.

+

Know that your code is doing the right thing +

+

Make sure to test your functions!
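A simple way to start, sketched here with base R only, is to write a few checks with `stopifnot()` that can be re-run whenever the function changes. The function `celsius_to_kelvin` is a hypothetical example, not part of the lesson code.

```r
# Hypothetical function under test
celsius_to_kelvin <- function(temp_c) {
  temp_c + 273.15
}

# Re-runnable checks: stopifnot() is silent on success and raises an
# error as soon as one of the conditions is not TRUE
stopifnot(
  isTRUE(all.equal(celsius_to_kelvin(0), 273.15)),
  isTRUE(all.equal(celsius_to_kelvin(-273.15), 0)),
  length(celsius_to_kelvin(c(0, 100))) == 2
)
```

Dedicated packages such as testthat offer more structure for larger projects, but a handful of checks like these is often enough to catch mistakes early.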

+

Don’t repeat yourself +

+

Functions enable easy reuse within a project. If you see blocks of similar lines of code throughout your project, those are usually candidates for being moved into functions.

+

If your calculations are performed through a series of functions, then the project becomes more modular and easier to change. This is especially true when a particular input always gives a particular output.
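The sketch below shows the idea on a made-up example: the same rescaling is first written out twice, then moved into one function. The data and the name `rescale01` are assumptions for illustration.

```r
# Repetitive version: the same rescaling is written out for each vector
heights <- c(150, 160, 170)
weights <- c(50, 60, 70)
heights_scaled <- (heights - min(heights)) / (max(heights) - min(heights))
weights_scaled <- (weights - min(weights)) / (max(weights) - min(weights))

# Refactored version: the shared logic lives in one function
rescale01 <- function(values) {
  (values - min(values)) / (max(values) - min(values))
}
heights_scaled2 <- rescale01(heights)
weights_scaled2 <- rescale01(weights)
```

If the rescaling ever needs to change, only `rescale01` has to be edited, and because the same input always gives the same output it is also easy to test.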

+

Remember to be stylish +

+

Apply consistent style to your code.

+
+
+ +
+
+

Keypoints +

+
+
  • Keep your project folder structured, organized and tidy.
  • +
  • Document what and why, not how.
  • +
  • Break programs into short single-purpose functions.
  • +
  • Write re-runnable tests.
  • +
  • Don’t repeat yourself.
  • +
  • Be consistent in naming, indentation, and other aspects of +style.
  • +
+
+
+
+
+ + +
+
+ + + diff --git a/instructor/404.html b/instructor/404.html new file mode 100644 index 000000000..fc2ef6605 --- /dev/null +++ b/instructor/404.html @@ -0,0 +1,451 @@ + +R for Reproducible Scientific Analysis: Page not found +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Page not found

+ +

Our apologies! +

+

We cannot seem to find the page you are looking for. Here are some +tips that may help:

+
  1. try going back to the previous +page or
  2. +
  3. navigate to any other page using the navigation bar on the +left.
  4. +
  5. if the URL ends with /index.html, try removing +that.
  6. +
  7. head over to the home page of this +lesson +
  8. +

If you came here from a link in this lesson, please contact the +lesson maintainers using the links at the foot of this page.

+
+
+ + +
+
+ + + diff --git a/instructor/CODE_OF_CONDUCT.html b/instructor/CODE_OF_CONDUCT.html new file mode 100644 index 000000000..2df159c96 --- /dev/null +++ b/instructor/CODE_OF_CONDUCT.html @@ -0,0 +1,451 @@ + +R for Reproducible Scientific Analysis: Contributor Code of Conduct +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Contributor Code of Conduct

+

Last updated on 2023-10-26

+ + + + + +
+ +
+ + + +

As contributors and maintainers of this project, we pledge to follow The Carpentries Code of Conduct.

+

Instances of abusive, harassing, or otherwise unacceptable behavior +may be reported by following our reporting +guidelines.

+ + + +
+
+ + +
+
+ + + diff --git a/instructor/LICENSE.html b/instructor/LICENSE.html new file mode 100644 index 000000000..3e3bc679a --- /dev/null +++ b/instructor/LICENSE.html @@ -0,0 +1,502 @@ + +R for Reproducible Scientific Analysis: Licenses +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Licenses

+

Last updated on 2023-10-26

+ + + + + +
+ +
+ + + +

Instructional Material +

+

All Carpentries (Software Carpentry, Data Carpentry, and Library +Carpentry) instructional material is made available under the Creative Commons +Attribution license. The following is a human-readable summary of +(and not a substitute for) the full legal +text of the CC BY 4.0 license.

+

You are free:

+
  • to Share—copy and redistribute the material in any +medium or format
  • +
  • to Adapt—remix, transform, and build upon the +material
  • +

for any purpose, even commercially.

+

The licensor cannot revoke these freedoms as long as you follow the +license terms.

+

Under the following terms:

+
  • Attribution—You must give appropriate credit +(mentioning that your work is derived from work that is Copyright (c) +The Carpentries and, where practical, linking to https://carpentries.org/), provide a link to the +license, and indicate if changes were made. You may do so in any +reasonable manner, but not in any way that suggests the licensor +endorses you or your use.

  • +
  • No additional restrictions—You may not apply +legal terms or technological measures that legally restrict others from +doing anything the license permits. With the understanding +that:

  • +

Notices:

+
  • You do not have to comply with the license for elements of the +material in the public domain or where your use is permitted by an +applicable exception or limitation.
  • +
  • No warranties are given. The license may not give you all of the +permissions necessary for your intended use. For example, other rights +such as publicity, privacy, or moral rights may limit how you use the +material.
  • +

Software +

+

Except where otherwise noted, the example programs and other software +provided by The Carpentries are made available under the OSI-approved MIT +license.

+

Permission is hereby granted, free of charge, to any person obtaining +a copy of this software and associated documentation files (the +“Software”), to deal in the Software without restriction, including +without limitation the rights to use, copy, modify, merge, publish, +distribute, sublicense, and/or sell copies of the Software, and to +permit persons to whom the Software is furnished to do so, subject to +the following conditions:

+

The above copyright notice and this permission notice shall be +included in all copies or substantial portions of the Software.

+

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, +EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. +IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY +CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, +TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE +SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

+

Trademark +

+

“The Carpentries”, “Software Carpentry”, “Data Carpentry”, and +“Library Carpentry” and their respective logos are registered trademarks +of Community Initiatives.

+
+
+ + +
+
+ + + diff --git a/instructor/aio.html b/instructor/aio.html new file mode 100644 index 000000000..fcb0086f6 --- /dev/null +++ b/instructor/aio.html @@ -0,0 +1,12669 @@ + + + + + +R for Reproducible Scientific Analysis: All in One View + + + + + + + + + + + +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + + +
+
+ + +

Content from Introduction to R and RStudio

+
+

Last updated on 2023-10-26

+

Estimated time 55 minutes

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How to find your way around RStudio?
  • +
  • How to interact with R?
  • +
  • How to manage your environment?
  • +
  • How to install packages?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • Describe the purpose and use of each pane in the RStudio IDE
  • +
  • Locate buttons and options in the RStudio IDE
  • +
  • Define a variable
  • +
  • Assign data to a variable
  • +
  • Manage a workspace in an interactive R session
  • +
  • Use mathematical and comparison operators
  • +
  • Call functions
  • +
  • Manage packages
  • +
+
+
+
+
+
+

Motivation +

+
+

Science is a multi-step process: once you’ve designed an experiment +and collected data, the real fun begins! This lesson will teach you how +to start this process using R and RStudio. We will begin with raw data, +perform exploratory analyses, and learn how to plot results graphically. +This example starts with a dataset from gapminder.org containing population +information for many countries through time. Can you read the data into +R? Can you plot the population for Senegal? Can you calculate the +average income for countries on the continent of Asia? By the end of +these lessons you will be able to do things like plot the populations +for all of these countries in under a minute!

+

Before Starting The Workshop +

+
+

Please ensure you have the latest version of R and RStudio installed +on your machine. This is important, as some packages used in the +workshop may not install correctly (or at all) if R is not up to +date.

+

Introduction to RStudio +

+
+

Welcome to the R portion of the Software Carpentry workshop.

+

Throughout this lesson, we’re going to teach you some of the +fundamentals of the R language as well as some best practices for +organizing code for scientific projects that will make your life +easier.

+

We’ll be using RStudio: a free, open-source R Integrated Development +Environment (IDE). It provides a built-in editor, works on all platforms +(including on servers) and provides many advantages such as integration +with version control and project management.

+

Basic layout

+

When you first open RStudio, you will be greeted by three panels:

+
    +
  • The interactive R console/Terminal (entire left)
  • +
  • Environment/History/Connections (tabbed in upper right)
  • +
  • Files/Plots/Packages/Help/Viewer (tabbed in lower right)
  • +
+
RStudio layout

Once you open files, such as R scripts, an editor panel will also +open in the top left.

+
RStudio layout with .R file open
+
+ +
+
+

R scripts +

+
+

Any commands that you write in the R console can be saved to a file to be run again. Files containing R code to be run in this way are called R scripts. R scripts have .R at the end of their names to let you know what they are.

+
+
+
+

Workflow within RStudio +

+
+

There are two main ways one can work within RStudio:

+
    +
  1. Test and play within the interactive R console then copy code into a +.R file to run later.
  2. +
+
    +
  • This works well when doing small tests and initially starting +off.
  • +
  • It quickly becomes laborious
  • +
+
    +
  1. Start writing in a .R file and use RStudio’s shortcut keys for the Run command to push the current line, selected lines or modified lines to the interactive R console.
  2. +
+
    +
  • This is a great way to start; all your code is saved for later
  • +
  • You will be able to run the file you create from within RStudio or +using R’s source() function.
  • +
+
+
+ +
+
+

Tip: Running segments of your code +

+
+

RStudio offers you great flexibility in running code from within the +editor window. There are buttons, menu choices, and keyboard shortcuts. +To run the current line, you can

+
    +
  1. click on the Run button above the editor panel, or
  2. +
  3. select “Run Lines” from the “Code” menu, or
  4. +
  5. hit Ctrl+Return in Windows or Linux or ⌘+Return on OS X. (This shortcut can also be seen by hovering the mouse over the button.) To run a block of code, select it and then Run. If you have modified a line of code within a block of code you have just run, there is no need to reselect the section and Run; you can use the next button along, Re-run the previous region. This will run the previous code block including the modifications you have made.
  6. +
+
+
+
+

Introduction to R +

+
+

Much of your time in R will be spent in the R interactive console. +This is where you will run all of your code, and can be a useful +environment to try out ideas before adding them to an R script file. +This console in RStudio is the same as the one you would get if you +typed in R in your command-line environment.

+

The first thing you will see in the R interactive session is a bunch +of information, followed by a “>” and a blinking cursor. In many ways +this is similar to the shell environment you learned about during the +shell lessons: it operates on the same idea of a “Read, evaluate, print +loop”: you type in commands, R tries to execute them, and then returns a +result.

+

Using R as a calculator +

+
+

The simplest thing you could do with R is to do arithmetic:

+
+

R +

+
+1 + 100
+
+
+

OUTPUT +

+
[1] 101
+
+

And R will print out the answer, with a preceding “[1]”. [1] is the +index of the first element of the line being printed in the console. For +more information on indexing vectors, see Episode +6: Subsetting Data.

+

If you type in an incomplete command, R will wait for you to complete it. If you are familiar with the Unix shell Bash, you may recognize this behavior.

+
+

R +

+
> 1 +
+
+
+

OUTPUT +

+
+
+
+

Any time you hit return and the R session shows a “+” instead of a +“>”, it means it’s waiting for you to complete the command. If you +want to cancel a command you can hit Esc and RStudio will +give you back the “>” prompt.

+
+
+ +
+
+

Tip: Canceling commands +

+
+

If you’re using R from the command line instead of from within +RStudio, you need to use Ctrl+C instead of +Esc to cancel the command. This applies to Mac users as +well!

+

Canceling a command isn’t only useful for killing incomplete +commands: you can also use it to tell R to stop running code (for +example if it’s taking much longer than you expect), or to get rid of +the code you’re currently writing.

+
+
+
+

When using R as a calculator, the order of operations is the same as +you would have learned back in school.

+

From highest to lowest precedence:

+
    +
  • Parentheses: (, ) +
  • +
  • Exponents: ^ or ** +
  • +
  • Multiply: * +
  • +
  • Divide: / +
  • +
  • Add: + +
  • +
  • Subtract: - +
  • +
+
+

R +

+
+3 + 5 * 2
+
+
+

OUTPUT +

+
[1] 13
+
+

Use parentheses to group operations in order to force the order of +evaluation if it differs from the default, or to make clear what you +intend.

+
+

R +

+
+(3 + 5) * 2
+
+
+

OUTPUT +

+
[1] 16
+
+

This can get unwieldy when not needed, but clarifies your intentions. +Remember that others may later read your code.

+
+

R +

+
+(3 + (5 * (2 ^ 2))) # hard to read
+3 + 5 * 2 ^ 2       # clear, if you remember the rules
+3 + 5 * (2 ^ 2)     # if you forget some rules, this might help
+
+

The text after each line of code is called a “comment”. Anything that +follows after the hash (or octothorpe) symbol # is ignored +by R when it executes code.

+

Very small or very large numbers are printed in scientific notation:

+
+

R +

+
+2/10000
+
+
+

OUTPUT +

+
[1] 2e-04
+
+

The “e” is shorthand for “multiplied by 10^XX”, so 2e-4 is shorthand for 2 * 10^(-4).

+

You can write numbers in scientific notation too:

+
+

R +

+
+5e3  # Note the lack of minus here
+
+
+

OUTPUT +

+
[1] 5000
+
+

Mathematical functions +

+
+

R has many built-in mathematical functions. To call a function, we type its name, followed by opening and closing parentheses. Functions take arguments as inputs: anything we type inside the parentheses of a function is considered an argument. Depending on the function, the number of arguments can vary from none to multiple. For example:

+
+

R +

+
+getwd() #returns an absolute filepath
+
+

doesn’t require an argument, whereas for the next set of mathematical +functions we will need to supply the function a value in order to +compute the result.

+
+

R +

+
+sin(1)  # trigonometry functions
+
+
+

OUTPUT +

+
[1] 0.841471
+
+
+

R +

+
+log(1)  # natural logarithm
+
+
+

OUTPUT +

+
[1] 0
+
+
+

R +

+
+log10(10) # base-10 logarithm
+
+
+

OUTPUT +

+
[1] 1
+
+
+

R +

+
+exp(0.5) # e^(1/2)
+
+
+

OUTPUT +

+
[1] 1.648721
+
+

Don’t worry about trying to remember every function in R. You can +look them up on Google, or if you can remember the start of the +function’s name, use the tab completion in RStudio.

+

This is one advantage that RStudio has over R on its own: it has auto-completion abilities that allow you to more easily look up functions, their arguments, and the values that they take.

+

Typing a ? before the name of a command will open the help page for that command. When using RStudio, this will open the ‘Help’ pane; if using R in the terminal, the help page will open in the terminal itself or in your browser, depending on your platform and settings. The help page will include a detailed description of the command and how it works. Scrolling to the bottom of the help page will usually show a collection of code examples which illustrate command usage. We’ll go through an example later.

+

Comparing things +

+
+

We can also do comparisons in R:

+
+

R +

+
+1 == 1  # equality (note two equals signs, read as "is equal to")
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+

R +

+
+1 != 2  # inequality (read as "is not equal to")
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+

R +

+
+1 < 2  # less than
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+

R +

+
+1 <= 1  # less than or equal to
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+

R +

+
+1 > 0  # greater than
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+

R +

+
+1 >= -9 # greater than or equal to
+
+
+

OUTPUT +

+
[1] TRUE
+
+
+
+ +
+
+

Tip: Comparing Numbers +

+
+

A word of warning about comparing numbers: you should never use +== to compare two numbers unless they are integers (a data +type which can specifically represent only whole numbers).

+

Computers may only represent decimal numbers with a certain degree of +precision, so two numbers which look the same when printed out by R, may +actually have different underlying representations and therefore be +different by a small margin of error (called Machine numeric +tolerance).

+

Instead you should use the all.equal function.

+

Further reading: http://floating-point-gui.de/
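A short illustration of the warning above: the values `0.1 + 0.2` and `0.3` print the same but are stored slightly differently. Note that `all.equal()` returns either `TRUE` or a description of the difference, so it is usually wrapped in `isTRUE()` when a single logical value is needed.

```r
a <- 0.1 + 0.2
b <- 0.3

a == b                    # FALSE: the two values differ by a tiny rounding error
isTRUE(all.equal(a, b))   # TRUE: equal within the default tolerance
```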

+
+
+
+

Variables and assignment +

+
+

We can store values in variables using the assignment operator +<-, like this:

+
+

R +

+
+x <- 1/40
+
+

Notice that assignment does not print a value. Instead, we stored it +for later in something called a variable. +x now contains the value +0.025:

+
+

R +

+
+x
+
+
+

OUTPUT +

+
[1] 0.025
+
+

More precisely, the stored value is a decimal approximation +of this fraction called a floating point +number.

+

Look for the Environment tab in the top right panel of +RStudio, and you will see that x and its value have +appeared. Our variable x can be used in place of a number +in any calculation that expects a number:

+
+

R +

+
+log(x)
+
+
+

OUTPUT +

+
[1] -3.688879
+
+

Notice also that variables can be reassigned:

+
+

R +

+
+x <- 100
+
+

x used to contain the value 0.025 and now it has the +value 100.

+

Assignment values can contain the variable being assigned to:

+
+

R +

+
+x <- x + 1 #notice how RStudio updates its description of x on the top right tab
+y <- x * 2
+
+

The right hand side of the assignment can be any valid R expression. +The right hand side is fully evaluated before the assignment +occurs.

+

Variable names can contain letters, numbers, underscores and periods but no spaces. They must start with a letter or a period followed by a letter (they cannot start with a number nor an underscore). Variables beginning with a period are hidden variables. Different people use different conventions for long variable names. These include:

+
    +
  • periods.between.words
  • +
  • underscores_between_words
  • +
  • camelCaseToSeparateWords
  • +
+

What you use is up to you, but be consistent.

+

It is also possible to use the = operator for +assignment:

+
+

R +

+
+x = 1/40
+
+

But this is much less common among R users. The most important thing +is to be consistent with the operator you use. There +are occasionally places where it is less confusing to use +<- than =, and it is the most common symbol +used in the community. So the recommendation is to use +<-.

+
+
+ +
+
+

Challenge 1 +

+
+

Which of the following are valid R variable names?

+
+

R +

+
min_height
+max.height
+_age
+.mass
+MaxLength
+min-length
+2widths
+celsius2kelvin
+
+
+
+
+
+
+ +
+
+

The following can be used as R variables:

+
+

R +

+
+min_height
+max.height
+MaxLength
+celsius2kelvin
+
+

The following creates a hidden variable:

+
+

R +

+
+.mass
+
+

The following cannot be used as variable names:

+
+

R +

+
_age
+min-length
+2widths
+
+
+
+
+
+

Vectorization +

+
+

One final thing to be aware of is that R is vectorized, meaning that variables and functions can have vectors as values. In contrast to physics and mathematics, a vector in R describes an ordered set of values that all have the same data type. For example:

+
+

R +

+
+1:5
+
+
+

OUTPUT +

+
[1] 1 2 3 4 5
+
+
+

R +

+
+2^(1:5)
+
+
+

OUTPUT +

+
[1]  2  4  8 16 32
+
+
+

R +

+
+x <- 1:5
+2^x
+
+
+

OUTPUT +

+
[1]  2  4  8 16 32
+
+

This is incredibly powerful; we will discuss this further in an +upcoming lesson.

+

Managing your environment +

+
+

There are a few useful commands you can use to interact with the R +session.

+

ls will list all of the variables and functions stored +in the global environment (your working R session):

+
+

R +

+
+ls()
+
+
+

OUTPUT +

+
[1] "x" "y"
+
+
+
+ +
+
+

Tip: hidden objects +

+
+

Like in the shell, ls will hide any variables or functions starting with a “.” by default. To list all objects, type ls(all.names=TRUE) instead.
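A small sketch of the difference, using two made-up variable names (`visible_value` and `.hidden_value`); the exact listing you see will also include whatever else is in your session.

```r
visible_value <- 1
.hidden_value <- 2   # the leading period makes this a hidden variable

ls()                  # includes "visible_value" but not ".hidden_value"
ls(all.names = TRUE)  # includes ".hidden_value" as well
```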

+
+
+
+

Note here that we didn’t give any arguments to ls, but +we still needed to give the parentheses to tell R to call the +function.

+

If we type ls by itself, R prints a bunch of code +instead of a listing of objects.

+
+

R +

+
+ls
+
+
+

OUTPUT +

+
function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE, 
+    pattern, sorted = TRUE) 
+{
+    if (!missing(name)) {
+        pos <- tryCatch(name, error = function(e) e)
+        if (inherits(pos, "error")) {
+            name <- substitute(name)
+            if (!is.character(name)) 
+                name <- deparse(name)
+            warning(gettextf("%s converted to character string", 
+                sQuote(name)), domain = NA)
+            pos <- name
+        }
+    }
+    all.names <- .Internal(ls(envir, all.names, sorted))
+    if (!missing(pattern)) {
+        if ((ll <- length(grep("[", pattern, fixed = TRUE))) && 
+            ll != length(grep("]", pattern, fixed = TRUE))) {
+            if (pattern == "[") {
+                pattern <- "\\["
+                warning("replaced regular expression pattern '[' by  '\\\\['")
+            }
+            else if (length(grep("[^\\\\]\\[<-", pattern))) {
+                pattern <- sub("\\[<-", "\\\\\\[<-", pattern)
+                warning("replaced '[<-' by '\\\\[<-' in regular expression pattern")
+            }
+        }
+        grep(pattern, all.names, value = TRUE)
+    }
+    else all.names
+}
+<bytecode: 0x557b0600c360>
+<environment: namespace:base>
+
+

What’s going on here?

+

Like everything in R, ls is the name of an object, and +entering the name of an object by itself prints the contents of the +object. The object x that we created earlier contains 1, 2, +3, 4, 5:

+
+

R +

+
+x
+
+
+

OUTPUT +

+
[1] 1 2 3 4 5
+
+

The object ls contains the R code that makes the +ls function work! We’ll talk more about how functions work +and start writing our own later.

+

You can use rm to delete objects you no longer need:

+
+

R +

+
+rm(x)
+
+

If you have lots of things in your environment and want to delete all +of them, you can pass the results of ls to the +rm function:

+
+

R +

+
+rm(list = ls())
+
+

In this case we’ve combined the two. Like the order of operations, +anything inside the innermost parentheses is evaluated first, and so +on.

+

In this case we’ve specified that the results of ls should be used for the list argument in rm. When assigning values to arguments by name, you must use the = operator!

+

If instead we use <-, there will be unintended side +effects, or you may get an error message:

+
+

R +

+
+rm(list <- ls())
+
+
+

ERROR +

+
Error in rm(list <- ls()): ... must contain names or character strings
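The kind of side effect meant here can be sketched with a function that does evaluate its argument, such as `mean()`. The variable name `scores` is made up for illustration.

```r
# With =, the vector is passed to the argument x and nothing else happens
mean(x = c(1, 2, 3))

# With <-, R first creates a variable called scores in your environment,
# then passes its value to mean() as the first argument
mean(scores <- c(1, 2, 3))
scores
```

After the second call, `scores` appears in the Environment tab even though the intention was only to compute a mean.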
+
+
+
+ +
+
+

Tip: Warnings vs. Errors +

+
+

Pay attention when R does something unexpected! Errors, like above, +are thrown when R cannot proceed with a calculation. Warnings on the +other hand usually mean that the function has run, but it probably +hasn’t worked as expected.

+

In both cases, the message that R prints out usually gives you clues about how to fix the problem.

+
+
+
+

R Packages +

+
+

It is possible to add functions to R by writing a package, or by obtaining a package written by someone else. As of this writing, there are over 10,000 packages available on CRAN (the Comprehensive R Archive Network). R and RStudio have functionality for managing packages:

+
    +
  • You can see what packages are installed by typing +installed.packages() +
  • +
  • You can install packages by typing +install.packages("packagename"), where +packagename is the package name, in quotes.
  • +
  • You can update installed packages by typing +update.packages() +
  • +
  • You can remove a package with +remove.packages("packagename") +
  • +
  • You can make a package available for use with +library(packagename) +
  • +
+

Packages can also be viewed, loaded, and detached in the Packages tab +of the lower right panel in RStudio. Clicking on this tab will display +all of the installed packages with a checkbox next to them. If the box +next to a package name is checked, the package is loaded and if it is +empty, the package is not loaded. Click an empty box to load that +package and click a checked box to detach that package.

+

Packages can be installed and updated from the Packages tab with the Install and Update buttons at the top of the tab.
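The commands above can be combined so that a package is only installed when it is missing. The helper `is_installed` below is a hypothetical convenience function, not part of R; it relies on `requireNamespace()`, which reports whether a package can be loaded (and loads its namespace as a side effect).

```r
# Hypothetical helper: TRUE if a package is installed, FALSE otherwise
is_installed <- function(package_name) {
  requireNamespace(package_name, quietly = TRUE)
}

# "stats" ships with R, so nothing is downloaded when this example runs
if (!is_installed("stats")) {
  install.packages("stats")
}
library(stats)
```

Replacing `"stats"` with the name of a CRAN package gives a script that can be re-run safely on a machine where the package may or may not be present.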

+
+
+ +
+
+

Challenge 2 +

+
+

What will be the value of each variable after each statement in the +following program?

+
+

R +

+
+mass <- 47.5
+age <- 122
+mass <- mass * 2.3
+age <- age - 20
+
+
+
+
+
+
+ +
+
+
+

R +

+
+mass <- 47.5
+
+

This will give a value of 47.5 for the variable mass

+
+

R +

+
+age <- 122
+
+

This will give a value of 122 for the variable age

+
+

R +

+
+mass <- mass * 2.3
+
+

This will multiply the existing value of 47.5 by 2.3 to give a new +value of 109.25 to the variable mass.

+
+

R +

+
+age <- age - 20
+
+

This will subtract 20 from the existing value of 122 to give a new +value of 102 to the variable age.

+
+
+
+
+
+
+ +
+
+

Challenge 3 +

+
+

Run the code from the previous challenge, and write a command to +compare mass to age. Is mass larger than age?

+
+
+
+
+
+ +
+
+

One way of answering this question in R is to use the > operator to set up the following comparison:

+
+

R +

+
+mass > age
+
+
+

OUTPUT +

+
[1] TRUE
+
+

This should yield a boolean value of TRUE since 109.25 is greater +than 102.

+
+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

Clean up your working environment by deleting the mass and age +variables.

+
+
+
+
+
+ +
+
+

We can use the rm command to accomplish this task:

+
+

R +

+
+rm(age, mass)
+
+
+
+
+
+
+
+ +
+
+

Challenge 5 +

+
+

Install the following packages: ggplot2, +plyr, gapminder

+
+
+
+
+
+ +
+
+

We can use the install.packages() command to install the +required packages.

+
+

R +

+
+install.packages("ggplot2")
+install.packages("plyr")
+install.packages("gapminder")
+
+

An alternative solution, which installs multiple packages with a single install.packages() command, is:

+
+

R +

+
+install.packages(c("ggplot2", "plyr", "gapminder"))
+
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
    +
  • Use RStudio to write and run R programs.
  • +
  • R has the usual arithmetic operators and mathematical +functions.
  • +
  • Use <- to assign values to variables.
  • +
  • Use ls() to list the variables in a program.
  • +
  • Use rm() to delete objects in a program.
  • +
  • Use install.packages() to install packages +(libraries).
  • +
+
+
+
+

Content from Project Management With RStudio

+
+

Last updated on 2023-10-26

+

Estimated time 30 minutes

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I manage my projects in R?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • Create self-contained projects in RStudio
  • +
+
+
+
+
+
+

Introduction +

+
+

The scientific process is naturally incremental, and many projects +start life as random notes, some code, then a manuscript, and eventually +everything is a bit mixed together.

+ +

Most people tend to organize their projects like this:

+
Screenshot of file manager demonstrating bad project organisation

There are many reasons why we should ALWAYS avoid this:

+
    +
  1. It is really hard to tell which version of your data is the original +and which is the modified;
  2. +
  3. It gets really messy because it mixes files with various extensions +together;
  4. +
  5. It probably takes you a lot of time to actually find things, and relate the correct figures to the exact code that has been used to generate them;
  6. +
+

A good project layout will ultimately make your life easier:

+
    +
  • It will help ensure the integrity of your data;
  • +
  • It makes it simpler to share your code with someone else (a +lab-mate, collaborator, or supervisor);
  • +
  • It allows you to easily upload your code with your manuscript +submission;
  • +
  • It makes it easier to pick the project back up after a break.
  • +

A possible solution +

+
+

Fortunately, there are tools and packages which can help you manage +your work effectively.

+

One of the most powerful and useful aspects of RStudio is its project +management functionality. We’ll be using this today to create a +self-contained, reproducible project.

+
+
+ +
+
+

Challenge 1: Creating a self-contained +project +

+
+

We’re going to create a new project in RStudio:

+
    +
  1. Click the “File” menu button, then “New Project”.
  2. +
  3. Click “New Directory”.
  4. +
  5. Click “New Project”.
  6. +
  7. Type in the name of the directory to store your project, +e.g. “my_project”.
  8. +
  9. If available, select the checkbox for “Create a git +repository.”
  10. +
  11. Click the “Create Project” button.
  12. +
+
+
+
+

The simplest way to open an RStudio project once it has been created +is to click through your file system to get to the directory where it +was saved and double click on the .Rproj file. This will +open RStudio and start your R session in the same directory as the +.Rproj file. All your data, plots and scripts will now be +relative to the project directory. RStudio projects have the added +benefit of allowing you to open multiple projects at the same time each +open to its own project directory. This allows you to keep multiple +projects open without them interfering with each other.

+
+
+ +
+
+

Challenge 2: Opening an RStudio project +through the file system +

+
+
    +
  1. Exit RStudio.
  2. +
  3. Navigate to the directory where you created a project in Challenge +1.
  4. +
  5. Double click on the .Rproj file in that directory.
  6. +
+
+
+
+

Best practices for project organization +

+
+

Although there is no “best” way to lay out a project, there are some +general principles to adhere to that will make project management +easier:

+
+

Treat data as read only +

+

This is probably the most important goal of setting up a project. Data is typically time consuming and/or expensive to collect. Working with it interactively (e.g., in Excel), where it can be modified, means you are never sure of where the data came from or how it has been modified since collection. It is therefore a good idea to treat your data as “read-only”.

+
+
+

Data Cleaning +

+

In many cases your data will be “dirty”: it will need significant preprocessing to get into a format R (or any other programming language) will find useful. This task is sometimes called “data munging”. Storing the scripts that do this cleaning in a separate folder, and creating a second “read-only” data folder to hold the “cleaned” data sets, can prevent confusion between the two sets.

+
+
+

Treat generated output as disposable +

+

Anything generated by your scripts should be treated as disposable: +it should all be able to be regenerated from your scripts.

+

There are lots of different ways to manage this output. Having an output folder with different sub-directories for each separate analysis makes it easier later, since many analyses are exploratory and don’t end up being used in the final project, and some of the analyses get shared between projects.

+
+
+ +
+
+

Tip: Good Enough Practices for Scientific +Computing +

+
+

Good +Enough Practices for Scientific Computing gives the following +recommendations for project organization:

+
    +
  1. Put each project in its own directory, which is named after the +project.
  2. +
  3. Put text documents associated with the project in the +doc directory.
  4. +
  5. Put raw data and metadata in the data directory, and +files generated during cleanup and analysis in a results +directory.
  6. +
  7. Put source for the project’s scripts and programs in the +src directory, and programs brought in from elsewhere or +compiled locally in the bin directory.
  8. +
  9. Name all files to reflect their content or function.
  10. +
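The layout recommended above can also be created from R itself. The following is only a minimal sketch: the project name `my_project` and the use of `tempdir()` (so that nothing in your real file system is touched) are assumptions made for illustration.

```r
# Create the suggested sub-directories inside a throwaway project folder.
# "my_project" and tempdir() are illustrative choices, not part of the lesson.
project_dir <- file.path(tempdir(), "my_project")
sub_dirs <- c("doc", "data", "results", "src", "bin")

for (d in sub_dirs) {
  # recursive = TRUE also creates project_dir itself if it is missing
  dir.create(file.path(project_dir, d), recursive = TRUE, showWarnings = FALSE)
}

# List what was created
list.files(project_dir)
```

In a real project you would replace `project_dir` with the directory that holds your `.Rproj` file.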
+
+
+
+
+
+

Separate function definition and application +

+

One of the more effective ways to work with R is to start by writing +the code you want to run directly in a .R script, and then running the +selected lines (either using the keyboard shortcuts in RStudio or +clicking the “Run” button) in the interactive R console.

+

When your project is in its early stages, the initial .R script file usually contains many lines of directly executed code. As it matures, reusable chunks get pulled into their own functions. It’s a good idea to keep these in two separate folders: one to store useful functions that you’ll reuse across analyses and projects, and one to store the analysis scripts.
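One way this separation might look in practice is sketched below. The file name `functions.R` and the helper `kg_to_g()` are hypothetical, and a temporary directory stands in for a real project folder so the example is self-contained.

```r
# A file that only DEFINES functions (hypothetical name and content)
functions_file <- file.path(tempdir(), "functions.R")
writeLines("kg_to_g <- function(kg) kg * 1000", functions_file)

# An analysis script would then load the definitions and APPLY them
source(functions_file)
weights_kg <- c(2.1, 5.0, 3.2)
weights_g <- kg_to_g(weights_kg)
weights_g
```

Keeping definitions in their own file means several analysis scripts can `source()` the same functions without copying code.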

+
+
+

Save the data in the data directory +

+

Now that we have a good directory structure, we will save the data file in the data/ directory.

+
+
+ +
+
+

Challenge 3 +

+
+

Download the gapminder data from here.

+
    +
  1. Download the file (right mouse click on the link above -> “Save +link as” / “Save file as”, or click on the link and after the page +loads, press Ctrl+S or choose File -> “Save +page as”)
  2. +
  3. Make sure it’s saved under the name +gapminder_data.csv +
  4. +
  5. Save the file in the data/ folder within your +project.
  6. +
+

We will load and inspect these data later.

+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

It is useful to get some general idea about the dataset, directly +from the command line, before loading it into R. Understanding the +dataset better will come in handy when making decisions on how to load +it in R. Use the command-line shell to answer the following +questions:

+
    +
  1. What is the size of the file?
  2. +
  3. How many rows of data does it contain?
  4. +
  5. What kinds of values are stored in this file?
  6. +
+
+
+
+
+
+ +
+
+

By running these commands in the shell:

+
+

SH +

+
ls -lh data/gapminder_data.csv
+
+
+

OUTPUT +

+
-rw-r--r-- 1 runner docker 80K Oct 26 09:54 data/gapminder_data.csv
+
+

The file size is 80K.

+
+

SH +

+
wc -l data/gapminder_data.csv
+
+
+

OUTPUT +

+
1705 data/gapminder_data.csv
+
+

There are 1705 lines. The data looks like:

+
+

SH +

+
head data/gapminder_data.csv
+
+
+

OUTPUT +

+
country,year,pop,continent,lifeExp,gdpPercap
+Afghanistan,1952,8425333,Asia,28.801,779.4453145
+Afghanistan,1957,9240934,Asia,30.332,820.8530296
+Afghanistan,1962,10267083,Asia,31.997,853.10071
+Afghanistan,1967,11537966,Asia,34.02,836.1971382
+Afghanistan,1972,13079460,Asia,36.088,739.9811058
+Afghanistan,1977,14880372,Asia,38.438,786.11336
+Afghanistan,1982,12881816,Asia,39.854,978.0114388
+Afghanistan,1987,13867957,Asia,40.822,852.3959448
+Afghanistan,1992,16317921,Asia,41.674,649.3413952
+
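If no command-line shell is available, similar questions can be answered from within R. This is only a sketch: it writes a tiny stand-in file to a temporary location (an assumption, so the example runs without the real download) instead of using `data/gapminder_data.csv`.

```r
# Write a small stand-in CSV file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("country,year,pop",
             "Afghanistan,1952,8425333",
             "Afghanistan,1957,9240934"),
           csv_path)

file.size(csv_path)                # size of the file in bytes
n_lines <- length(readLines(csv_path))
n_lines                            # number of lines, including the header
head(readLines(csv_path), n = 2)   # first lines as plain text
```

With the real data you would set `csv_path <- "data/gapminder_data.csv"` instead.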
+
+
+
+
+
+
+ +
+
+

Tip: command line in RStudio +

+
+

The Terminal tab in the console pane provides a convenient place within RStudio to interact directly with the command line.

+
+
+
+
+
+

Working directory +

+

Knowing R’s current working directory is important because when you +need to access other files (for example, to import a data file), R will +look for them relative to the current working directory.

+

Each time you create a new RStudio Project, it will create a new +directory for that project. When you open an existing +.Rproj file, it will open that project and set R’s working +directory to the folder that file is in.

+
+
+ +
+
+

Challenge 5 +

+
+

You can check the current working directory with the +getwd() command, or by using the menus in RStudio.

+
    +
  1. In the console, type getwd() (“wd” is short for +“working directory”) and hit Enter.
  2. +
  3. In the Files pane, double click on the data folder to +open it (or navigate to any other folder you wish). To get the Files +pane back to the current working directory, click “More” and then select +“Go To Working Directory”.
  4. +
+

You can change the working directory with setwd(), or by +using RStudio menus.

+
    +
  1. In the console, type setwd("data") and hit Enter. Type +getwd() and hit Enter to see the new working +directory.
  2. +
  3. In the menus at the top of the RStudio window, click the “Session” menu button, and then select “Set Working Directory” and then “Choose Directory”. Next, in the file browser window that opens, navigate back to the project directory, and click “Open”. Note that a setwd command will automatically appear in the console.
  4. +
+
+
+
+
+
+ +
+
+

Tip: File does not exist errors +

+
+

When you’re attempting to reference a file in your R code and you’re +getting errors saying the file doesn’t exist, it’s a good idea to check +your working directory. You need to either provide an absolute path to +the file, or you need to make sure the file is saved in the working +directory (or a subfolder of the working directory) and provide a +relative path.
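A small defensive check along these lines can make such errors easier to diagnose. This is a sketch only, and it assumes the file lives at `data/gapminder_data.csv` relative to the working directory, as set up earlier in this episode.

```r
# Build the relative path in a platform-independent way
data_path <- file.path("data", "gapminder_data.csv")

if (file.exists(data_path)) {
  gapminder <- read.csv(data_path)
} else {
  # Show where R is actually looking, which usually reveals the problem
  message("Cannot find '", data_path, "' relative to: ", getwd())
}
```

If the message appears, compare the printed working directory with the location of your project folder.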

+
+
+
+
+
+

Version Control +

+

It is important to use version control with projects. Go here +for a good lesson which describes using Git with RStudio.

+
+
+ +
+
+

Keypoints +

+
+
    +
  • Use RStudio to create and manage projects with consistent +layout.
  • +
  • Treat raw data as read-only.
  • +
  • Treat generated output as disposable.
  • +
  • Separate function definition and application.
  • +
+
+
+
+
+

Content from Seeking Help

+
+

Last updated on 2023-10-26

+

Estimated time 20 minutes

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I get help in R?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • To be able to read R help files for functions and special +operators.
  • +
  • To be able to use CRAN task views to identify packages to solve a +problem.
  • +
  • To be able to seek help from your peers.
  • +
+
+
+
+
+
+

Reading Help Files +

+
+

R, and every package, provide help files for functions. The general syntax to look up help on any function, “function_name”, that is in a package loaded into your namespace (your interactive R session) is:

+
+

R +

+
+?function_name
+help(function_name)
+
+

For example, take a look at the help file for write.table(); we will be using a similar function in an upcoming episode.

+
+

R +

+
+?write.table()
+
+

This will load up a help page in RStudio (or as plain text in R +itself).

+

Each help page is broken down into sections:

+
    +
  • Description: An extended description of what the function does.
  • +
  • Usage: The arguments of the function and their default values (which +can be changed).
  • +
  • Arguments: An explanation of the data each argument is +expecting.
  • +
  • Details: Any important details to be aware of.
  • +
  • Value: The data the function returns.
  • +
  • See Also: Any related functions you might find useful.
  • +
  • Examples: Some examples for how to use the function.
  • +
+

Different functions might have different sections, but these are the +main ones you should be aware of.

+

Notice how related functions might call for the same help file:

+
+

R +

+
+?write.table()
+?write.csv()
+
+

This is because these functions have very similar applicability and +often share the same arguments as inputs to the function, so package +authors often choose to document them together in a single help +file.

+
+
+ +
+
+

Tip: Running Examples +

+
+

From within the function help page, you can highlight code in the Examples and hit Ctrl+Return to run it in the RStudio console. This gives you a quick way to get a feel for how a function works.

+
+
+
+
+
+ +
+
+

Tip: Reading Help Files +

+
+

One of the most daunting aspects of R is the large number of functions available. It would be prohibitive, if not impossible, to remember the correct usage for every function you use. Luckily, using the help files means you don’t have to remember that!

+
+
+
+

Special Operators +

+
+

To seek help on special operators, use quotes or backticks:

+
+

R +

+
+?"<-"
+?`<-`
+
+

Getting Help with Packages +

+
+

Many packages come with “vignettes”: tutorials and extended example +documentation. Without any arguments, vignette() will list +all vignettes for all installed packages; +vignette(package="package-name") will list all available +vignettes for package-name, and +vignette("vignette-name") will open the specified +vignette.
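As a brief illustration, the listing for a single package can be captured and inspected rather than just displayed. The sketch below uses the `utils` package only because it ships with every R installation; which vignettes are actually listed depends on your installation.

```r
# Capture the vignette listing for one package instead of displaying it
utils_vignettes <- vignette(package = "utils")

# The listing stores its entries in a matrix; "Item" holds the vignette names
utils_vignettes$results[, "Item"]
```

Any name shown there could then be passed to `vignette()` to open that vignette.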

+

If a package doesn’t have any vignettes, you can usually find help by +typing help("package-name").

+

RStudio also has a set of excellent cheatsheets for +many packages.

+

When You Remember Part of the Function Name +

+
+

If you’re not sure what package a function is in or how it’s +specifically spelled, you can do a fuzzy search:

+
+

R +

+
+??function_name
+
+

A fuzzy search is when you search for an approximate string match. +For example, you may remember that the function to set your working +directory includes “set” in its name. You can do a fuzzy search to help +you identify the function:

+
+

R +

+
+??set
+
+
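A related base R tool is `apropos()`, which matches a pattern against the names of objects on the search path rather than searching the help pages themselves. The sketch below is an illustration of that alternative, not a replacement for `??`.

```r
# All visible object names that start with "set"
set_functions <- apropos("^set")
head(set_functions)

# Check that the function we were trying to remember is among them
"setwd" %in% set_functions
```

Because `apropos()` only sees attached packages, `??` remains the better choice when the function might live in a package you have not loaded.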

When You Have No Idea Where to Begin +

+
+

If you don’t know what function or package you need to use, CRAN Task Views is a specially maintained list of packages grouped into fields. This can be a good starting point.

+

When Your Code Doesn’t Work: Seeking Help from Your Peers +

+
+

If you’re having trouble using a function, 9 times out of 10 the question you have has already been answered on Stack Overflow. You can search using the [r] tag. Please make sure to see their page on how to ask a good question.

+

If you can’t find the answer, there are a few useful functions to +help you ask your peers:

+
+

R +

+
+?dput
+
+

The dput() function will dump the data you’re working with into a format that can be copied and pasted by others into their own R session.
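To give a feel for this, here is a minimal sketch using a small made-up data frame (`cats_example` is a hypothetical name). It prints the `dput()` representation and then rebuilds the object from that text, which is what a helper would do after copying it from your question.

```r
# A small, made-up data frame to share
cats_example <- data.frame(coat = c("calico", "black"),
                           weight = c(2.1, 5.0))

# Print a text representation that can be pasted into a post or email
dput(cats_example)

# Simulate the recipient: capture that text and evaluate it again
dput_text <- capture.output(dput(cats_example))
rebuilt <- eval(parse(text = dput_text))
all.equal(rebuilt, cats_example)
```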

+
+

R +

+
+sessionInfo()
+
+
+

OUTPUT +

+
R version 4.3.1 (2023-06-16)
+Platform: x86_64-pc-linux-gnu (64-bit)
+Running under: Ubuntu 22.04.3 LTS
+
+Matrix products: default
+BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
+LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
+
+locale:
+ [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
+ [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
+ [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
+[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
+
+time zone: UTC
+tzcode source: system (glibc)
+
+attached base packages:
+[1] stats     graphics  grDevices utils     datasets  methods   base     
+
+loaded via a namespace (and not attached):
+[1] compiler_4.3.1    tools_4.3.1       rstudioapi_0.15.0 yaml_2.3.7       
+[5] knitr_1.43        xfun_0.40         renv_1.0.3        evaluate_0.21    
+
+

The sessionInfo() function will print out your current version of R, as well as any packages you have loaded. This can be useful for others to help reproduce and debug your issue.

+
+
+ +
+
+

Challenge 1 +

+
+

Look at the help page for the c function. What kind of +vector do you expect will be created if you evaluate the following:

+
+

R +

+
+c(1, 2, 3)
+c('d', 'e', 'f')
+c(1, 2, 'f')
+
+
+
+
+
+
+ +
+
+

The c() function creates a vector, in which all elements +are of the same type. In the first case, the elements are numeric, in +the second, they are characters, and in the third they are also +characters: the numeric values are “coerced” to be characters.

+
+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

Look at the help for the paste function. You will need +to use it later. What’s the difference between the sep and +collapse arguments?

+
+
+
+
+
+ +
+
+

To look at the help for the paste() function, use:

+
+

R +

+
+help("paste")
+?paste
+
+

The difference between sep and collapse is +a little tricky. The paste function accepts any number of +arguments, each of which can be a vector of any length. The +sep argument specifies the string used between concatenated +terms — by default, a space. The result is a vector as long as the +longest argument supplied to paste. In contrast, +collapse specifies that after concatenation the elements +are collapsed together using the given separator, the result +being a single string.

+

It is important to call the arguments explicitly by typing out the argument name, e.g. sep = ",", so the function understands to use the “,” as a separator and not as a term to concatenate. For example:

+
+

R +

+
+paste(c("a","b"), "c")
+
+
+

OUTPUT +

+
[1] "a c" "b c"
+
+
+

R +

+
+paste(c("a","b"), "c", ",")
+
+
+

OUTPUT +

+
[1] "a c ," "b c ,"
+
+
+

R +

+
+paste(c("a","b"), "c", sep = ",")
+
+
+

OUTPUT +

+
[1] "a,c" "b,c"
+
+
+

R +

+
+paste(c("a","b"), "c", collapse = "|")
+
+
+

OUTPUT +

+
[1] "a c|b c"
+
+
+

R +

+
+paste(c("a","b"), "c", sep = ",", collapse = "|")
+
+
+

OUTPUT +

+
[1] "a,c|b,c"
+
+

(For more information, scroll to the bottom of the +?paste help page and look at the examples, or try +example('paste').)

+
+
+
+
+
+
+ +
+
+

Challenge 3 +

+
+

Use help to find a function (and its associated parameters) that you +could use to load data from a tabular file in which columns are +delimited with “\t” (tab) and the decimal point is a “.” (period). This +check for decimal separator is important, especially if you are working +with international colleagues, because different countries have +different conventions for the decimal point (i.e. comma vs period). +Hint: use ??"read table" to look up functions related to +reading in tabular data.

+
+
+
+
+
+ +
+
+

The standard R function for reading tab-delimited files with a period +decimal separator is read.delim(). You can also do this with +read.table(file, sep="\t") (the period is the +default decimal separator for read.table()), +although you may have to change the comment.char argument +as well if your data file contains hash (#) characters.
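The two approaches can be compared on a small file. This is a sketch under assumptions: the file content is made up and written to a temporary location, and `header = TRUE` is added to the `read.table()` call because, unlike `read.delim()`, it does not treat the first line as column names by default.

```r
# Write a tiny tab-delimited file to a temporary location
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("coat\tweight", "calico\t2.1", "black\t5.0"), tsv_path)

# read.delim() assumes tabs, a header line, and "." as decimal separator
from_delim <- read.delim(tsv_path)

# The same result with read.table(), spelling out the settings explicitly
from_table <- read.table(tsv_path, sep = "\t", header = TRUE, dec = ".")

all.equal(from_delim, from_table)
```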

+
+
+
+
+

Other Resources +

+
+ +
+
+ +
+
+

Keypoints +

+
+
    +
  • Use help() to get online help in R.
  • +
+
+
+
+

Content from Data Structures

+
+

Last updated on 2023-10-26

+

Estimated time 55 minutes

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I read data in R?
  • +
  • What are the basic data types in R?
  • +
  • How do I represent categorical information in R?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • To be able to identify the 5 main data types.
  • +
  • To begin exploring data frames, and understand how they are related +to vectors and lists.
  • +
  • To be able to ask questions from R about the type, class, and +structure of an object.
  • +
  • To understand the information of the attributes “names”, “class”, +and “dim”.
  • +
+
+
+
+
+
+

One of R’s most powerful features is its ability to deal with tabular +data - such as you may already have in a spreadsheet or a CSV file. +Let’s start by making a toy dataset in your data/ +directory, called feline-data.csv:

+
+

R +

+
+cats <- data.frame(coat = c("calico", "black", "tabby"),
+                    weight = c(2.1, 5.0, 3.2),
+                    likes_string = c(1, 0, 1))
+
+

We can now save cats as a CSV file. It is good practice +to call the argument names explicitly so the function knows what default +values you are changing. Here we are setting +row.names = FALSE. Recall you can use +?write.csv to pull up the help file to check out the +argument names and their default values.

+
+

R +

+
+write.csv(x = cats, file = "data/feline-data.csv", row.names = FALSE)
+
+

The contents of the new file, feline-data.csv:

+
+

R +

+
coat,weight,likes_string
+calico,2.1,1
+black,5.0,0
+tabby,3.2,1
+
+
+
+ +
+
+

Tip: Editing Text files in R +

+
+

Alternatively, you can create data/feline-data.csv using +a text editor (Nano), or within RStudio with the File -> New +File -> Text File menu item.

+
+
+
+

We can load this into R via the following:

+
+

R +

+
+cats <- read.csv(file = "data/feline-data.csv")
+cats
+
+
+

OUTPUT +

+
    coat weight likes_string
+1 calico    2.1            1
+2  black    5.0            0
+3  tabby    3.2            1
+
+

The read.table function is used for reading in tabular data stored in a text file where the columns of data are separated by punctuation characters, as in CSV files (csv = comma-separated values). Tabs and commas are the most common punctuation characters used to separate or delimit data points in such files. For convenience R provides two other versions of read.table: read.csv for files where the data are separated with commas, and read.delim for files where the data are separated with tabs. Of these three functions read.csv is the most commonly used. If needed, it is possible to override the default delimiting punctuation marks for both read.csv and read.delim.
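As an example of overriding the delimiter, the sketch below reads semicolon-separated text with `read.csv()`. The data are made up, and the `text` argument is used in place of a file so the example is self-contained.

```r
# Semicolon-separated data supplied as text rather than as a file
semi_text <- "coat;weight\ncalico;2.1\nblack;5.0"

# Override the default comma separator
cats_semi <- read.csv(text = semi_text, sep = ";")
str(cats_semi)
```

Without `sep = ";"`, each line would be read into a single column, because no commas are present.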

+
+
+ +
+
+

Check your data for factors +

+
+

In recent times, the default way R handles textual data has changed. Previously, R automatically converted text data into a format called “factors”. But there is a simpler format, called “character”. We will hear about factors later, and what to use them for. For now, remember that in most cases they are not needed and only complicate your life, which is why newer R versions read in text as “character”. Check now whether your version of R has automatically created factors and, if so, convert them to “character” format:

+
    +
  1. Check the data types of your input by typing +str(cats) +
  2. +
  3. In the output, look at the type codes after the colons: if you see only “num”, “int” and “chr”, you can continue with the lesson and skip this box. If you find “Factor”, continue to step 3.
  4. +
  5. Prevent R from automatically creating “factor” data. That can be +done by the following code: +options(stringsAsFactors = FALSE). Then, re-read the cats +table for the change to take effect.
  6. +
  7. You must set this option every time you restart R. To not forget +this, include it in your analysis script before you read in any data, +for example in one of the first lines.
  8. +
  9. In R 4.0.0 and later, text data is no longer converted to factors by default, so you can install that or a newer version to avoid this problem. If you are working on an institute or company computer, ask your administrator to do it.
  10. +
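If a column has already been read in as a factor, it can also be converted after the fact. The following sketch deliberately creates a factor column (via `stringsAsFactors = TRUE`) to imitate the older behaviour; the object name `cats_factor` is made up for illustration.

```r
# Imitate the old default by asking for factors explicitly
cats_factor <- data.frame(coat = c("calico", "black", "tabby"),
                          stringsAsFactors = TRUE)
str(cats_factor)   # the coat column is reported as a Factor

# Convert the factor column back to plain text
cats_factor$coat <- as.character(cats_factor$coat)
str(cats_factor)   # the coat column is now chr
```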
+
+
+
+

We can begin exploring our dataset right away, pulling out columns by +specifying them using the $ operator:

+
+

R +

+
+cats$weight
+
+
+

OUTPUT +

+
[1] 2.1 5.0 3.2
+
+
+

R +

+
+cats$coat
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+

We can do other operations on the columns:

+
+

R +

+
+## Say we discovered that the scale weighs two kg light:
+cats$weight + 2
+
+
+

OUTPUT +

+
[1] 4.1 7.0 5.2
+
+
+

R +

+
+paste("My cat is", cats$coat)
+
+
+

OUTPUT +

+
[1] "My cat is calico" "My cat is black"  "My cat is tabby" 
+
+

But what about

+
+

R +

+
+cats$weight + cats$coat
+
+
+

ERROR +

+
Error in cats$weight + cats$coat: non-numeric argument to binary operator
+
+

Understanding what happened here is key to successfully analyzing +data in R.

+
+

Data Types +

+

If you guessed that the last command will return an error because +2.1 plus "black" is nonsense, you’re right - +and you already have some intuition for an important concept in +programming called data types. We can ask what type of data +something is:

+
+

R +

+
+typeof(cats$weight)
+
+
+

OUTPUT +

+
[1] "double"
+
+

There are 5 main types: double, integer, +complex, logical and character. +For historic reasons, double is also called +numeric.

+
+

R +

+
+typeof(3.14)
+
+
+

OUTPUT +

+
[1] "double"
+
+
+

R +

+
+typeof(1L) # The L suffix forces the number to be an integer, since by default R uses float numbers
+
+
+

OUTPUT +

+
[1] "integer"
+
+
+

R +

+
+typeof(1+1i)
+
+
+

OUTPUT +

+
[1] "complex"
+
+
+

R +

+
+typeof(TRUE)
+
+
+

OUTPUT +

+
[1] "logical"
+
+
+

R +

+
+typeof('banana')
+
+
+

OUTPUT +

+
[1] "character"
+
+

No matter how complicated our analyses become, all data in R is +interpreted as one of these basic data types. This strictness has some +really important consequences.

+

A user has added details of another cat. This information is in the +file data/feline-data_v2.csv.

+
+

R +

+
+file.show("data/feline-data_v2.csv")
+
+
+

R +

+
coat,weight,likes_string
+calico,2.1,1
+black,5.0,0
+tabby,3.2,1
+tabby,2.3 or 2.4,1
+
+

Load the new cats data like before, and check what type of data we +find in the weight column:

+
+

R +

+
+cats <- read.csv(file="data/feline-data_v2.csv")
+typeof(cats$weight)
+
+
+

OUTPUT +

+
[1] "character"
+
+

Oh no, our weights aren’t the double type anymore! If we try to do +the same math we did on them before, we run into trouble:

+
+

R +

+
+cats$weight + 2
+
+
+

ERROR +

+
Error in cats$weight + 2: non-numeric argument to binary operator
+
+

What happened? The cats data we are working with is +something called a data frame. Data frames are one of the most +common and versatile types of data structures we will work with +in R. A given column in a data frame cannot be composed of different +data types. In this case, R does not read everything in the data frame +column weight as a double, therefore the entire +column data type changes to something that is suitable for everything in +the column.

+

When R reads a csv file, it reads it in as a data frame. Thus, when we loaded the cats csv file, it was stored as a data frame. We can recognize data frames by the first row that is written by the str() function:

+
+

R +

+
+str(cats)
+
+
+

OUTPUT +

+
'data.frame':	4 obs. of  3 variables:
+ $ coat        : chr  "calico" "black" "tabby" "tabby"
+ $ weight      : chr  "2.1" "5" "3.2" "2.3 or 2.4"
+ $ likes_string: int  1 0 1 1
+
+

Data frames are composed of rows and columns, where each +column has the same number of rows. Different columns in a data frame +can be made up of different data types (this is what makes them so +versatile), but everything in a given column needs to be the same type +(e.g., vector, factor, or list).

+

Let’s explore more about different data structures and how they +behave. For now, let’s remove that extra line from our cats data and +reload it, while we investigate this behavior further:

+

feline-data.csv:

+
coat,weight,likes_string
+calico,2.1,1
+black,5.0,0
+tabby,3.2,1
+

And back in RStudio:

+
+

R +

+
+cats <- read.csv(file="data/feline-data.csv")
+
+
+
+

Vectors and Type Coercion +

+

To better understand this behavior, let’s meet another of the data +structures: the vector.

+
+

R +

+
+my_vector <- vector(length = 3)
+my_vector
+
+
+

OUTPUT +

+
[1] FALSE FALSE FALSE
+
+

A vector in R is essentially an ordered list of things, with the +special condition that everything in the vector must be the same +basic data type. If you don’t choose the datatype, it’ll default to +logical; or, you can declare an empty vector of whatever +type you like.

+
+

R +

+
+another_vector <- vector(mode='character', length=3)
+another_vector
+
+
+

OUTPUT +

+
[1] "" "" ""
+
+

You can check if something is a vector:

+
+

R +

+
+str(another_vector)
+
+
+

OUTPUT +

+
 chr [1:3] "" "" ""
+
+

The somewhat cryptic output from this command indicates the basic +data type found in this vector - in this case chr, +character; an indication of the number of things in the vector - +actually, the indexes of the vector, in this case [1:3]; +and a few examples of what’s actually in the vector - in this case empty +character strings. If we similarly do

+
+

R +

+
+str(cats$weight)
+
+
+

OUTPUT +

+
 num [1:3] 2.1 5 3.2
+
+

we see that cats$weight is a vector, too - the +columns of data we load into R data.frames are all vectors, and +that’s the root of why R forces everything in a column to be the same +basic data type.
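This relationship can be checked directly. The sketch below rebuilds a small stand-in for the cats data (named `cats_demo` here so it does not overwrite your `cats` object) and asks R about the columns.

```r
# A stand-in for the cats data, built in place
cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
                        weight = c(2.1, 5.0, 3.2),
                        stringsAsFactors = FALSE)

is.vector(cats_demo$weight)   # a single column is a vector
is.list(cats_demo)            # the data frame itself is a list of such vectors

# The basic type of every column at once
column_types <- sapply(cats_demo, typeof)
column_types
```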

+
+
+ +
+
+

Discussion 1 +

+
+

Why is R so opinionated about what we put in our columns of data? How +does this help us?

+
+
+ +
+
+

By keeping everything in a column the same, we allow ourselves to +make simple assumptions about our data; if you can interpret one entry +in the column as a number, then you can interpret all of them +as numbers, so we don’t have to check every time. This consistency is +what people mean when they talk about clean data; in the long +run, strict consistency goes a long way to making our lives easier in +R.

+
+
+
+
+
+
+
+
+

Coercion by combining vectors +

+

You can also make vectors with explicit contents with the combine +function:

+
+

R +

+
+combine_vector <- c(2,6,3)
+combine_vector
+
+
+

OUTPUT +

+
[1] 2 6 3
+
+

Given what we’ve learned so far, what do you think the following will +produce?

+
+

R +

+
+quiz_vector <- c(2,6,'3')
+
+

This is something called type coercion, and it is the source +of many surprises and the reason why we need to be aware of the basic +data types and how R will interpret them. When R encounters a mix of +types (here double and character) to be combined into a single vector, +it will force them all to be the same type. Consider:

+
+

R +

+
+coercion_vector <- c('a', TRUE)
+coercion_vector
+
+
+

OUTPUT +

+
[1] "a"    "TRUE"
+
+
+

R +

+
+another_coercion_vector <- c(0, TRUE)
+another_coercion_vector
+
+
+

OUTPUT +

+
[1] 0 1
+
+
+
+

The type hierarchy +

+

The coercion rules go: logical -> +integer -> double (“numeric”) +-> complex -> character, where -> can +be read as are transformed into. For example, combining +logical and character transforms the result to +character:

+
+

R +

+
+c('a', TRUE)
+
+
+

OUTPUT +

+
[1] "a"    "TRUE"
+
+

A quick way to recognize character vectors is by the +quotes that enclose them when they are printed.
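Each step of this hierarchy can be observed by combining two neighbouring types and asking for the type of the result. The variable names below are made up for this sketch.

```r
# Combine pairs of types and record the type of the resulting vector
logical_with_integer   <- typeof(c(TRUE, 1L))
integer_with_double    <- typeof(c(1L, 2.5))
double_with_complex    <- typeof(c(2.5, 1+1i))
complex_with_character <- typeof(c(1+1i, "a"))

c(logical_with_integer, integer_with_double,
  double_with_complex, complex_with_character)
```

In every pair, the result takes the type that sits further to the right in the hierarchy.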

+

You can try to force coercion against this flow using the +as. functions:

+
+

R +

+
+character_vector_example <- c('0','2','4')
+character_vector_example
+
+
+

OUTPUT +

+
[1] "0" "2" "4"
+
+
+

R +

+
+character_coerced_to_double <- as.double(character_vector_example)
+character_coerced_to_double
+
+
+

OUTPUT +

+
[1] 0 2 4
+
+
+

R +

+
+double_coerced_to_logical <- as.logical(character_coerced_to_double)
+double_coerced_to_logical
+
+
+

OUTPUT +

+
[1] FALSE  TRUE  TRUE
+
+

As you can see, some surprising things can happen when R forces one +basic data type into another! Nitty-gritty of type coercion aside, the +point is: if your data doesn’t look like what you thought it was going +to look like, type coercion may well be to blame; make sure everything +is the same type in your vectors and your columns of data.frames, or you +will get nasty surprises!
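One such surprise is what happens when text that is not a clean number gets coerced. The sketch below uses made-up values mirroring the weight column from earlier; `suppressWarnings()` is used only to keep the output short, since R would otherwise warn that NAs were introduced.

```r
# Text values, one of which is not a single number
messy_weights <- c("2.1", "5", "3.2", "2.3 or 2.4")

# Values that cannot be interpreted as numbers become NA
converted <- suppressWarnings(as.numeric(messy_weights))
converted

# Locate the entries that failed to convert
which(is.na(converted))
```

Checking for `NA` after a conversion is a quick way to find the entries that need manual cleaning.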

+

But coercion can also be very useful! For example, in our +cats data likes_string is numeric, but we know +that the 1s and 0s actually represent TRUE and +FALSE (a common way of representing them). We should use +the logical datatype here, which has two states: +TRUE or FALSE, which is exactly what our data +represents. We can ‘coerce’ this column to be logical by +using the as.logical function:

+
+

R +

+
+cats$likes_string
+
+
+

OUTPUT +

+
[1] 1 0 1
+
+
+

R +

+
+cats$likes_string <- as.logical(cats$likes_string)
+cats$likes_string
+
+
+

OUTPUT +

+
[1]  TRUE FALSE  TRUE
+
+
+
+ +
+
+

Challenge 1 +

+
+

An important part of every data analysis is cleaning the input data. If you know that the input data is all of the same format (e.g. numbers), your analysis is much easier! Clean the cat data set from the section about type coercion.

+
+

Copy the code template +

+

Create a new script in RStudio and copy and paste the following code. +Then move on to the tasks below, which help you to fill in the gaps +(______).

+
# Read data
+cats <- read.csv("data/feline-data_v2.csv")
+
+# 1. Print the data
+_____
+
+# 2. Show an overview of the table with all data types
+_____(cats)
+
+# 3. The "weight" column has the incorrect data type __________.
+#    The correct data type is: ____________.
+
+# 4. Correct the 4th weight data point with the mean of the two given values
+cats$weight[4] <- 2.35
+#    print the data again to see the effect
+cats
+
+# 5. Convert the weight to the right data type
+cats$weight <- ______________(cats$weight)
+
+#    Calculate the mean to test yourself
+mean(cats$weight)
+
+# If you see the correct mean value (and not NA), you did the exercise
+# correctly!
+
+
+

Instructions for the tasks +

+
+ +

Execute the first statement (read.csv(...)). Then print the data to the console.

+
+
+
+
+
+
+
+ +
+
+

Show the content of any variable by typing its name.

+
+

Solution to Challenge 1.1 +

+

Two correct solutions:

+
cats
+print(cats)
+
+
+
+
+
+
+
+ +
+
+

2. Overview of the data types +

+
+

The data type of your data is as important as the data itself. Use a +function we saw earlier to print out the data types of all columns of +the cats table.

+
+
+
+
+
+ +
+
+

In the chapter “Data types” we saw two functions that can show data +types. One printed just a single word, the data type name. The other +printed a short form of the data type, and the first few values. We need +the second here.

+
+
+
+
+
+
+ +
+
+

Challenge 1 (continued) +

+
+
+

Solution to Challenge 1.2

+
str(cats)
+
+
+

3. Which data type do we need? +

+

The shown data type is not the right one for this data (weight of a +cat). Which data type do we need?

+
    +
  • Why did the read.csv() function not choose the correct +data type?
  • +
  • Fill in the gap in the comment with the correct data type for cat +weight!
  • +
+
+
+
+
+
+
+ +
+
+

Scroll up to the section about the type +hierarchy to review the available data types

+
+
+
+
+
+
+ +
+
+
    +
  • Weight is expressed on a continuous scale (real numbers). The R data +type for this is “double” (also known as “numeric”).
  • +
  • The fourth row has the value “2.3 or 2.4”. That is not a single number but two numbers and an English word. Therefore, the “character” data type is chosen. The whole column is now text, because all values in the same column have to be the same data type.
  • +
+
+
+
+
+
+
+ +
+
+

4. Correct the problematic value +

+
+

The code to assign a new weight value to the problematic fourth row +is given. Think first and then execute it: What will be the data type +after assigning a number like in this example? You can check the data +type after executing to see if you were right.

+
+
+
+
+
+ +
+
+

Revisit the hierarchy of data types when two different data types are +combined.

+
+
+
+
+
+
+ +
+
+

Challenge 1 (continued) +

+
+
+

Solution to challenge 1.4

+

The data type of the column “weight” is “character”. The assigned +data type is “double”. Combining two data types yields the data type +that is higher in the following hierarchy:

+
logical < integer < double < complex < character
+

Therefore, the column is still of type character! We need to manually convert it to “double”.

+
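The coercion rule in this solution can be tried on a small vector. This is a minimal sketch: `weight` is a hypothetical stand-in for the `cats$weight` column as `read.csv()` would have read it, not the real data frame.

```r
# Hypothetical stand-in for the weight column: one non-numeric entry
# has turned the whole vector into character.
weight <- c("2.1", "5.0", "3.2", "2.3 or 2.4")
typeof(weight)

# Assigning a double into a character vector converts the new value
# to text; the vector itself keeps its type.
weight[4] <- 2.35
typeof(weight)
weight

# The same hierarchy applies whenever c() combines two types.
typeof(c(TRUE, 1L))   # logical and integer
typeof(c(1L, 2.5))    # integer and double
typeof(c(2.5, "a"))   # double and character
```

The result is always the type that sits further to the right in the hierarchy, which is why the column still needs an explicit conversion afterwards.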
+
+

5. Convert the column “weight” to the correct data type +

+

Cat weights are numbers, but the column does not have this data type yet. Coerce the column to floating point numbers.

+
+
+
+
+
+
+ +
+
+

The functions to convert data types start with as.. You +can look for the function further up in the manuscript or use the +RStudio auto-complete function: Type “as.” and then press +the TAB key.

+
+
+
+
+
+
+ +
+
+

Challenge 1 (continued) +

+
+
+

Solution to Challenge 1.5

+

There are two functions that are synonymous for historic reasons:

+
cats$weight <- as.double(cats$weight)
+cats$weight <- as.numeric(cats$weight)
+
+
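To see why step 4 (fixing the problematic value) has to come before this conversion, here is a small sketch on a made-up character vector. The values are illustrative only.

```r
# as.double() and as.numeric() give the same result.
identical(as.double("2.35"), as.numeric("2.35"))

# Text that is not a number cannot be converted: it becomes NA, and R
# raises a warning (suppressed here to keep the example quiet).
converted <- suppressWarnings(as.numeric(c("2.1", "2.3 or 2.4")))
converted
is.na(converted)

# A single NA is enough to make the mean NA as well.
mean(converted)
```

This is the `NA` mean that the challenge code warns about.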
+
+
+
+
+
+

Some basic vector functions +

+

The combine function, c(), will also append things to an +existing vector:

+
+

R +

+
+ab_vector <- c('a', 'b')
+ab_vector
+
+
+

OUTPUT +

+
[1] "a" "b"
+
+
+

R +

+
+combine_example <- c(ab_vector, 'SWC')
+combine_example
+
+
+

OUTPUT +

+
[1] "a"   "b"   "SWC"
+
+

You can also make series of numbers:

+
+

R +

+
+mySeries <- 1:10
+mySeries
+
+
+

OUTPUT +

+
 [1]  1  2  3  4  5  6  7  8  9 10
+
+
+

R +

+
+seq(10)
+
+
+

OUTPUT +

+
 [1]  1  2  3  4  5  6  7  8  9 10
+
+
+

R +

+
+seq(1,10, by=0.1)
+
+
+

OUTPUT +

+
 [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4
+[16]  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9
+[31]  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4
+[46]  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
+[61]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4
+[76]  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9
+[91] 10.0
+
+

We can ask a few questions about vectors:

+
+

R +

+
+sequence_example <- 20:25
+head(sequence_example, n=2)
+
+
+

OUTPUT +

+
[1] 20 21
+
+
+

R +

+
+tail(sequence_example, n=4)
+
+
+

OUTPUT +

+
[1] 22 23 24 25
+
+
+

R +

+
+length(sequence_example)
+
+
+

OUTPUT +

+
[1] 6
+
+
+

R +

+
+typeof(sequence_example)
+
+
+

OUTPUT +

+
[1] "integer"
+
+

We can get individual elements of a vector by using the bracket +notation:

+
+

R +

+
+first_element <- sequence_example[1]
+first_element
+
+
+

OUTPUT +

+
[1] 20
+
+

To change a single element, use the bracket on the other side of the +arrow:

+
+

R +

+
+sequence_example[1] <- 30
+sequence_example
+
+
+

OUTPUT +

+
[1] 30 21 22 23 24 25
+
+
+
+ +
+
+

Challenge 2 +

+
+

Start by making a vector with the numbers 1 through 26. Then, +multiply the vector by 2.

+
+
+
+
+
+ +
+
+
+

R +

+
+x <- 1:26
+x <- x * 2
+
+
+
+
+
+
+
+

Lists +

+

Another data structure you’ll want in your bag of tricks is the list. A list is simpler in some ways than the other types, because you can put anything you want in it. Remember that everything in a vector must be of the same basic data type, but a list can have different data types:

+
+

R +

+
+list_example <- list(1, "a", TRUE, 1+4i)
+list_example
+
+
+

OUTPUT +

+
[[1]]
+[1] 1
+
+[[2]]
+[1] "a"
+
+[[3]]
+[1] TRUE
+
+[[4]]
+[1] 1+4i
+
+

When printing the object structure with str(), we see +the data types of all elements:

+
+

R +

+
+str(list_example)
+
+
+

OUTPUT +

+
List of 4
+ $ : num 1
+ $ : chr "a"
+ $ : logi TRUE
+ $ : cplx 1+4i
+
+

What is the use of lists? They can organize data of different +types. For example, you can organize different tables that +belong together, similar to spreadsheets in Excel. But there are many +other uses, too.

+

In the next chapter, we will see another example that may surprise you.

+

To retrieve one of the elements of a list, use the double +bracket:

+
+

R +

+
+list_example[[2]]
+
+
+

OUTPUT +

+
[1] "a"
+
+
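The difference between the double and the single bracket is easy to miss, so here is a short sketch using the same `list_example` as above. The names `single` and `double` are only chosen for this illustration.

```r
list_example <- list(1, "a", TRUE, 1+4i)

single <- list_example[2]     # still a list, now of length 1
double <- list_example[[2]]   # the element itself

typeof(single)
typeof(double)
length(single)
```

The single bracket keeps the "container", while the double bracket takes the content out of it.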

The elements of a list can also have names. Names are given by prepending them to the values, separated by an equals sign:

+
+

R +

+
+another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE )
+another_list
+
+
+

OUTPUT +

+
$title
+[1] "Numbers"
+
+$numbers
+ [1]  1  2  3  4  5  6  7  8  9 10
+
+$data
+[1] TRUE
+
+

This results in a named list. The names give our object a new capability: we now have an additional way to access single elements.

+
+

R +

+
+another_list$title
+
+
+

OUTPUT +

+
[1] "Numbers"
+
+
+

Names +

+
+

With names, we can give meaning to elements. This is the first time that we have not only the data, but also information that explains it. It is metadata that can be stuck to the object like a label. In R, this is called an attribute. Some attributes enable us to do more with our object, for example, as here, accessing an element by a self-defined name.

+
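If you want to look at the label itself, base R offers `attributes()` and `attr()`. The sketch below reuses `another_list` from above and only inspects it.

```r
another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE)

# All attributes attached to the object; here there is only "names".
attributes(another_list)

# One attribute, picked out by its name.
attr(another_list, "names")

# names() is a shortcut for exactly this attribute.
identical(names(another_list), attr(another_list, "names"))
```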
+

Accessing vectors and lists by name +

+

We have already seen how to generate a named list. The way to +generate a named vector is very similar. You have seen this function +before:

+
+

R +

+
+pizza_price <- c( pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50 )
+
+

The way to retrieve elements is different, though:

+
+

R +

+
+pizza_price["pizzasubito"]
+
+
+

OUTPUT +

+
pizzasubito 
+       5.64 
+
+

The approach used for the list does not work:

+
+

R +

+
+pizza_price$pizzafresh
+
+
+

ERROR +

+
Error in pizza_price$pizzafresh: $ operator is invalid for atomic vectors
+
+

It will pay off to remember this error message, because you will meet it in your own analyses. It means that you have just tried to access an element as if it were in a list, but it is actually in a vector.

+
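A short sketch of the options that do work on a named vector, next to the one that does not. The variable `msg` is only introduced here to capture the error text without stopping the script.

```r
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)

pizza_price["pizzafresh"]     # a named vector of length 1
pizza_price[["pizzafresh"]]   # only the value, without the name

# $ fails on atomic vectors; tryCatch() turns the error into its message.
msg <- tryCatch(pizza_price$pizzafresh,
                error = function(e) conditionMessage(e))
msg
```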
+
+

Accessing and changing names +

+

If you are only interested in the names, use the names() +function:

+
+

R +

+
+names(pizza_price)
+
+
+

OUTPUT +

+
[1] "pizzasubito" "pizzafresh"  "callapizza" 
+
+

We have seen how to access and change single elements of a vector. +The same is possible for names:

+
+

R +

+
+names(pizza_price)[3]
+
+
+

OUTPUT +

+
[1] "callapizza"
+
+
+

R +

+
+names(pizza_price)[3] <- "call-a-pizza"
+pizza_price
+
+
+

OUTPUT +

+
 pizzasubito   pizzafresh call-a-pizza 
+        5.64         6.60         4.50 
+
+
+
+ +
+
+

Challenge 3 +

+
+
    +
  • What is the data type of the names of pizza_price? You +can find out using the str() or typeof() +functions.
  • +
+
+
+
+
+
+ +
+
+

You get the names of an object by wrapping the object name inside +names(...). Similarly, you get the data type of the names +by again wrapping the whole code in typeof(...):

+
typeof(names(pizza_price))
+

Alternatively, use a new variable if this is easier for you to read:

+
n <- names(pizza_price)
+typeof(n)
+
+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

Instead of just changing some of the names a vector/list already has, +you can also set all names of an object by writing code like (replace +ALL CAPS text):

+
names( OBJECT ) <-  CHARACTER_VECTOR
+

Create a vector that gives the number for each letter in the +alphabet!

+
    +
  1. Generate a vector called letter_no with the sequence of +numbers from 1 to 26!
  2. +
  3. R has a built-in object called LETTERS. It is a character vector of 26 elements, from A to Z. Set the names of the number sequence to these 26 letters.
  4. +
  5. Test yourself by calling letter_no["B"], which should +give you the number 2!
  6. +
+
+
+
+
+
+ +
+
+
letter_no <- 1:26   # or seq(1,26)
+names(letter_no) <- LETTERS
+letter_no["B"]
+
+
+
+
+
+

Data frames +

+
+

We have seen data frames at the very beginning of this lesson; they represent a table of data. We didn’t go into much detail with our example cat data frame:

+
+

R +

+
+cats
+
+
+

OUTPUT +

+
    coat weight likes_string
+1 calico    2.1         TRUE
+2  black    5.0        FALSE
+3  tabby    3.2         TRUE
+
+

We can now understand something a bit surprising in our data.frame; +what happens if we run:

+
+

R +

+
+typeof(cats)
+
+
+

OUTPUT +

+
[1] "list"
+
+

We see that data.frames look like lists ‘under the hood’. Think again +what we heard about what lists can be used for:

+
+

Lists organize data of different types

+
+

Columns of a data frame are vectors of different types, that are +organized by belonging to the same table.

+

A data.frame is really a list of vectors. It is a special list in +which all the vectors must have the same length.

+

How is this “special”-ness written into the object, so that R does +not treat it like any other list, but as a table?

+
+

R +

+
+class(cats)
+
+
+

OUTPUT +

+
[1] "data.frame"
+
+

A class, just like names, is an attribute attached +to the object. It tells us what this object means for humans.

+

You might wonder: why do we need another what-type-of-object-is-this function when we already have typeof()? That function tells us how the object is constructed in the computer. The class is the meaning of the object for humans. Consequently, what typeof() returns is fixed in R (mainly the five data types), whereas the output of class() is diverse and extendable by R packages.

+
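To make the difference between `typeof()` and `class()` concrete, here is a small sketch with three objects. The names `df`, `m` and `d` are made up for this example; the `"matrix" "array"` pair for `class(m)` is what R 4.0.0 and later report.

```r
df <- data.frame(x = 1:3)
m  <- matrix(0, nrow = 2, ncol = 2)
d  <- as.Date("2023-10-26")

# How each object is stored in the computer:
typeof(df)
typeof(m)
typeof(d)

# What each object means to us:
class(df)
class(m)
class(d)
```

A date is stored as a plain double, and only the class attribute tells R (and us) to treat it as a calendar date.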

In our cats example, we have a character, a double and a logical variable. As we have seen already, each column of a data.frame is a vector.

+
+

R +

+
+cats$coat
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+
+

R +

+
+cats[,1]
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+
+

R +

+
+typeof(cats[,1])
+
+
+

OUTPUT +

+
[1] "character"
+
+
+

R +

+
+str(cats[,1])
+
+
+

OUTPUT +

+
 chr [1:3] "calico" "black" "tabby"
+
+

Each row is an observation of different variables, itself a +data.frame, and thus can be composed of elements of different types.

+
+

R +

+
+cats[1,]
+
+
+

OUTPUT +

+
    coat weight likes_string
+1 calico    2.1         TRUE
+
+
+

R +

+
+typeof(cats[1,])
+
+
+

OUTPUT +

+
[1] "list"
+
+
+

R +

+
+str(cats[1,])
+
+
+

OUTPUT +

+
'data.frame':	1 obs. of  3 variables:
+ $ coat        : chr "calico"
+ $ weight      : num 2.1
+ $ likes_string: logi TRUE
+
+
+
+ +
+
+

Challenge 5 +

+
+

There are several subtly different ways to call variables, +observations and elements from data.frames:

+
    +
  • cats[1]
  • +
  • cats[[1]]
  • +
  • cats$coat
  • +
  • cats["coat"]
  • +
  • cats[1, 1]
  • +
  • cats[, 1]
  • +
  • cats[1, ]
  • +
+

Try out these examples and explain what is returned by each one.

+

Hint: Use the function typeof() to examine what +is returned in each case.

+
+
+
+
+
+ +
+
+
+

R +

+
+cats[1]
+
+
+

OUTPUT +

+
    coat
+1 calico
+2  black
+3  tabby
+
+

We can think of a data frame as a list of vectors. The single bracket [1] returns the first slice of the list, as another list. In this case it is the first column of the data frame.

+
+

R +

+
+cats[[1]]
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+

The double bracket [[1]] returns the contents of the list item. In this case it is the contents of the first column, a vector of type character.

+
+

R +

+
+cats$coat
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+

This example uses the $ character to address items by +name. coat is the first column of the data frame, again a +vector of type character.

+
+

R +

+
+cats["coat"]
+
+
+

OUTPUT +

+
    coat
+1 calico
+2  black
+3  tabby
+
+

Here we are using a single bracket ["coat"], replacing the index number with the column name. As in the first example, the returned object is a list.

+
+

R +

+
+cats[1, 1]
+
+
+

OUTPUT +

+
[1] "calico"
+
+

This example uses a single bracket, but this time we provide row and column coordinates. The returned object is the value in row 1, column 1. The object is a vector of type character.

+
+

R +

+
+cats[, 1]
+
+
+

OUTPUT +

+
[1] "calico" "black"  "tabby" 
+
+

Like the previous example, we use single brackets and provide row and column coordinates. The row coordinate is not specified, so R interprets this missing value as all the elements in this column and returns them as a vector.

+
+

R +

+
+cats[1, ]
+
+
+

OUTPUT +

+
    coat weight likes_string
+1 calico    2.1         TRUE
+
+

Again we use the single bracket with row and column coordinates. The column coordinate is not specified. The return value is a data frame (and therefore a list) containing all the values in the first row.

+
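The seven answers can also be compared side by side. The following sketch rebuilds a small `cats` data frame by hand (an assumption, so that the block runs on its own) and asks for the class of each result; `results` is a name introduced only for this illustration.

```r
# Hand-made copy of the example data, so the block is self-contained.
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(TRUE, FALSE, TRUE),
                   stringsAsFactors = FALSE)

results <- list(cats[1], cats[[1]], cats$coat, cats["coat"],
                cats[1, 1], cats[, 1], cats[1, ])

# One class per expression, in the order of the challenge.
sapply(results, class)
```

Three of the expressions keep the data frame around the values, and four hand back a plain character vector.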
+
+
+
+
+
+ +
+
+

Tip: Renaming data frame columns +

+
+

Data frames have column names, which can be accessed with the +names() function.

+
+

R +

+
+names(cats)
+
+
+

OUTPUT +

+
[1] "coat"         "weight"       "likes_string"
+
+

If you want to rename the second column of cats, you can +assign a new name to the second element of names(cats).

+
+

R +

+
+names(cats)[2] <- "weight_kg"
+cats
+
+
+

OUTPUT +

+
    coat weight_kg likes_string
+1 calico       2.1         TRUE
+2  black       5.0        FALSE
+3  tabby       3.2         TRUE
+
+
+
+
+
+

Matrices +

+

Last but not least is the matrix. We can declare a matrix full of +zeros:

+
+

R +

+
+matrix_example <- matrix(0, ncol=6, nrow=3)
+matrix_example
+
+
+

OUTPUT +

+
     [,1] [,2] [,3] [,4] [,5] [,6]
+[1,]    0    0    0    0    0    0
+[2,]    0    0    0    0    0    0
+[3,]    0    0    0    0    0    0
+
+

What makes it special is the dim() attribute:

+
+

R +

+
+dim(matrix_example)
+
+
+

OUTPUT +

+
[1] 3 6
+
+
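You can watch the `dim` attribute do its work by attaching it to an ordinary vector. This is only a sketch with a made-up vector `v`; in practice `matrix()` is the usual way to build a matrix.

```r
v <- 1:6
is.matrix(v)        # a plain vector to begin with

dim(v) <- c(2, 3)   # attach dimensions: 2 rows, 3 columns
is.matrix(v)
attributes(v)
v

length(v)           # still 6 elements underneath
```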

And similar to other data structures, we can ask things about our +matrix:

+
+

R +

+
+typeof(matrix_example)
+
+
+

OUTPUT +

+
[1] "double"
+
+
+

R +

+
+class(matrix_example)
+
+
+

OUTPUT +

+
[1] "matrix" "array" 
+
+
+

R +

+
+str(matrix_example)
+
+
+

OUTPUT +

+
 num [1:3, 1:6] 0 0 0 0 0 0 0 0 0 0 ...
+
+
+

R +

+
+nrow(matrix_example)
+
+
+

OUTPUT +

+
[1] 3
+
+
+

R +

+
+ncol(matrix_example)
+
+
+

OUTPUT +

+
[1] 6
+
+
+
+ +
+
+

Challenge 6 +

+
+

What do you think will be the result of +length(matrix_example)? Try it. Were you right? Why / why +not?

+
+
+
+
+
+ +
+
+


+
+

R +

+
+matrix_example <- matrix(0, ncol=6, nrow=3)
+length(matrix_example)
+
+
+

OUTPUT +

+
[1] 18
+
+

Because a matrix is a vector with added dimension attributes, +length gives you the total number of elements in the +matrix.

+
+
+
+
+
+
+ +
+
+

Challenge 7 +

+
+

Make another matrix, this time containing the numbers 1:50, with 5 +columns and 10 rows. Did the matrix function fill your +matrix by column, or by row, as its default behaviour? See if you can +figure out how to change this. (hint: read the documentation for +matrix!)

+
+
+
+
+
+ +
+
+


+
+

R +

+
+x <- matrix(1:50, ncol=5, nrow=10)
+x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row
+
+
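With 50 numbers the fill order is hard to see at a glance, so here is the same idea on a smaller matrix. The names `by_col` and `by_row` are chosen just for this sketch.

```r
by_col <- matrix(1:6, nrow = 2)                 # default: column by column
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)   # row by row

by_col
by_row

# Filling by row gives the transpose of filling the "other shape" by column.
identical(by_row, t(matrix(1:6, nrow = 3)))
```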
+
+
+
+
+
+ +
+
+

Challenge 8 +

+
+

Create a list of length two containing a character vector for each of +the sections in this part of the workshop:

+
    +
  • Data types
  • +
  • Data structures
  • +
+

Populate each character vector with the names of the data types and +data structures we’ve seen so far.

+
+
+
+
+
+ +
+
+
+

R +

+
+dataTypes <- c('double', 'complex', 'integer', 'character', 'logical')
+dataStructures <- c('data.frame', 'vector', 'list', 'matrix')
+answer <- list(dataTypes, dataStructures)
+
+

Note: it’s nice to make a list in big writing on the board or taped +to the wall listing all of these types and structures - leave it up for +the rest of the workshop to remind people of the importance of these +basics.

+
+
+
+
+
+
+ +
+
+

Challenge 9 +

+
+

Consider the R output of the matrix below:

+
+

OUTPUT +

+
     [,1] [,2]
+[1,]    4    1
+[2,]    9    5
+[3,]   10    7
+
+

What was the correct command used to write this matrix? Examine each +command and try to figure out the correct one before typing them. Think +about what matrices the other commands will produce.

+
    +
  1. matrix(c(4, 1, 9, 5, 10, 7), nrow = 3)
  2. +
  3. matrix(c(4, 9, 10, 1, 5, 7), ncol = 2, byrow = TRUE)
  4. +
  5. matrix(c(4, 9, 10, 1, 5, 7), nrow = 2)
  6. +
  7. matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)
  8. +
+
+
+
+
+
+ +
+
+


+
+

R +

+
+matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)
+
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
    +
  • Use read.csv to read tabular data in R.
  • +
  • The basic data types in R are double, integer, complex, logical, and +character.
  • +
  • Data structures such as data frames or matrices are built on top of +lists and vectors, with some added attributes.
  • +
+
+
+
+
+

Content from Exploring Data Frames

+
+

Last updated on 2023-10-26

+

Estimated time 30 minutes

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I manipulate a data frame?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • Add and remove rows or columns.
  • +
  • Append two data frames.
  • +
  • Display basic properties of data frames including size and class of +the columns, names, and first few rows.
  • +
+
+
+
+
+
+

At this point, you’ve seen it all: in the last lesson, we toured all +the basic data types and data structures in R. Everything you do will be +a manipulation of those tools. But most of the time, the star of the +show is the data frame—the table that we created by loading information +from a csv file. In this lesson, we’ll learn a few more things about +working with data frames.

+

Adding columns and rows in data frames +

+
+

We already learned that the columns of a data frame are vectors, so +that our data are consistent in type throughout the columns. As such, if +we want to add a new column, we can start by making a new vector:

+
+

R +

+
+age <- c(2, 3, 5)
+cats
+
+
+

OUTPUT +

+
    coat weight likes_string
+1 calico    2.1            1
+2  black    5.0            0
+3  tabby    3.2            1
+
+

We can then add this as a column via:

+
+

R +

+
+cbind(cats, age)
+
+
+

OUTPUT +

+
    coat weight likes_string age
+1 calico    2.1            1   2
+2  black    5.0            0   3
+3  tabby    3.2            1   5
+
+

Note that if we tried to add a vector of ages with a different number +of entries than the number of rows in the data frame, it would fail:

+
+

R +

+
+age <- c(2, 3, 5, 12)
+cbind(cats, age)
+
+
+

ERROR +

+
Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 4
+
+
+

R +

+
+age <- c(2, 3)
+cbind(cats, age)
+
+
+

ERROR +

+
Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 2
+
+

Why didn’t this work? Of course, R wants to see one element in our +new column for every row in the table:

+
+

R +

+
+nrow(cats)
+
+
+

OUTPUT +

+
[1] 3
+
+
+

R +

+
+length(age)
+
+
+

OUTPUT +

+
[1] 2
+
+

So for it to work, nrow(cats) must be equal to length(age). Let’s overwrite the content of cats with our new data frame.

+
+

R +

+
+age <- c(2, 3, 5)
+cats <- cbind(cats, age)
+
+

Now how about adding rows? We already know that the rows of a data +frame are lists:

+
+

R +

+
+newRow <- list("tortoiseshell", 3.3, TRUE, 9)
+cats <- rbind(cats, newRow)
+
+
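A list works here because its elements are matched to the columns by position. As an alternative sketch, the new row can be written as a one-row data frame; `rbind()` then matches the columns by name, so their order does not matter. The `cats` data frame below is rebuilt by hand (an assumption, so that the block runs on its own), and `new_row` and `cats2` are names made up for this example.

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1L, 0L, 1L),
                   age = c(2, 3, 5),
                   stringsAsFactors = FALSE)

# Columns deliberately given in a different order.
new_row <- data.frame(age = 9,
                      coat = "tortoiseshell",
                      likes_string = 1L,
                      weight = 3.3,
                      stringsAsFactors = FALSE)

cats2 <- rbind(cats, new_row)
cats2
sapply(cats2, typeof)   # the column types are unchanged
```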

Let’s confirm that our new row was added correctly.

+
+

R +

+
+cats
+
+
+

OUTPUT +

+
           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+
+

Removing rows +

+
+

We now know how to add rows and columns to our data frame in R. Now +let’s learn to remove rows.

+
+

R +

+
+cats
+
+
+

OUTPUT +

+
           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+
+

We can ask for a data frame minus the last row:

+
+

R +

+
+cats[-4, ]
+
+
+

OUTPUT +

+
    coat weight likes_string age
+1 calico    2.1            1   2
+2  black    5.0            0   3
+3  tabby    3.2            1   5
+
+

Notice the comma with nothing after it to indicate that we want to +drop the entire fourth row.

+

Note: we could also remove several rows at once by putting the row +numbers inside of a vector, for example: +cats[c(-3,-4), ]

+
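Two details are worth trying out: the minus sign can go in front of the whole vector, and the original data frame stays untouched until you assign the result. The sketch uses a cut-down, hand-made `cats` data frame (an assumption); `kept` is a name made up for the example.

```r
cats <- data.frame(coat = c("calico", "black", "tabby", "tortoiseshell"),
                   age = c(2, 3, 5, 9),
                   stringsAsFactors = FALSE)

# Both spellings drop rows 3 and 4.
identical(cats[c(-3, -4), ], cats[-c(3, 4), ])

kept <- cats[-c(3, 4), ]
nrow(kept)
nrow(cats)   # cats itself still has all four rows
```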

Removing columns +

+
+

We can also remove columns from our data frame. What if we want to remove the column “age”? We can do so in two ways: by column number or by column name.

+
+

R +

+
+cats[,-4]
+
+
+

OUTPUT +

+
           coat weight likes_string
+1        calico    2.1            1
+2         black    5.0            0
+3         tabby    3.2            1
+4 tortoiseshell    3.3            1
+
+

Notice the comma with nothing before it, indicating we want to keep +all of the rows.

+

Alternatively, we can drop the column by using the column name and the %in% operator. The %in% operator goes through each element of its left argument, in this case the names of cats, and asks, “Does this element occur in the second argument?”

+
+

R +

+
+drop <- names(cats) %in% c("age")
+cats[,!drop]
+
+
+

OUTPUT +

+
           coat weight likes_string
+1        calico    2.1            1
+2         black    5.0            0
+3         tabby    3.2            1
+4 tortoiseshell    3.3            1
+
+

We will cover subsetting with logical operators like +%in% in more detail in the next episode. See the section Subsetting through other logical +operations

+
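The same pattern extends to several columns at once. The sketch below uses a hand-made `cats` data frame (an assumption) and the made-up names `to_drop` and `remaining`; it selects with a single bracket and no comma, which, as in the bracket examples of the previous episode, always gives back a data frame.

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   age = c(2, 3, 5),
                   stringsAsFactors = FALSE)

# One TRUE/FALSE answer per column name.
to_drop <- names(cats) %in% c("weight", "age")
to_drop

# Keep every column that is NOT in the drop list.
remaining <- cats[!to_drop]
names(remaining)
```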

Appending to a data frame +

+
+

The key to remember when adding data to a data frame is that +columns are vectors and rows are lists. We can also glue two +data frames together with rbind:

+
+

R +

+
+cats <- rbind(cats, cats)
+cats
+
+
+

OUTPUT +

+
           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+5        calico    2.1            1   2
+6         black    5.0            0   3
+7         tabby    3.2            1   5
+8 tortoiseshell    3.3            1   9
+
+

But now the row names are unnecessarily complicated. We can remove +the rownames, and R will automatically re-name them sequentially:

+
+

R +

+
+rownames(cats) <- NULL
+cats
+
+
+

OUTPUT +

+
           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+5        calico    2.1            1   2
+6         black    5.0            0   3
+7         tabby    3.2            1   5
+8 tortoiseshell    3.3            1   9
+
+
+
+ +
+
+

Challenge 1 +

+
+

You can create a new data frame right from within R with the +following syntax:

+
+

R +

+
+df <- data.frame(id = c("a", "b", "c"),
+                 x = 1:3,
+                 y = c(TRUE, TRUE, FALSE))
+
+

Make a data frame that holds the following information for +yourself:

+
    +
  • first name
  • +
  • last name
  • +
  • lucky number
  • +
+

Then use rbind to add an entry for the people sitting +beside you. Finally, use cbind to add a column with each +person’s answer to the question, “Is it time for coffee break?”

+
+
+
+
+
+ +
+
+
+

R +

+
+df <- data.frame(first = c("Grace"),
+                 last = c("Hopper"),
+                 lucky_number = c(0))
+df <- rbind(df, list("Marie", "Curie", 238) )
+df <- cbind(df, coffeetime = c(TRUE,TRUE))
+
+
+
+
+
+

Realistic example +

+
+

So far, you have seen the basics of manipulating data frames with our +cat data; now let’s use those skills to digest a more realistic dataset. +Let’s read in the gapminder dataset that we downloaded +previously:

+
+

R +

+
+gapminder <- read.csv("data/gapminder_data.csv")
+
+
+
+ +
+
+

Miscellaneous Tips +

+
+
    +
  • Another type of file you might encounter is the tab-separated value file (.tsv). To specify a tab as the separator, use sep = "\t" in read.csv(), or use read.delim().

  • +
  • Files can also be downloaded directly from the Internet into a +local folder of your choice onto your computer using the +download.file function. The read.csv function +can then be executed to read the downloaded file from the download +location, for example,

  • +
+
+

R +

+
+download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
+gapminder <- read.csv("data/gapminder_data.csv")
+
+
    +
  • Alternatively, you can also read in files directly into R from the +Internet by replacing the file paths with a web address in +read.csv. One should note that in doing this no local copy +of the csv file is first saved onto your computer. For example,
  • +
+
+

R +

+
+gapminder <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv")
+
+
    +
  • You can read directly from excel spreadsheets without converting +them to plain text first by using the readxl +package.

  • +
  • The argument “stringsAsFactors” can be useful to tell R how to read strings, either as factors or as character strings. In R 4.0.0 and later, all strings are read in as characters by default, but in earlier versions of R, strings are read in as factors by default. For more information, see the call-out in the previous episode.

  • +
+
+
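The first tip, about tab-separated files, can be tried without downloading anything. The sketch below writes a tiny, made-up table to a temporary file and reads it back in both ways; the file contents and the names `tsv_file`, `from_delim` and `from_csv` are purely illustrative.

```r
# Write a small tab-separated file to a temporary location.
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c("name\tvalue",
             "a\t1",
             "b\t2"), tsv_file)

# Two ways to read it.
from_delim <- read.delim(tsv_file)
from_csv   <- read.csv(tsv_file, sep = "\t")

identical(from_delim, from_csv)
str(from_delim)

unlink(tsv_file)   # clean up the temporary file
```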
+
+

Let’s investigate gapminder a bit; the first thing we should always +do is check out what the data looks like with str:

+
+

R +

+
+str(gapminder)
+
+
+

OUTPUT +

+
'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...
+
+

An additional method for examining the structure of gapminder is to use the summary function. This function can be used on various objects in R. For data frames, summary yields a numeric, tabular, or descriptive summary of each column. Numeric or integer columns are described by descriptive statistics (quartiles and mean), and character columns by their length, class, and mode.

+
+

R +

+
+summary(gapminder)
+
+
+

OUTPUT +

+
   country               year           pop             continent        
+ Length:1704        Min.   :1952   Min.   :6.001e+04   Length:1704       
+ Class :character   1st Qu.:1966   1st Qu.:2.794e+06   Class :character  
+ Mode  :character   Median :1980   Median :7.024e+06   Mode  :character  
+                    Mean   :1980   Mean   :2.960e+07                     
+                    3rd Qu.:1993   3rd Qu.:1.959e+07                     
+                    Max.   :2007   Max.   :1.319e+09                     
+    lifeExp        gdpPercap       
+ Min.   :23.60   Min.   :   241.2  
+ 1st Qu.:48.20   1st Qu.:  1202.1  
+ Median :60.71   Median :  3531.8  
+ Mean   :59.47   Mean   :  7215.3  
+ 3rd Qu.:70.85   3rd Qu.:  9325.5  
+ Max.   :82.60   Max.   :113523.1  
+
+

Along with the str and summary functions, +we can examine individual columns of the data frame with our +typeof function:

+
+

R +

+
+typeof(gapminder$year)
+
+
+

OUTPUT +

+
[1] "integer"
+
+
+

R +

+
+typeof(gapminder$country)
+
+
+

OUTPUT +

+
[1] "character"
+
+
+

R +

+
+str(gapminder$country)
+
+
+

OUTPUT +

+
 chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+
+

We can also interrogate the data frame for information about its +dimensions; remembering that str(gapminder) said there were +1704 observations of 6 variables in gapminder, what do you think the +following will produce, and why?

+
+

R +

+
+length(gapminder)
+
+
+

OUTPUT +

+
[1] 6
+
+

A fair guess would have been to say that the length of a data frame +would be the number of rows it has (1704), but this is not the case; +remember, a data frame is a list of vectors and factors:

+
+

R +

+
+typeof(gapminder)
+
+
+

OUTPUT +

+
[1] "list"
+
+

When length gave us 6, it’s because gapminder is built +out of a list of 6 columns. To get the number of rows and columns in our +dataset, try:

+
+

R +

+
+nrow(gapminder)
+
+
+

OUTPUT +

+
[1] 1704
+
+
+

R +

+
+ncol(gapminder)
+
+
+

OUTPUT +

+
[1] 6
+
+

Or, both at once:

+
+

R +

+
+dim(gapminder)
+
+
+

OUTPUT +

+
[1] 1704    6
+
+
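If you do not have gapminder loaded, the relationship between these functions can be checked on any small data frame. The `df` below is made up for this sketch.

```r
df <- data.frame(id = c("a", "b", "c", "d"), x = 1:4)

length(df)   # number of columns, because a data frame is a list of columns
nrow(df)
ncol(df)
dim(df)      # rows first, then columns

identical(dim(df), c(nrow(df), ncol(df)))
```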

We’ll also likely want to know what the titles of all the columns +are, so we can ask for them later:

+
+

R +

+
+colnames(gapminder)
+
+
+

OUTPUT +

+
[1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"
+
+

At this stage, it’s important to ask ourselves if the structure R is +reporting matches our intuition or expectations; do the basic data types +reported for each column make sense? If not, we need to sort any +problems out now before they turn into bad surprises down the road, +using what we’ve learned about how R interprets data, and the importance +of strict consistency in how we record our data.

+

Once we’re happy that the data types and structures seem reasonable, +it’s time to start digging into our data proper. Check out the first few +lines:

+
+

R +

+
+head(gapminder)
+
+
+

OUTPUT +

+
      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134
+
+
+
+ +
+
+

Challenge 2 +

+
+

It’s good practice to also check the last few lines of your data and +some in the middle. How would you do this?

+

Searching for ones specifically in the middle isn’t too hard, but we +could ask for a few lines at random. How would you code this?

+
+
+
+
+
+ +
+
+

To check the last few lines it’s relatively simple as R already has a +function for this:

+
+

R +

+
+tail(gapminder)
+tail(gapminder, n = 15)
+
+

What about a few arbitrary rows just in case something is odd in the +middle?

+
+

Tip: There are several ways to achieve this. +

+

The solution here uses nested function calls, i.e. the result of one function call is passed as an argument to another function. This might sound like a new concept, but you are already using it! Remember that my_dataframe[rows, cols] prints the rows and columns of your data frame that you asked for (you might have asked for a range of rows or for named columns, for example). How would you get the last row if you don’t know how many rows your data frame has? R has a function for this. What about getting a (pseudorandom) sample? R also has a function for this.

+
+

R +

+
+gapminder[sample(nrow(gapminder), 5), ]
+
+
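Because the rows are drawn at random, you will get different rows each time. If you want the same "random" rows again later, you can fix the starting point of the random number generator with `set.seed()`. The sketch uses a made-up data frame `df`, and the seed value 42 is arbitrary.

```r
df <- data.frame(id = 1:20, value = (1:20)^2)

set.seed(42)                      # arbitrary seed, chosen for this example
rows <- sample(nrow(df), 5)       # 5 distinct row numbers between 1 and 20
first_draw <- df[rows, ]

set.seed(42)                      # same seed again ...
second_draw <- df[sample(nrow(df), 5), ]

identical(first_draw, second_draw)   # ... same rows again
```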
+
+
+
+
+

To make sure our analysis is reproducible, we should put the code +into a script file so we can come back to it later.

+
+
+ +
+
+

Challenge 3 +

+
+

Go to File -> New File -> R Script, and write an R script to load in the gapminder dataset. Put it in the scripts/ directory and add it to version control.

+

Run the script using the source function, using the file +path as its argument (or by pressing the “source” button in +RStudio).

+
+
+
+
+
+ +
+
+

The source function can be used to run one script from within another. Suppose you load the same type of file over and over again, each time specifying the same arguments to suit that file. Instead of writing those arguments out repeatedly, you can write the loading code once and save it as a script. Then, call source("Your_Script_containing_the_load_function") in a new script to reuse that code without writing everything again. Check out ?source to find out more.

+
+

R +

+
+download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
+gapminder <- read.csv(file = "data/gapminder_data.csv")
+
+

To run the script and load the data into the gapminder +variable:

+
+

R +

+
+source(file = "scripts/load-gapminder.R")
+
+
+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

Read the output of str(gapminder) again; this time, use +what you’ve learned about lists and vectors, as well as the output of +functions like colnames and dim to explain +what everything that str prints out for gapminder means. If +there are any parts you can’t interpret, discuss with your +neighbors!

+
+
+
+
+
+ +
+
+

The object gapminder is a data frame with columns

+
    +
  • +country and continent are character +strings.
  • +
  • +year is an integer vector.
  • +
  • +pop, lifeExp, and gdpPercap +are numeric vectors.
  • +
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
    +
  • Use cbind() to add a new column to a data frame.
  • +
  • Use rbind() to add a new row to a data frame.
  • +
  • Remove rows from a data frame.
  • +
  • Use str(), summary(), nrow(), +ncol(), dim(), colnames(), +rownames(), head(), and typeof() +to understand the structure of a data frame.
  • +
  • Read in a csv file using read.csv().
  • +
  • Understand what length() of a data frame +represents.
  • +
+
+
+
+

Content from Subsetting Data

+
+

Last updated on 2023-10-26

+

Estimated time 50 minutes

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I work with subsets of data in R?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • To be able to subset vectors, factors, matrices, lists, and data +frames
  • +
  • To be able to extract individual and multiple elements: by index, by +name, using comparison operations
  • +
  • To be able to skip and remove elements from various data +structures.
  • +
+
+
+
+
+
+

R has many powerful subset operators. Mastering them will allow you +to easily perform complex operations on any kind of dataset.

+

There are six different ways we can subset any kind of object, and +three different subsetting operators for the different data +structures.

+

Let’s start with the workhorse of R: a simple numeric vector.

+
+

R +

+
+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')
+x
+
+
+

OUTPUT +

+
  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 
+
+
+
+ +
+
+

Atomic vectors +

+
+

In R, simple vectors containing character strings, numbers, or +logical values are called atomic vectors because they can’t be +further simplified.

+
+
+
+

So now that we’ve created a dummy vector to play with, how do we get +at its contents?

+

Accessing elements using their indices +

+
+

To extract elements of a vector we can give their corresponding +index, starting from one:

+
+

R +

+
+x[1]
+
+
+

OUTPUT +

+
  a 
+5.4 
+
+
+

R +

+
+x[4]
+
+
+

OUTPUT +

+
  d 
+4.8 
+
+

It may look different, but the square brackets operator is a +function. For vectors (and matrices), it means “get me the nth +element”.

+

We can ask for multiple elements at once:

+
+

R +

+
+x[c(1, 3)]
+
+
+

OUTPUT +

+
  a   c 
+5.4 7.1 
+
+

Or slices of the vector:

+
+

R +

+
+x[1:4]
+
+
+

OUTPUT +

+
  a   b   c   d 
+5.4 6.2 7.1 4.8 
+
+

The : operator creates a sequence of numbers from the left element to the right.

+
+

R +

+
+1:4
+
+
+

OUTPUT +

+
[1] 1 2 3 4
+
+
+

R +

+
+c(1, 2, 3, 4)
+
+
+

OUTPUT +

+
[1] 1 2 3 4
+
+

We can ask for the same element multiple times:

+
+

R +

+
+x[c(1,1,3)]
+
+
+

OUTPUT +

+
  a   a   c 
+5.4 5.4 7.1 
+
+

If we ask for an index beyond the length of the vector, R will return +a missing value:

+
+

R +

+
+x[6]
+
+
+

OUTPUT +

+
<NA> 
+  NA 
+
+

This is a vector of length one containing an NA, whose +name is also NA.

+

If we ask for the 0th element, we get an empty vector:

+
+

R +

+
+x[0]
+
+
+

OUTPUT +

+
named numeric(0)
+
+
+
+ +
+
+

Vector numbering in R starts at 1 +

+
+

In many programming languages (C and Python, for example), the first +element of a vector has an index of 0. In R, the first element is 1.

+
+
+
+

Skipping and removing elements +

+
+

If we use a negative number as the index of a vector, R will return +every element except for the one specified:

+
+

R +

+
+x[-2]
+
+
+

OUTPUT +

+
  a   c   d   e 
+5.4 7.1 4.8 7.5 
+
+

We can skip multiple elements:

+
+

R +

+
+x[c(-1, -5)]  # or x[-c(1,5)]
+
+
+

OUTPUT +

+
  b   c   d 
+6.2 7.1 4.8 
+
+
+
+ +
+
+

Tip: Order of operations +

+
+

A common trip up for novices occurs when trying to skip slices of a +vector. It’s natural to try to negate a sequence like so:

+
+

R +

+
+x[-1:3]
+
+

This gives a somewhat cryptic error:

+
+

ERROR +

+
Error in x[-1:3]: only 0's may be mixed with negative subscripts
+
+

But remember the order of operations. : is really a function, and the unary minus is applied before it. So : takes -1 as its first argument and 3 as its second, and generates the sequence of numbers: c(-1, 0, 1, 2, 3).

+

The correct solution is to wrap that function call in brackets, so +that the - operator applies to the result:

+
+

R +

+
+x[-(1:3)]
+
+
+

OUTPUT +

+
  d   e 
+4.8 7.5 
+
+
+
+
+

To remove elements from a vector, we need to assign the result back +into the variable:

+
+

R +

+
+x <- x[-4]
+x
+
+
+

OUTPUT +

+
  a   b   c   e 
+5.4 6.2 7.1 7.5 
+
+
+
+ +
+
+

Challenge 1 +

+
+

Given the following code:

+
+

R +

+
+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')
+print(x)
+
+
+

OUTPUT +

+
  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 
+
+

Come up with at least 2 different commands that will produce the +following output:

+
+

OUTPUT +

+
  b   c   d 
+6.2 7.1 4.8 
+
+

After you find 2 different commands, compare notes with your +neighbour. Did you have different strategies?

+
+
+
+
+
+ +
+
+
+

R +

+
+x[2:4]
+
+
+

OUTPUT +

+
  b   c   d 
+6.2 7.1 4.8 
+
+
+

R +

+
+x[-c(1,5)]
+
+
+

OUTPUT +

+
  b   c   d 
+6.2 7.1 4.8 
+
+
+

R +

+
+x[c(2,3,4)]
+
+
+

OUTPUT +

+
  b   c   d 
+6.2 7.1 4.8 
+
+
+
+
+
+

Subsetting by name +

+
+

We can extract elements by using their name, instead of extracting by +index:

+
+

R +

+
+x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we can name a vector 'on the fly'
+x[c("a", "c")]
+
+
+

OUTPUT +

+
  a   c 
+5.4 7.1 
+
+

This is usually a much more reliable way to subset objects: the +position of various elements can often change when chaining together +subsetting operations, but the names will always remain the same!

+

Subsetting through other logical operations +

+
+

We can also use any logical vector to subset:

+
+

R +

+
+x[c(FALSE, FALSE, TRUE, FALSE, TRUE)]
+
+
+

OUTPUT +

+
  c   e 
+7.1 7.5 
+
+

Since comparison operators (e.g. >, +<, ==) evaluate to logical vectors, we can +also use them to succinctly subset vectors: the following statement +gives the same result as the previous one.

+
+

R +

+
+x[x > 7]
+
+
+

OUTPUT +

+
  c   e 
+7.1 7.5 
+
+

Breaking it down, this statement first evaluates x>7, +generating a logical vector +c(FALSE, FALSE, TRUE, FALSE, TRUE), and then selects the +elements of x corresponding to the TRUE +values.

+

We can use == to mimic the previous method of indexing +by name (remember you have to use == rather than += for comparisons):

+
+

R +

+
+x[names(x) == "a"]
+
+
+

OUTPUT +

+
  a 
+5.4 
+
+
+
+ +
+
+

Tip: Combining logical conditions +

+
+

We often want to combine multiple logical criteria. For example, we +might want to find all the countries that are located in Asia +or Europe and have life expectancies +within a certain range. Several operations for combining logical vectors +exist in R:

+
    +
  • +&, the “logical AND” operator: returns +TRUE if both the left and right are TRUE.
  • +
  • +|, the “logical OR” operator: returns +TRUE, if either the left or right (or both) are +TRUE.
  • +
+

You may sometimes see && and || instead of & and |. These two-character operators expect a single logical value on each side: older versions of R looked only at the first element of each vector and ignored the rest, while R 4.3.0 and later raise an error if either side has more than one element. In general you should not use the two-character operators in data analysis; save them for programming, i.e. deciding whether to execute a statement.

+
    +
  • !, the “logical NOT” operator: converts TRUE to FALSE and FALSE to TRUE. It can negate a single logical condition (e.g. !TRUE becomes FALSE), or a whole vector of conditions (e.g. !c(TRUE, FALSE) becomes c(FALSE, TRUE)).
  • +
+

Additionally, you can compare the elements within a single vector +using the all function (which returns TRUE if +every element of the vector is TRUE) and the +any function (which returns TRUE if one or +more elements of the vector are TRUE).
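The operators described in this tip can be tried on the named vector used throughout this episode. This is only an illustrative sketch; the thresholds (5, 7, 4) and the helper name `between` are arbitrary choices, not part of the lesson.

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# Element-wise AND: values strictly between 5 and 7
between <- x > 5 & x < 7
x[between]

# Element-wise OR: values below 5 or above 7
x[x < 5 | x > 7]

# NOT flips every element of the logical vector
x[!between]

# any() and all() collapse a logical vector to a single value
any(x > 7)   # is at least one value above 7?
all(x > 4)   # are all values above 4?
```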

+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

Given the following code:

+
+

R +

+
+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')
+print(x)
+
+
+

OUTPUT +

+
  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 
+
+

Write a subsetting command to return the values in x that are greater +than 4 and less than 7.

+
+
+
+
+
+ +
+
+
+

R +

+
+x_subset <- x[x<7 & x>4]
+print(x_subset)
+
+
+

OUTPUT +

+
  a   b   d 
+5.4 6.2 4.8 
+
+
+
+
+
+
+
+ +
+
+

Tip: Non-unique names +

+
+

You should be aware that it is possible for multiple elements in a +vector to have the same name. (For a data frame, columns can have the +same name — although R tries to avoid this — but row names must be +unique.) Consider these examples:

+
+

R +

+
+x <- 1:3
+x
+
+
+

OUTPUT +

+
[1] 1 2 3
+
+
+

R +

+
+names(x) <- c('a', 'a', 'a')
+x
+
+
+

OUTPUT +

+
a a a 
+1 2 3 
+
+
+

R +

+
+x['a']  # only returns first value
+
+
+

OUTPUT +

+
a 
+1 
+
+
+

R +

+
+x[names(x) == 'a']  # returns all three values
+
+
+

OUTPUT +

+
a a a 
+1 2 3 
+
+
+
+
+
+
+ +
+
+

Tip: Getting help for operators +

+
+

Remember you can search for help on operators by wrapping them in +quotes: help("%in%") or ?"%in%".

+
+
+
+

Skipping named elements +

+
+

Skipping or removing named elements is a little harder. If we try to +skip one named element by negating the string, R complains (slightly +obscurely) that it doesn’t know how to take the negative of a +string:

+
+

R +

+
+x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we start again by naming a vector 'on the fly'
+x[-"a"]
+
+
+

ERROR +

+
Error in -"a": invalid argument to unary operator
+
+

However, we can use the != (not-equals) operator to +construct a logical vector that will do what we want:

+
+

R +

+
+x[names(x) != "a"]
+
+
+

OUTPUT +

+
  b   c   d   e 
+6.2 7.1 4.8 7.5 
+
+

Skipping multiple named indices is a little bit harder still. Suppose +we want to drop the "a" and "c" elements, so +we try this:

+
+

R +

+
+x[names(x)!=c("a","c")]
+
+
+

WARNING +

+
Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length
+
+
+

OUTPUT +

+
  b   c   d   e 
+6.2 7.1 4.8 7.5 
+
+

R did something, but it gave us a warning that we ought to +pay attention to - and it apparently gave us the wrong answer +(the "c" element is still included in the vector)!

+

So what does != actually do in this case? That’s an +excellent question.

+
+

Recycling +

+

Let’s take a look at the comparison component of this code:

+
+

R +

+
+names(x) != c("a", "c")
+
+
+

WARNING +

+
Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length
+
+
+

OUTPUT +

+
[1] FALSE  TRUE  TRUE  TRUE  TRUE
+
+

Why does R give TRUE as the third element of this +vector, when names(x)[3] != "c" is obviously false? When +you use !=, R tries to compare each element of the left +argument with the corresponding element of its right argument. What +happens when you compare vectors of different lengths?

+
[Figure: inequality testing]

When one vector is shorter than the other, it gets +recycled:

+
[Figure: inequality testing, showing the results of recycling]

In this case R repeats c("a", "c") as many times as necessary to match names(x), i.e. we get c("a","c","a","c","a"). Since the recycled "a" doesn’t match the third element of names(x), the value of != is TRUE. Because in this case the longer vector length (5) isn’t a multiple of the shorter vector length (2), R printed a warning message. If we had been unlucky and names(x) had contained six elements, R would silently have done the wrong thing (i.e., not what we intended it to do). This recycling rule can introduce subtle, hard-to-find bugs!
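To see the silent failure described above, here is a small sketch with a hypothetical six-element vector `y` (not part of the lesson data). Because 6 is a multiple of 2, R recycles `c("a", "c")` without any warning, and the "c" element survives.

```r
# A hypothetical six-element vector: 6 is a multiple of 2, so no warning
y <- c(a = 1, b = 2, c = 3, d = 4, e = 5, f = 6)

# c("a", "c") is recycled to c("a", "c", "a", "c", "a", "c")
keep <- names(y) != c("a", "c")
keep

# "c" is still present, because it was compared with the recycled "a"
y[keep]
```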

+

The way to get R to do what we really want (match each element of the left argument with all of the elements of the right argument) is to use the %in% operator. The %in% operator goes through each element of its left argument, in this case the names of x, and asks, “Does this element occur in the second argument?”. Here, since we want to exclude values, we also need a ! operator to change “in” to “not in”:

+
+

R +

+
+x[! names(x) %in% c("a","c") ]
+
+
+

OUTPUT +

+
  b   d   e 
+6.2 4.8 7.5 
+
+
+
+ +
+
+

Challenge 3 +

+
+

Selecting elements of a vector that match any of a list of components +is a very common data analysis task. For example, the gapminder data set +contains country and continent variables, but +no information between these two scales. Suppose we want to pull out +information from southeast Asia: how do we set up an operation to +produce a logical vector that is TRUE for all of the +countries in southeast Asia and FALSE otherwise?

+

Suppose you have these data:

+
+

R +

+
+seAsia <- c("Myanmar","Thailand","Cambodia","Vietnam","Laos")
+## read in the gapminder data that we downloaded in episode 2
+gapminder <- read.csv("data/gapminder_data.csv", header=TRUE)
+## extract the `country` column from a data frame (we'll see this later);
+## convert from a factor to a character;
+## and get just the non-repeated elements
+countries <- unique(as.character(gapminder$country))
+
+

There’s a wrong way (using only ==), which will give you +a warning; a clunky way (using the logical operators == and +|); and an elegant way (using %in%). See +whether you can come up with all three and explain how they (don’t) +work.

+
+
+
+
+
+ +
+
+
    +
  • The wrong way to do this problem is countries==seAsia. This gives a warning ("In countries == seAsia : longer object length is not a multiple of shorter object length") and the wrong answer (a vector of all FALSE values), because none of the recycled values of seAsia happen to line up correctly with matching values in countries.
  • +
  • The clunky (but technically correct) way to do this +problem is
  • +
+
+

R +

+
+ (countries=="Myanmar" | countries=="Thailand" |
+ countries=="Cambodia" | countries == "Vietnam" | countries=="Laos")
+
+

(or countries==seAsia[1] | countries==seAsia[2] | ...). +This gives the correct values, but hopefully you can see how awkward it +is (what if we wanted to select countries from a much longer list?).

+
    +
  • The best way to do this problem is +countries %in% seAsia, which is both correct and easy to +type (and read).
  • +
+
+
+
+
+
+

Handling special values +

+
+

At some point you will encounter functions in R that cannot handle +missing, infinite, or undefined data.

+

There are a number of special functions you can use to filter out +this data:

+
    +
  • +is.na will return all positions in a vector, matrix, or +data.frame containing NA (or NaN)
  • +
  • likewise, is.nan and is.infinite will do the same for NaN and for Inf (or -Inf).
  • +
  • +is.finite will return all positions in a vector, +matrix, or data.frame that do not contain NA, +NaN or Inf.
  • +
  • +na.omit will filter out all missing values from a +vector
  • +
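The functions listed above can be compared side by side on a small vector. This is a minimal sketch; the vector `v` and the result names are made up for illustration.

```r
v <- c(1.5, NA, Inf, NaN, -2)

is.na(v)        # TRUE for NA and NaN
is.nan(v)       # TRUE only for NaN
is.infinite(v)  # TRUE for Inf and -Inf
is.finite(v)    # TRUE only for ordinary numbers

# Keep only the finite values by using the logical vector as an index
finite_v <- v[is.finite(v)]

# na.omit() drops NA and NaN but keeps Inf
no_missing <- na.omit(v)
```

Note that these `is.*` functions return logical vectors, so they are normally used inside `[` rather than on their own.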

Factor subsetting +

+
+

Now that we’ve explored the different ways to subset vectors, how do +we subset the other data structures?

+

Factor subsetting works the same way as vector subsetting.

+
+

R +

+
+f <- factor(c("a", "a", "b", "c", "c", "d"))
+f[f == "a"]
+
+
+

OUTPUT +

+
[1] a a
+Levels: a b c d
+
+
+

R +

+
+f[f %in% c("b", "c")]
+
+
+

OUTPUT +

+
[1] b c c
+Levels: a b c d
+
+
+

R +

+
+f[1:3]
+
+
+

OUTPUT +

+
[1] a a b
+Levels: a b c d
+
+

Skipping elements will not remove the level even if no more of that +category exists in the factor:

+
+

R +

+
+f[-3]
+
+
+

OUTPUT +

+
[1] a a c c d
+Levels: a b c d
+
+
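If the unused level is unwanted, base R's `droplevels()` can remove it after subsetting. The sketch below reuses the factor `f` from above; the name `g` is just an illustrative choice.

```r
f <- factor(c("a", "a", "b", "c", "c", "d"))

# Removing the only "b" still leaves "b" among the levels
g <- f[-3]
levels(g)

# droplevels() discards levels that no longer occur in the data
levels(droplevels(g))
```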

Matrix subsetting +

+
+

Matrices are also subsetted using the [ function. In +this case it takes two arguments: the first applying to the rows, the +second to its columns:

+
+

R +

+
+set.seed(1)
+m <- matrix(rnorm(6*4), ncol=4, nrow=6)
+m[3:4, c(3,1)]
+
+
+

OUTPUT +

+
            [,1]       [,2]
+[1,]  1.12493092 -0.8356286
+[2,] -0.04493361  1.5952808
+
+

You can leave the first or second arguments blank to retrieve all the +rows or columns respectively:

+
+

R +

+
+m[, c(3,4)]
+
+
+

OUTPUT +

+
            [,1]        [,2]
+[1,] -0.62124058  0.82122120
+[2,] -2.21469989  0.59390132
+[3,]  1.12493092  0.91897737
+[4,] -0.04493361  0.78213630
+[5,] -0.01619026  0.07456498
+[6,]  0.94383621 -1.98935170
+
+

If we only access one row or column, R will automatically convert the +result to a vector:

+
+

R +

+
+m[3,]
+
+
+

OUTPUT +

+
[1] -0.8356286  0.5757814  1.1249309  0.9189774
+
+

If you want to keep the output as a matrix, you need to specify a third argument, drop = FALSE:

+
+

R +

+
+m[3, , drop=FALSE]
+
+
+

OUTPUT +

+
           [,1]      [,2]     [,3]      [,4]
+[1,] -0.8356286 0.5757814 1.124931 0.9189774
+
+

Unlike vectors, if we try to access a row or column outside of the +matrix, R will throw an error:

+
+

R +

+
+m[, c(3,6)]
+
+
+

ERROR +

+
Error in m[, c(3, 6)]: subscript out of bounds
+
+
+
+ +
+
+

Tip: Higher dimensional arrays +

+
+

When dealing with multi-dimensional arrays, each argument to [ corresponds to a dimension. For example, in a 3D array, the first three arguments correspond to the rows, columns, and depth dimension.
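As a short sketch of this, the following builds a hypothetical 2 x 3 x 4 array (the name `a` and its contents are invented for illustration) and subsets it with one index per dimension.

```r
# A hypothetical 2 x 3 x 4 array filled with the numbers 1 to 24
a <- array(1:24, dim = c(2, 3, 4))

# One index per dimension: row 2, column 3, layer 4
a[2, 3, 4]

# Leaving an index blank keeps that whole dimension:
# all rows and columns of the first layer (a 2 x 3 matrix)
a[, , 1]
```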

+
+
+
+

Because matrices are vectors, we can also subset using only one +argument:

+
+

R +

+
+m[5]
+
+
+

OUTPUT +

+
[1] 0.3295078
+
+

This usually isn’t useful, and is often confusing to read. However, it is useful to note that matrices are laid out in column-major format by default. That is, the elements of the vector are arranged column-wise:

+
+

R +

+
+matrix(1:6, nrow=2, ncol=3)
+
+
+

OUTPUT +

+
     [,1] [,2] [,3]
+[1,]    1    3    5
+[2,]    2    4    6
+
+

If you wish to populate the matrix by row, use +byrow=TRUE:

+
+

R +

+
+matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
+
+
+

OUTPUT +

+
     [,1] [,2] [,3]
+[1,]    1    2    3
+[2,]    4    5    6
+
+

Matrices can also be subsetted using their rownames and column names +instead of their row and column indices.
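A minimal sketch of name-based matrix subsetting follows. The row and column names ("r1", "A", and so on) are hypothetical and chosen only for this example.

```r
# Hypothetical row and column names, chosen only for illustration
m <- matrix(1:6, nrow = 2, ncol = 3,
            dimnames = list(c("r1", "r2"), c("A", "B", "C")))

# Select by name instead of by position
m["r2", c("A", "C")]

# Names and positions can be mixed across dimensions
m[1, "B"]
```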

+
+
+ +
+
+

Challenge 4 +

+
+

Given the following code:

+
+

R +

+
+m <- matrix(1:18, nrow=3, ncol=6)
+print(m)
+
+
+

OUTPUT +

+
     [,1] [,2] [,3] [,4] [,5] [,6]
+[1,]    1    4    7   10   13   16
+[2,]    2    5    8   11   14   17
+[3,]    3    6    9   12   15   18
+
+
    +
  1. Which of the following commands will extract the values 11 and +14?
  2. +
+

A. m[2,4,2,5]

+

B. m[2:5]

+

C. m[4:5,2]

+

D. m[2,c(4,5)]

+
+
+
+
+
+ +
+
+

D

+
+
+
+
+

List subsetting +

+
+

Now we’ll introduce some new subsetting operators. There are three +functions used to subset lists. We’ve already seen these when learning +about atomic vectors and matrices: [, [[, and +$.

+

Using [ will always return a list. If you want to +subset a list, but not extract an element, then you +will likely use [.

+
+

R +

+
+xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))
+xlist[1]
+
+
+

OUTPUT +

+
$a
+[1] "Software Carpentry"
+
+

This returns a list with one element.

+

We can subset elements of a list exactly the same way as atomic vectors using [. Comparison operations, however, won’t work because they’re not recursive: they try to condition on the data structures in each element of the list, not on the individual elements within those data structures.

+
+

R +

+
+xlist[1:2]
+
+
+

OUTPUT +

+
$a
+[1] "Software Carpentry"
+
+$b
+ [1]  1  2  3  4  5  6  7  8  9 10
+
+

To extract individual elements of a list, you need to use the +double-square bracket function: [[.

+
+

R +

+
+xlist[[1]]
+
+
+

OUTPUT +

+
[1] "Software Carpentry"
+
+

Notice that now the result is a vector, not a list.

+

You can’t extract more than one element at once:

+
+

R +

+
+xlist[[1:2]]
+
+
+

ERROR +

+
Error in xlist[[1:2]]: subscript out of bounds
+
+

Nor use it to skip elements:

+
+

R +

+
+xlist[[-1]]
+
+
+

ERROR +

+
Error in xlist[[-1]]: invalid negative subscript in get1index <real>
+
+

But you can use names to both subset and extract elements:

+
+

R +

+
+xlist[["a"]]
+
+
+

OUTPUT +

+
[1] "Software Carpentry"
+
+

The $ function is a shorthand way of extracting elements by name:

+
+

R +

+
+xlist$data
+
+
+

OUTPUT +

+
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
+Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
+Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
+Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
+Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
+Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
+Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
+
+
+
+ +
+
+

Challenge 5 +

+
+

Given the following list:

+
+

R +

+
+xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))
+
+

Using your knowledge of both list and vector subsetting, extract the +number 2 from xlist. Hint: the number 2 is contained within the “b” item +in the list.

+
+
+
+
+
+ +
+
+
+

R +

+
+xlist$b[2]
+
+
+

OUTPUT +

+
[1] 2
+
+
+

R +

+
+xlist[[2]][2]
+
+
+

OUTPUT +

+
[1] 2
+
+
+

R +

+
+xlist[["b"]][2]
+
+
+

OUTPUT +

+
[1] 2
+
+
+
+
+
+
+
+ +
+
+

Challenge 6 +

+
+

Given a linear model:

+
+

R +

+
+mod <- aov(pop ~ lifeExp, data=gapminder)
+
+

Extract the residual degrees of freedom (hint: +attributes() will help you)

+
+
+
+
+
+ +
+
+
+

R +

+
+attributes(mod) ## `df.residual` is one of the names of `mod`
+
+
+

R +

+
+mod$df.residual
+
+
+
+
+
+

Data frames +

+
+

Remember that data frames are lists under the hood, so similar rules apply. However, they are also two-dimensional objects:

+

[ with one argument will act the same way as for lists, +where each list element corresponds to a column. The resulting object +will be a data frame:

+
+

R +

+
+head(gapminder[3])
+
+
+

OUTPUT +

+
       pop
+1  8425333
+2  9240934
+3 10267083
+4 11537966
+5 13079460
+6 14880372
+
+

Similarly, [[ will act to extract a single +column:

+
+

R +

+
+head(gapminder[["lifeExp"]])
+
+
+

OUTPUT +

+
[1] 28.801 30.332 31.997 34.020 36.088 38.438
+
+

And $ provides a convenient shorthand to extract columns +by name:

+
+

R +

+
+head(gapminder$year)
+
+
+

OUTPUT +

+
[1] 1952 1957 1962 1967 1972 1977
+
+

With two arguments, [ behaves the same way as for +matrices:

+
+

R +

+
+gapminder[1:3,]
+
+
+

OUTPUT +

+
      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+
+

If we subset a single row, the result will be a data frame (because +the elements are mixed types):

+
+

R +

+
+gapminder[3,]
+
+
+

OUTPUT +

+
      country year      pop continent lifeExp gdpPercap
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+
+

But for a single column the result will be a vector (this can be +changed with the third argument, drop = FALSE).
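The effect of `drop = FALSE` can be checked on a small example. This sketch uses a made-up data frame `toy` as a stand-in for `gapminder`, so the column values are assumptions.

```r
# A hypothetical toy data frame standing in for gapminder
toy <- data.frame(country = c("A", "B", "C"),
                  year = c(1952, 1957, 1962))

# A single column selected with two-argument [ is simplified to a vector
is.data.frame(toy[, "year"])

# drop = FALSE keeps the result as a one-column data frame
is.data.frame(toy[, "year", drop = FALSE])
```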

+
+
+ +
+
+

Challenge 7 +

+
+

Fix each of the following common data frame subsetting errors:

+
    +
  1. Extract observations collected for the year 1957
  2. +
+
+

R +

+
gapminder[gapminder$year = 1957,]
+
+
    +
  1. Extract all columns except 1 through to 4
  2. +
+
+

R +

+
+gapminder[,-1:4]
+
+
    +
  1. Extract the rows where the life expectancy is longer than 80 years
  2. +
+
+

R +

+
+gapminder[gapminder$lifeExp > 80]
+
+
    +
  1. Extract the first row, and the fourth and fifth columns +(continent and lifeExp).
  2. +
+
+

R +

+
+gapminder[1, 4, 5]
+
+
    +
  1. Advanced: extract rows that contain information for the years 2002 +and 2007
  2. +
+
+

R +

+
+gapminder[gapminder$year == 2002 | 2007,]
+
+
+
+
+
+
+ +
+
+

Fix each of the following common data frame subsetting errors:

+
    +
  1. Extract observations collected for the year 1957
  2. +
+
+

R +

+
+# gapminder[gapminder$year = 1957,]
+gapminder[gapminder$year == 1957,]
+
+
    +
  1. Extract all columns except 1 through to 4
  2. +
+
+

R +

+
+# gapminder[,-1:4]
+gapminder[,-c(1:4)]
+
+
    +
  1. Extract the rows where the life expectancy is longer than 80 +years
  2. +
+
+

R +

+
+# gapminder[gapminder$lifeExp > 80]
+gapminder[gapminder$lifeExp > 80,]
+
+
    +
  1. Extract the first row, and the fourth and fifth columns +(continent and lifeExp).
  2. +
+
+

R +

+
+# gapminder[1, 4, 5]
+gapminder[1, c(4, 5)]
+
+
    +
  1. Advanced: extract rows that contain information for the years 2002 +and 2007
  2. +
+
+

R +

+
+# gapminder[gapminder$year == 2002 | 2007,]
+gapminder[gapminder$year == 2002 | gapminder$year == 2007,]
+gapminder[gapminder$year %in% c(2002, 2007),]
+
+
+
+
+
+
+
+ +
+
+

Challenge 8 +

+
+
    +
  1. Why does gapminder[1:20] return an error? How does +it differ from gapminder[1:20, ]?

  2. +
  3. Create a new data.frame called +gapminder_small that only contains rows 1 through 9 and 19 +through 23. You can do this in one or two steps.

  4. +
+
+
+
+
+
+ +
+
+
    +
  1. With a single index, [ treats a data.frame as a list of columns, so gapminder[1:20] asks for columns 1 to 20, and gapminder has only 6 columns. To select rows, subset on two dimensions: gapminder[1:20, ] gives the first 20 rows and all columns.

  2. +
  3. +
  4. +
+
+

R +

+
+gapminder_small <- gapminder[c(1:9, 19:23),]
+
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
    +
  • Indexing in R starts at 1, not 0.
  • +
  • Access individual values by location using [].
  • +
  • Access slices of data using [low:high].
  • +
  • Access arbitrary sets of data using [c(...)].
  • +
  • Use logical operations and logical vectors to access subsets of +data.
  • +
+
+
+
+

Content from Control Flow

+
+

Last updated on 2023-10-26

+

Estimated time 65 minutes

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I make data-dependent choices in R?
  • +
  • How can I repeat operations in R?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • Write conditional statements with if...else statements +and ifelse().
  • +
  • Write and understand for() loops.
  • +
+
+
+
+
+
+

Often when we’re coding we want to control the flow of our actions. +This can be done by setting actions to occur only if a condition or a +set of conditions are met. Alternatively, we can also set an action to +occur a particular number of times.

+

There are several ways you can control flow in R. For conditional +statements, the most commonly used approaches are the constructs:

+
+

R +

+
# if
+if (condition is true) {
+  perform action
+}
+
+# if ... else
+if (condition is true) {
+  perform action
+} else {  # that is, if the condition is false,
+  perform alternative action
+}
+
+

Say, for example, that we want R to print a message if a variable +x has a particular value:

+
+

R +

+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+}
+
+x
+
+
+

OUTPUT +

+
[1] 8
+
+

The print statement does not appear in the console because x is not greater than or equal to 10. To print a different message for numbers less than 10, we can add an else statement.

+
+

R +

+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else {
+  print("x is less than 10")
+}
+
+
+

OUTPUT +

+
[1] "x is less than 10"
+
+

You can also test multiple conditions by using +else if.

+
+

R +

+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else if (x > 5) {
+  print("x is greater than 5, but less than 10")
+} else {
+  print("x is less than or equal to 5")
+}
+
+
+

OUTPUT +

+
[1] "x is greater than 5, but less than 10"
+
+

Important: when R evaluates the condition inside +if() statements, it is looking for a logical element, i.e., +TRUE or FALSE. This can cause some headaches +for beginners. For example:

+
+

R +

+
+x  <-  4 == 3
+if (x) {
+  "4 equals 3"
+} else {
+  "4 does not equal 3"
+}
+
+
+

OUTPUT +

+
[1] "4 does not equal 3"
+
+

As we can see, the “does not equal” message was printed because the vector x is FALSE:

+
+

R +

+
+x <- 4 == 3
+x
+
+
+

OUTPUT +

+
[1] FALSE
+
+
+
+ +
+
+

Challenge 1 +

+
+

Use an if() statement to print a suitable message +reporting whether there are any records from 2002 in the +gapminder dataset. Now do the same for 2012.

+
+
+
+
+
+ +
+
+

We will first see a solution to Challenge 1 which does not use the any() function. We first build a logical vector describing which elements of gapminder$year are equal to 2002, and use it to select the matching rows:

+
+

R +

+
+gapminder[(gapminder$year == 2002),]
+
+

Then, we count the number of rows of the data.frame gapminder that correspond to the year 2002:

+
+

R +

+
+rows2002_number <- nrow(gapminder[(gapminder$year == 2002),])
+
+

The presence of any record for the year 2002 is equivalent to the condition that rows2002_number is one or more:

+
+

R +

+
+rows2002_number >= 1
+
+

Putting it all together, we obtain:

+
+

R +

+
+if(nrow(gapminder[(gapminder$year == 2002),]) >= 1){
+   print("Record(s) for the year 2002 found.")
+}
+
+

All this can be done more quickly with any(). The +logical condition can be expressed as:

+
+

R +

+
+if(any(gapminder$year == 2002)){
+   print("Record(s) for the year 2002 found.")
+}
+
+
+
+
+
+

Did anyone get an error message like this?

+
+

ERROR +

+
Error in if (gapminder$year == 2012) {: the condition has length > 1
+
+

The if() function only accepts singular (of length 1) inputs, and therefore returns an error when you use it with a vector. (Versions of R before 4.2.0 instead issued a warning and evaluated only the first element of the vector.) Therefore, to use the if() function, you need to make sure your input is singular (of length 1).

+
+
+ +
+
+

Tip: Built in ifelse() +function +

+
+

R accepts both if() and +else if() statements structured as outlined above, but also +statements using R’s built-in ifelse() +function. This function accepts both singular and vector inputs and is +structured as follows:

+
+

R +

+
# ifelse function
+ifelse(condition is true, perform action, perform alternative action)
+
+

where the first argument is the condition or a set of conditions to be met, the second argument is the statement that is evaluated when the condition is TRUE, and the third argument is the statement that is evaluated when the condition is FALSE.

+
+

R +

+
+y <- -3
+ifelse(y < 0, "y is a negative number", "y is either positive or zero")
+
+
+

OUTPUT +

+
[1] "y is a negative number"
+
+
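Because `ifelse()` accepts vector inputs, the same call can label several values at once. The sketch below is illustrative only; the vector `y` and the labels are arbitrary.

```r
y <- c(-3, 0, 2)

# The condition is evaluated for every element of y,
# and one label is returned per element
signs <- ifelse(y < 0, "negative", "positive or zero")
signs
```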
+
+
+
+
+ +
+
+

Tip: any() and +all() +

+
+

The any() function will return TRUE if at +least one TRUE value is found within a vector, otherwise it +will return FALSE. This can be used in a similar way to the +%in% operator. The function all(), as the name +suggests, will only return TRUE if all values in the vector +are TRUE.
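As a small sketch of how `any()` and `all()` fit together with `if()`, the following uses a hypothetical vector `years` in place of `gapminder$year`; the messages are likewise made up.

```r
# Hypothetical years, standing in for gapminder$year
years <- c(1952, 1957, 2002, 2007)

# any() reduces the logical vector to a single TRUE or FALSE,
# which is what if() needs
if (any(years == 2012)) {
  msg <- "Record(s) for the year 2012 found."
} else {
  msg <- "No records for the year 2012."
}
msg

# all() is TRUE only when every element passes the test
all(years >= 1952)
```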

+
+
+
+

Repeating operations +

+
+

If you want to iterate over a set of values, when the order of iteration is important, and perform the same operation on each, a for() loop will do the job. We saw for() loops in the shell lessons earlier. This is the most flexible of looping operations, and therefore also the hardest to use correctly. In general, the advice of many R users would be to learn about for() loops, but to avoid using them unless the order of iteration is important, i.e. the calculation at each iteration depends on the results of previous iterations. If the order of iteration is not important, then you should learn about vectorized operations and functional alternatives, such as the purrr package, as they are usually more concise and often more efficient.

+

The basic structure of a for() loop is:

+
+

R +

+
for (iterator in set of values) {
+  do a thing
+}
+
+

For example:

+
+

R +

+
+for (i in 1:10) {
+  print(i)
+}
+
+
+

OUTPUT +

+
[1] 1
+[1] 2
+[1] 3
+[1] 4
+[1] 5
+[1] 6
+[1] 7
+[1] 8
+[1] 9
+[1] 10
+
+

The 1:10 bit creates a vector on the fly; you can +iterate over any other vector as well.

+

We can use a for() loop nested within another +for() loop to iterate over two things at once.

+
+

R +

+
+for (i in 1:5) {
+  for (j in c('a', 'b', 'c', 'd', 'e')) {
+    print(paste(i,j))
+  }
+}
+
+
+

OUTPUT +

+
[1] "1 a"
+[1] "1 b"
+[1] "1 c"
+[1] "1 d"
+[1] "1 e"
+[1] "2 a"
+[1] "2 b"
+[1] "2 c"
+[1] "2 d"
+[1] "2 e"
+[1] "3 a"
+[1] "3 b"
+[1] "3 c"
+[1] "3 d"
+[1] "3 e"
+[1] "4 a"
+[1] "4 b"
+[1] "4 c"
+[1] "4 d"
+[1] "4 e"
+[1] "5 a"
+[1] "5 b"
+[1] "5 c"
+[1] "5 d"
+[1] "5 e"
+
+

We notice in the output that when the first index (i) is +set to 1, the second index (j) iterates through its full +set of indices. Once the indices of j have been iterated +through, then i is incremented. This process continues +until the last index has been used for each for() loop.

+

Rather than printing the results, we could write the loop output to a +new object.

+
+

R +

+
+output_vector <- c()
+for (i in 1:5) {
+  for (j in c('a', 'b', 'c', 'd', 'e')) {
+    temp_output <- paste(i, j)
+    output_vector <- c(output_vector, temp_output)
+  }
+}
+output_vector
+
+
+

OUTPUT +

+
 [1] "1 a" "1 b" "1 c" "1 d" "1 e" "2 a" "2 b" "2 c" "2 d" "2 e" "3 a" "3 b"
+[13] "3 c" "3 d" "3 e" "4 a" "4 b" "4 c" "4 d" "4 e" "5 a" "5 b" "5 c" "5 d"
+[25] "5 e"
+
+

This approach can be useful, but ‘growing your results’ (building the +result object incrementally) is computationally inefficient, so avoid it +when you are iterating through a lot of values.

+
+
+ +
+
+

Tip: don’t grow your results +

+
+

One of the biggest things that trips up novices and experienced R users alike is building a results object (vector, list, matrix, data frame) as your for loop progresses. Computers are very bad at handling this, so your calculations can very quickly slow to a crawl. It’s much better to define an empty results object of appropriate dimensions beforehand, rather than initializing an empty object without dimensions. So if you know the end result will be stored in a matrix, as in the example that follows, create an empty matrix with 5 rows and 5 columns, then at each iteration store the results in the appropriate location.
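For a simple vector result the idea can be sketched as follows. The names `n` and `squares` are illustrative only and do not come from the lesson:

```r
n <- 10

# Preallocate a numeric vector of the final length (filled with zeros)
squares <- numeric(n)

for (i in seq_len(n)) {
  # Assign into an existing slot instead of appending with c()
  squares[i] <- i^2
}
squares
```

The loop body only overwrites existing elements, so R does not need to copy the whole vector at every iteration.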

+
+
+
+

A better way is to define your (empty) output object before filling +in the values. For this example, it looks more involved, but is still +more efficient.

+
+

R +

+
+output_matrix <- matrix(nrow = 5, ncol = 5)
+j_vector <- c('a', 'b', 'c', 'd', 'e')
+for (i in 1:5) {
+  for (j in 1:5) {
+    temp_j_value <- j_vector[j]
+    temp_output <- paste(i, temp_j_value)
+    output_matrix[i, j] <- temp_output
+  }
+}
+output_vector2 <- as.vector(output_matrix)
+output_vector2
+
+
+

OUTPUT +

+
 [1] "1 a" "2 a" "3 a" "4 a" "5 a" "1 b" "2 b" "3 b" "4 b" "5 b" "1 c" "2 c"
+[13] "3 c" "4 c" "5 c" "1 d" "2 d" "3 d" "4 d" "5 d" "1 e" "2 e" "3 e" "4 e"
+[25] "5 e"
+
+
+
+ +
+
+

Tip: While loops +

+
+

Sometimes you will find yourself needing to repeat an operation as +long as a certain condition is met. You can do this with a +while() loop.

+
+

R +

+
while(this condition is true){
+  do a thing
+}
+
+

R will interpret a condition being met as “TRUE”.

+

As an example, here’s a while loop that generates random numbers from +a uniform distribution (the runif() function) between 0 and +1 until it gets one that’s less than 0.1.

+
+

R +

+
+z <- 1
+while(z > 0.1){
+  z <- runif(1)
+  cat(z, "\n")
+}
+
+

while() loops will not always be appropriate. You have +to be particularly careful that you don’t end up stuck in an infinite +loop because your condition is always met and hence the while statement +never terminates.
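One common way to guard against an infinite loop is to add an upper limit on the number of iterations. The sketch below assumes a hypothetical limit called `max_iterations`, and uses set.seed() only to make the random draws reproducible:

```r
set.seed(1)              # reproducible random numbers
max_iterations <- 1000   # hypothetical safety limit
iterations <- 0
z <- 1

# Stop when z is small enough OR when the safety limit is reached
while (z > 0.1 && iterations < max_iterations) {
  z <- runif(1)
  iterations <- iterations + 1
}

cat("Stopped after", iterations, "iterations with z =", z, "\n")
```

With the second condition in place the loop is guaranteed to end, even if the first condition never becomes FALSE.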

+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

Compare the objects output_vector and +output_vector2. Are they the same? If not, why not? How +would you change the last block of code to make +output_vector2 the same as output_vector?

+
+
+
+
+
+ +
+
+

We can check whether the two vectors are identical using the all() function; here the comparison returns FALSE:

+
+

R +

+
+all(output_vector == output_vector2)
+
+

However, all the elements of output_vector can be found +in output_vector2:

+
+

R +

+
+all(output_vector %in% output_vector2)
+
+

and vice versa:

+
+

R +

+
+all(output_vector2 %in% output_vector)
+
+

therefore, the elements in output_vector and output_vector2 are just sorted in a different order. This is because as.vector() outputs the elements of an input matrix column by column. Taking a look at output_matrix, we can notice that we want its elements by rows. The solution is to transpose the output_matrix. We can do it either by calling the transpose function t() or by inputting the elements in the right order. The first solution requires changing the original

+
+

R +

+
+output_vector2 <- as.vector(output_matrix)
+
+

into

+
+

R +

+
+output_vector2 <- as.vector(t(output_matrix))
+
+

The second solution requires changing

+
+

R +

+
+output_matrix[i, j] <- temp_output
+
+

into

+
+

R +

+
+output_matrix[j, i] <- temp_output
+
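The column-by-column behaviour of as.vector() described in this solution is easier to see on a small matrix. A minimal sketch, using an illustrative matrix `m` that is not part of the lesson:

```r
# Small illustrative matrix; matrix() also fills column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
m

# as.vector() reads down each column in turn
as.vector(m)     # 1 2 3 4 5 6

# Transposing first gives the elements row by row
as.vector(t(m))  # 1 3 5 2 4 6
```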
+
+
+
+
+
+
+ +
+
+

Challenge 3 +

+
+

Write a script that loops through the gapminder data by +continent and prints out whether the mean life expectancy is smaller or +larger than 50 years.

+
+
+
+
+
+ +
+
+

Step 1: We want to make sure we can extract all the +unique values of the continent vector

+
+

R +

+
+gapminder <- read.csv("data/gapminder_data.csv")
+unique(gapminder$continent)
+
+

Step 2: We also need to loop over each of these +continents and calculate the average life expectancy for each +subset of data. We can do that as follows:

+
    +
  1. Loop over each of the unique values of ‘continent’
  2. For each value of continent, create a temporary variable storing that subset
  3. Return the calculated life expectancy to the user by printing the output:
+
+

R +

+
+for (iContinent in unique(gapminder$continent)) {
+  tmp <- gapminder[gapminder$continent == iContinent, ]
+  cat(iContinent, mean(tmp$lifeExp, na.rm = TRUE), "\n")
+  rm(tmp)
+}
+
+

Step 3: The exercise only wants the output printed +if the average life expectancy is less than 50 or greater than 50. So we +need to add an if() condition before printing, which +evaluates whether the calculated average life expectancy is above or +below a threshold, and prints an output conditional on the result. We +need to amend (3) from above:

+

3a. If the calculated life expectancy is less than some threshold (50 +years), return the continent and a statement that life expectancy is +less than threshold, otherwise return the continent and a statement that +life expectancy is greater than threshold:

+
+

R +

+
+thresholdValue <- 50
+
+for (iContinent in unique(gapminder$continent)) {
+   tmp <- mean(gapminder[gapminder$continent == iContinent, "lifeExp"])
+
+   if (tmp < thresholdValue){
+       cat("Average Life Expectancy in", iContinent, "is less than", thresholdValue, "\n")
+   } else {
+       cat("Average Life Expectancy in", iContinent, "is greater than", thresholdValue, "\n")
+   } # end if else condition
+   rm(tmp)
+} # end for loop
+
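Since the order of iteration does not matter here, the same comparison can also be written without an explicit loop. The following is only a sketch of that alternative, using base R's tapply() on a small made-up data frame (`toy` and its values are hypothetical stand-ins for gapminder):

```r
# Hypothetical stand-in for the gapminder data frame
toy <- data.frame(
  continent = c("Africa", "Africa", "Europe", "Europe"),
  lifeExp   = c(45, 51, 70, 74)
)
thresholdValue <- 50

# Mean life expectancy per continent, computed in one call
mean_lifeExp <- tapply(toy$lifeExp, toy$continent, mean)

# ifelse() then labels every continent at once
comparison <- ifelse(mean_lifeExp < thresholdValue, "less than", "greater than")

cat(paste("Average Life Expectancy in", names(mean_lifeExp), "is",
          comparison, thresholdValue), sep = "\n")
```

The loop version remains the intended answer to this challenge; the sketch just shows what a vectorized alternative can look like.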
+
+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

Modify the script from Challenge 3 to loop over each country. This +time print out whether the life expectancy is smaller than 50, between +50 and 70, or greater than 70.

+
+
+
+
+
+ +
+
+

We modify our solution to Challenge 3 by now adding two thresholds, +lowerThreshold and upperThreshold and +extending our if-else statements:

+
+

R +

+
+ lowerThreshold <- 50
+ upperThreshold <- 70
+
+for (iCountry in unique(gapminder$country)) {
+    tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"])
+
+    if(tmp < lowerThreshold) {
+        cat("Average Life Expectancy in", iCountry, "is less than", lowerThreshold, "\n")
+    } else if(tmp >= lowerThreshold && tmp <= upperThreshold) {
+        cat("Average Life Expectancy in", iCountry, "is between", lowerThreshold, "and", upperThreshold, "\n")
+    } else {
+        cat("Average Life Expectancy in", iCountry, "is greater than", upperThreshold, "\n")
+    }
+    rm(tmp)
+}
+
+
+
+
+
+
+
+ +
+
+

Challenge 5 - Advanced +

+
+

Write a script that loops over each country in the +gapminder dataset, tests whether the country starts with a +‘B’, and graphs life expectancy against time as a line graph if the mean +life expectancy is under 50 years.

+
+
+
+
+
+ +
+
+

We will use the grep() command that was introduced in the Unix Shell lesson to find countries that start with “B.” Let’s understand how to do this first. Following from the Unix shell section we may be tempted to try the following:

+
+

R +

+
+grep("^B", unique(gapminder$country))
+
+

But when we evaluate this command it returns the indices of the elements of unique(gapminder$country) that start with “B.” To get the values themselves, we must add the value=TRUE option to the grep() command:

+
+

R +

+
+grep("^B", unique(gapminder$country), value = TRUE)
+
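The difference between the two calls is easiest to see on a short vector. A small sketch, assuming a made-up vector `countries` rather than the full gapminder list:

```r
# Hypothetical country names
countries <- c("Belgium", "Albania", "Brazil", "Cuba")

grep("^B", countries)                # positions of the matches: 1 3
grep("^B", countries, value = TRUE)  # the matching values themselves
grepl("^B", countries)               # one TRUE/FALSE per element
```

grepl() is the logical variant of grep(), which is convenient when the result is used to subset a data frame.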
+

We will now store these countries in a variable called +candidateCountries, and then loop over each entry in the variable. +Inside the loop, we evaluate the average life expectancy for each +country, and if the average life expectancy is less than 50 we use +base-plot to plot the evolution of average life expectancy using +with() and subset():

+
+

R +

+
+thresholdValue <- 50
+candidateCountries <- grep("^B", unique(gapminder$country), value = TRUE)
+
+for (iCountry in candidateCountries) {
+    tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"])
+
+    if (tmp < thresholdValue) {
+        cat("Average Life Expectancy in", iCountry, "is less than", thresholdValue, "plotting life expectancy graph... \n")
+
+        with(subset(gapminder, country == iCountry),
+                plot(year, lifeExp,
+                     type = "o",
+                     main = paste("Life Expectancy in", iCountry, "over time"),
+                     ylab = "Life Expectancy",
+                     xlab = "Year"
+                     ) # end plot
+             ) # end with
+    } # end if
+    rm(tmp)
+} # end for loop
+
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
    +
  • Use if and else to make choices.
  • +
  • Use for to repeat operations.
  • +
+
+
+
+

Content from Creating Publication-Quality Graphics with ggplot2

+
+

Last updated on 2023-10-26

+

Estimated time 80 minutes

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I create publication-quality graphics in R?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • To be able to use ggplot2 to generate publication-quality +graphics.
  • +
  • To apply geometry, aesthetic, and statistics layers to a ggplot +plot.
  • +
  • To manipulate the aesthetics of a plot using different colors, +shapes, and lines.
  • +
  • To improve data visualization through transforming scales and +paneling by group.
  • +
  • To save a plot created with ggplot to disk.
  • +
+
+
+
+
+
+

Plotting our data is one of the best ways to quickly explore it and +the various relationships between variables.

+

There are three main plotting systems in R, the base plotting +system, the lattice +package, and the ggplot2 +package.

+

Today we’ll be learning about the ggplot2 package, because it is the +most effective for creating publication-quality graphics.

+

ggplot2 is built on the grammar of graphics, the idea that any plot +can be built from the same set of components: a data +set, mapping aesthetics, and graphical +layers:

+
    +
  • Data sets are the data that you, the user, +provide.

  • +
  • Mapping aesthetics are what connect the data to +the graphics. They tell ggplot2 how to use your data to affect how the +graph looks, such as changing what is plotted on the X or Y axis, or the +size or color of different data points.

  • +
  • Layers are the actual graphical output from +ggplot2. Layers determine what kinds of plot are shown (scatterplot, +histogram, etc.), the coordinate system used (rectangular, polar, +others), and other important aspects of the plot. The idea of layers of +graphics may be familiar to you if you have used image editing programs +like Photoshop, Illustrator, or Inkscape.

  • +
+

Let’s start off building an example using the gapminder data from +earlier. The most basic function is ggplot, which lets R +know that we’re creating a new plot. Any of the arguments we give the +ggplot function are the global options for the +plot: they apply to all layers on the plot.

+
+

R +

+
+library("ggplot2")
+ggplot(data = gapminder)
+
+
Blank plot, before adding any mapping aesthetics to ggplot().

Here we called ggplot and told it what data we want to +show on our figure. This is not enough information for +ggplot to actually draw anything. It only creates a blank +slate for other elements to be added to.

+

Now we’re going to add in the mapping aesthetics +using the aes function. aes tells +ggplot how variables in the data map to +aesthetic properties of the figure, such as which columns of +the data should be used for the x and +y locations.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
+
+
Plotting area with axes for a scatter plot of life expectancy vs GDP, with no data points visible.

Here we told ggplot we want to plot the “gdpPercap” column of the gapminder data frame on the x-axis, and the “lifeExp” column on the y-axis. Notice that we didn’t need to explicitly pass aes these columns (e.g. x = gapminder[, "gdpPercap"]). This is because ggplot is smart enough to know to look in the data for that column!

+

The final part of making our plot is to tell ggplot how +we want to visually represent the data. We do this by adding a new +layer to the plot using one of the +geom functions.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point()
+
+
Scatter plot of life expectancy vs GDP per capita, now showing the data points.

Here we used geom_point, which tells ggplot +we want to visually represent the relationship between +x and y as a scatterplot of +points.

+
+
+ +
+
+

Challenge 1 +

+
+

Modify the example so that the figure shows how life expectancy has +changed over time:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point()
+
+

Hint: the gapminder dataset has a column called “year”, which should +appear on the x-axis.

+
+
+
+
+
+ +
+
+

Here is one possible solution:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) + geom_point()
+
+
Binned scatterplot of life expectancy versus year showing how life expectancy has increased over time
+
+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

In the previous examples and challenge we’ve used the +aes function to tell the scatterplot geom +about the x and y locations of each +point. Another aesthetic property we can modify is the point +color. Modify the code from the previous challenge to +color the points by the “continent” column. What trends +do you see in the data? Are they what you expected?

+
+
+
+
+
+ +
+
+

The solution presented below adds color=continent to the +call of the aes function. The general trend seems to +indicate an increased life expectancy over the years. On continents with +stronger economies we find a longer life expectancy.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_point()
+
+
Binned scatterplot of life expectancy vs year with color-coded continents showing value of 'aes' function
+
+
+
+
+

Layers +

+
+

Using a scatterplot probably isn’t the best for visualizing change +over time. Instead, let’s tell ggplot to visualize the data +as a line plot:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, color=continent)) +
+  geom_line()
+
+

Instead of adding a geom_point layer, we’ve added a +geom_line layer.

+

However, the result doesn’t look quite as we might have expected: it +seems to be jumping around a lot in each continent. Let’s try to +separate the data by country, plotting one line for each country:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
+  geom_line()
+
+

We’ve added the group aesthetic, which +tells ggplot to draw a line for each country.

+

But what if we want to visualize both lines and points on the plot? +We can add another layer to the plot:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
+  geom_line() + geom_point()
+
+

It’s important to note that each layer is drawn on top of the +previous layer. In this example, the points have been drawn on top +of the lines. Here’s a demonstration:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
+  geom_line(mapping = aes(color=continent)) + geom_point()
+
+

In this example, the aesthetic mapping of +color has been moved from the global plot options in +ggplot to the geom_line layer so it no longer +applies to the points. Now we can clearly see that the points are drawn +on top of the lines.

+
+
+ +
+
+

Tip: Setting an aesthetic to a value instead +of a mapping +

+
+

So far, we’ve seen how to use an aesthetic (such as +color) as a mapping to a variable in the data. +For example, when we use +geom_line(mapping = aes(color=continent)), ggplot will give +a different color to each continent. But what if we want to change the +color of all lines to blue? You may think that +geom_line(mapping = aes(color="blue")) should work, but it +doesn’t. Since we don’t want to create a mapping to a specific variable, +we can move the color specification outside of the aes() +function, like this: geom_line(color="blue").
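The two cases can be put side by side in a short sketch. It assumes ggplot2 is installed and uses a tiny made-up data frame `toy` in place of gapminder:

```r
library(ggplot2)

# Hypothetical toy data (a stand-in for gapminder)
toy <- data.frame(
  year    = c(1952, 2002, 1952, 2002),
  lifeExp = c(40, 55, 65, 78),
  country = c("A", "A", "B", "B")
)

# Mapping: colour is driven by a column, so it goes inside aes()
p_mapped <- ggplot(toy, aes(x = year, y = lifeExp, group = country)) +
  geom_line(mapping = aes(color = country))

# Setting: one fixed colour for every line, so it goes outside aes()
p_set <- ggplot(toy, aes(x = year, y = lifeExp, group = country)) +
  geom_line(color = "blue")
```

Printing `p_mapped` draws one colour per country with a legend, while printing `p_set` draws every line in blue with no legend.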

+
+
+
+
+
+ +
+
+

Challenge 3 +

+
+

Switch the order of the point and line layers from the previous +example. What happened?

+
+
+
+
+
+ +
+
+

The lines now get drawn over the points!

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
+ geom_point() + geom_line(mapping = aes(color=continent))
+
+
Plot of life expectancy vs year with black points and, drawn over them, one line per country colored by continent.
+
+
+
+
+

Transformations and statistics +

+
+

ggplot2 also makes it easy to overlay statistical models over the +data. To demonstrate we’ll go back to our first example:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point()
+
+

Currently it’s hard to see the relationship between the points due to some strong outliers in GDP per capita. We can change the scale of units on the x axis using the scale functions. These control the mapping between the data values and visual values of an aesthetic. We can also modify the transparency of the points using the alpha argument, which is especially helpful when you have a large amount of data which is very clustered.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10()
+
+
Scatterplot of GDP vs life expectancy showing logarithmic x-axis data spread

The scale_x_log10 function applied a log10 transformation to the x-axis scale of the plot, so that each factor of 10 is evenly spaced from left to right. For example, a GDP per capita of 1,000 is the same horizontal distance away from a value of 10,000 as the 10,000 value is from 100,000. This helps to visualize the spread of the data along the x-axis.
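The equal spacing can be checked with a little arithmetic in base R. A small sketch using the three values mentioned above (the variable names are illustrative):

```r
gdp <- c(1000, 10000, 100000)

# Positions of these values on a log10 axis
log_positions <- log10(gdp)
log_positions        # 3 4 5

# The gaps between neighbouring values are equal on the log scale
diff(log_positions)  # 1 1
```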

+
+
+ +
+
+

Tip Reminder: Setting an aesthetic to a value +instead of a mapping +

+
+

Notice that we used geom_point(alpha = 0.5). As the +previous tip mentioned, using a setting outside of the +aes() function will cause this value to be used for all +points, which is what we want in this case. But just like any other +aesthetic setting, alpha can also be mapped to a variable in +the data. For example, we can give a different transparency to each +continent with +geom_point(mapping = aes(alpha = continent)).

+
+
+
+

We can fit a simple relationship to the data by adding another layer, +geom_smooth:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm")
+
+
+

OUTPUT +

+
`geom_smooth()` using formula = 'y ~ x'
+
+
Scatter plot of life expectancy vs GDP per capita with a blue trend line summarising the relationship between variables, and gray shaded area indicating 95% confidence intervals for that trend line.

We can make the line thicker by setting the +size aesthetic in the geom_smooth +layer:

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm", size=1.5)
+
+
+

WARNING +

+
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
+ℹ Please use `linewidth` instead.
+This warning is displayed once every 8 hours.
+Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
+generated.
+
+
+

OUTPUT +

+
`geom_smooth()` using formula = 'y ~ x'
+
+
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The blue trend line is slightly thicker than in the previous figure.

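There are two ways an aesthetic can be specified. Here we set the size aesthetic by passing it as an argument to geom_smooth. Previously in the lesson we’ve used the aes function to define a mapping between data variables and their visual representation. As the warning above indicates, ggplot2 3.4.0 and later use linewidth rather than size for the width of lines, so geom_smooth(method="lm", linewidth=1.5) gives the same figure without the deprecation warning.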

+
+
+ +
+
+

Challenge 4a +

+
+

Modify the color and size of the points on the point layer in the +previous example.

+

Hint: do not use the aes function.

+
+
+
+
+
+ +
+
+

Here is a possible solution: Notice that the color argument is supplied outside of the aes() function. This means that it applies to all data points on the graph and is not related to a specific variable.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+ geom_point(size=3, color="orange") + scale_x_log10() +
+ geom_smooth(method="lm", linewidth=1.5)
+
+
+

OUTPUT +

+
`geom_smooth()` using formula = 'y ~ x'
+
+
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency.
+
+
+
+
+
+
+ +
+
+

Challenge 4b +

+
+

Modify your solution to Challenge 4a so that the points are now a +different shape and are colored by continent with new trendlines. Hint: +The color argument can be used inside the aesthetic.

+
+
+
+
+
+ +
+
+

Here is a possible solution: Notice that supplying the color argument inside the aes() function enables you to connect it to a certain variable. The shape argument, as you can see, modifies all data points the same way (it is outside the aes() call) while the color argument which is placed inside the aes() call modifies a point’s color based on its continent value.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
+ geom_point(size=3, shape=17) + scale_x_log10() +
+ geom_smooth(method="lm", linewidth=1.5)
+
+
+

OUTPUT +

+
`geom_smooth()` using formula = 'y ~ x'
+
+
+
+
+
+
+

Multi-panel figures +

+
+

Earlier we visualized the change in life expectancy over time across +all countries in one plot. Alternatively, we can split this out over +multiple panels by adding a layer of facet panels.

+
+
+ +
+
+

Tip +

+
+

We start by making a subset of data including only countries located +in the Americas. This includes 25 countries, which will begin to clutter +the figure. Note that we apply a “theme” definition to rotate the x-axis +labels to maintain readability. Nearly everything in ggplot2 is +customizable.

+
+
+
+
+

R +

+
+americas <- gapminder[gapminder$continent == "Americas",]
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))
+
+

The facet_wrap layer took a “formula” as its argument, +denoted by the tilde (~). This tells R to draw a panel for each unique +value in the country column of the gapminder dataset.

+

Modifying text +

+
+

To clean this figure up for a publication we need to change some of +the text elements. The x-axis is too cluttered, and the y axis should +read “Life expectancy”, rather than the column name in the data +frame.

+

We can do this by adding a couple of different layers. The +theme layer controls the axis text, and overall text +size. Labels for the axes, plot title and any legend can be set using +the labs function. Legend titles are set using the same +names we used in the aes specification. Thus below the +color legend title is set using color = "Continent", while +the title of a fill legend would be set using +fill = "MyTitle".

+
+

R +

+
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_line() + facet_wrap( ~ country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+
+

Exporting the plot +

+
+

The ggsave() function allows you to export a plot +created with ggplot. You can specify the dimension and resolution of +your plot by adjusting the appropriate arguments (width, +height and dpi) to create high quality +graphics for publication. In order to save the plot from above, we first +assign it to a variable lifeExp_plot, then tell +ggsave to save that plot in png format to a +directory called results. (Make sure you have a +results/ folder in your working directory.)

+
+

R +

+
+lifeExp_plot <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_line() + facet_wrap( ~ country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+
+ggsave(filename = "results/lifeExp.png", plot = lifeExp_plot, width = 12, height = 10, dpi = 300, units = "cm")
+
+

There are two nice things about ggsave. First, it +defaults to the last plot, so if you omit the plot argument +it will automatically save the last plot you created with +ggplot. Secondly, it tries to determine the format you want +to save your plot in from the file extension you provide for the +filename (for example .png or .pdf). If you +need to, you can specify the format explicitly in the +device argument.
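Both ways of choosing the output format can be tried without touching the results/ folder. The sketch below assumes ggplot2 is installed and writes a made-up plot to temporary files; `toy`, `toy_plot` and the file names are hypothetical:

```r
library(ggplot2)

# Hypothetical toy plot
toy <- data.frame(x = 1:3, y = c(2, 4, 8))
toy_plot <- ggplot(toy, aes(x = x, y = y)) + geom_line()

# Format inferred from the .pdf extension
inferred_file <- tempfile(fileext = ".pdf")
ggsave(filename = inferred_file, plot = toy_plot,
       width = 12, height = 10, units = "cm")

# Format given explicitly through the device argument
explicit_file <- tempfile()
ggsave(filename = explicit_file, plot = toy_plot, device = "pdf",
       width = 12, height = 10, units = "cm")

file.exists(c(inferred_file, explicit_file))
```

PDF is used here only because the PDF device needs no extra system libraries; the same pattern applies to .png and other extensions.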

+

This is a taste of what you can do with ggplot2. RStudio provides a +really useful cheat +sheet of the different layers available, and more extensive +documentation is available on the ggplot2 website. All +RStudio cheat sheets can be found here. Finally, +if you have no idea how to change something, a quick Google search will +usually send you to a relevant question and answer on Stack Overflow +with reusable code to modify!

+
+
+ +
+
+

Challenge 5 +

+
+

Generate boxplots to compare life expectancy between the different +continents during the available years.

+

Advanced:

+
    +
  • Rename y axis as Life Expectancy.
  • +
  • Remove x axis labels.
  • +
+
+
+
+
+
+ +
+
+

Here is a possible solution: xlab() and ylab() set labels for the x and y axes, respectively. The axis title, text and ticks are attributes of the theme and must be modified within a theme() call.

+
+

R +

+
+ggplot(data = gapminder, mapping = aes(x = continent, y = lifeExp, fill = continent)) +
+ geom_boxplot() + facet_wrap(~year) +
+ ylab("Life Expectancy") +
+ theme(axis.title.x=element_blank(),
+       axis.text.x = element_blank(),
+       axis.ticks.x = element_blank())
+
+
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
    +
  • Use ggplot2 to create plots.
  • +
  • Think about graphics in layers: aesthetics, geometry, statistics, +scale transformation, and grouping.
  • +
+
+
+
+

Content from Vectorization

+
+

Last updated on 2023-10-26

+

Estimated time 25 minutes

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I operate on all the elements of a vector at once?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • To understand vectorized operations in R.
  • +
+
+
+
+
+
+

Most of R’s functions are vectorized, meaning that the function will +operate on all elements of a vector without needing to loop through and +act on each element one at a time. This makes writing code more concise, +easy to read, and less error prone.

+
+

R +

+
+x <- 1:4
+x * 2
+
+
+

OUTPUT +

+
[1] 2 4 6 8
+
+

The multiplication happened to each element of the vector.

+

We can also add two vectors together:

+
+

R +

+
+y <- 6:9
+x + y
+
+
+

OUTPUT +

+
[1]  7  9 11 13
+
+

Each element of x was added to its corresponding element +of y:

+
+

R +

+
x:  1  2  3  4
+    +  +  +  +
+y:  6  7  8  9
+---------------
+    7  9 11 13
+
+

Here is how we would add two vectors together using a for loop:

+
+

R +

+
+output_vector <- numeric(4)  # preallocate the result instead of growing it
+for (i in 1:4) {
+  output_vector[i] <- x[i] + y[i]
+}
+output_vector
+
+
+

OUTPUT +

+
[1]  7  9 11 13
+
+

Compare this to the output using vectorised operations.

+
+

R +

+
+sum_xy <- x + y
+sum_xy
+
+
+

OUTPUT +

+
[1]  7  9 11 13
+
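Rather than comparing the two printed results by eye, identical() can confirm that they match. A short self-contained sketch, where the names `loop_sum` and `vector_sum` are illustrative:

```r
x <- 1:4
y <- 6:9

# Loop version, with the result vector preallocated
loop_sum <- integer(length(x))
for (i in seq_along(x)) {
  loop_sum[i] <- x[i] + y[i]
}

# Vectorized version
vector_sum <- x + y

# TRUE when both objects are exactly the same
identical(loop_sum, vector_sum)
```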
+
+
+ +
+
+

Challenge 1 +

+
+

Let’s try this on the pop column of the +gapminder dataset.

+

Make a new column in the gapminder data frame that +contains population in units of millions of people. Check the head or +tail of the data frame to make sure it worked.

+
+
+
+
+
+ +
+
+


+
+

R +

+
+gapminder$pop_millions <- gapminder$pop / 1e6
+head(gapminder)
+
+
+

OUTPUT +

+
      country year      pop continent lifeExp gdpPercap pop_millions
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453     8.425333
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530     9.240934
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007    10.267083
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971    11.537966
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811    13.079460
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134    14.880372
+
+
+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

On a single graph, plot population, in millions, against year, for +all countries. Do not worry about identifying which country is +which.

+

Repeat the exercise, graphing only for China, India, and Indonesia. +Again, do not worry about which is which.

+
+
+
+
+
+ +
+
+

Refresh your plotting skills by plotting population in millions +against year.

+
+

R +

+
+ggplot(gapminder, aes(x = year, y = pop_millions)) +
+ geom_point()
+
+
Scatter plot showing populations in the millions against the year for all countries; countries are not labeled.
+

R +

+
+countryset <- c("China","India","Indonesia")
+ggplot(gapminder[gapminder$country %in% countryset,],
+       aes(x = year, y = pop_millions)) +
+  geom_point()
+
+
Scatter plot showing populations in the millions against the year for China, India, and Indonesia, countries are not labeled.
+
+
+
+
+

Comparison operators, logical operators, and many functions are also +vectorized:

+

Comparison operators

+
+

R +

+
+x > 2
+
+
+

OUTPUT +

+
[1] FALSE FALSE  TRUE  TRUE
+
+

Logical operators

+
+

R +

+
+a <- x > 3  # or, for clarity, a <- (x > 3)
+a
+
+
+

OUTPUT +

+
[1] FALSE FALSE FALSE  TRUE
+
+
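The example above stores the result of a comparison. The logical operators themselves (`&`, `|` and `!`) are also applied element by element. A minimal sketch, where the variable names `both`, `either` and `negated` are chosen only for illustration:

```r
x <- 1:4

# Element-wise AND: TRUE where both comparisons hold
both <- x > 1 & x < 4      # FALSE  TRUE  TRUE FALSE

# Element-wise OR: TRUE where at least one comparison holds
either <- x < 2 | x > 3    #  TRUE FALSE FALSE  TRUE

# Element-wise NOT: flips every element
negated <- !(x > 2)        #  TRUE  TRUE FALSE FALSE
```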
+
+ +
+
+

Tip: some useful functions for logical +vectors +

+
+

any() will return TRUE if any +element of a vector is TRUE.
all() will return TRUE if all +elements of a vector are TRUE.
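As a small illustration of these two functions (the variable names here are only for this sketch), both collapse a logical vector to a single value:

```r
x <- 1:4

any_above_three <- any(x > 3)   # TRUE: the last element is greater than 3
all_above_three <- all(x > 3)   # FALSE: the first three elements are not
all_positive    <- all(x > 0)   # TRUE: every element is greater than 0
```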

+
+
+
+

Most functions also operate element-wise on vectors:

+

Functions

+
+

R +

+
+x <- 1:4
+log(x)
+
+
+

OUTPUT +

+
[1] 0.0000000 0.6931472 1.0986123 1.3862944
+
+

Vectorized operations work element-wise on matrices:

+
+

R +

+
+m <- matrix(1:12, nrow=3, ncol=4)
+m * -1
+
+
+

OUTPUT +

+
     [,1] [,2] [,3] [,4]
+[1,]   -1   -4   -7  -10
+[2,]   -2   -5   -8  -11
+[3,]   -3   -6   -9  -12
+
+
+
+ +
+
+

Tip: element-wise vs. matrix +multiplication +

+
+

Very important: the operator * gives you element-wise +multiplication! To do matrix multiplication, we need to use the +%*% operator:

+
+

R +

+
+m %*% matrix(1, nrow=4, ncol=1)
+
+
+

OUTPUT +

+
     [,1]
+[1,]   22
+[2,]   26
+[3,]   30
+
+
+

R +

+
+matrix(1:4, nrow=1) %*% matrix(1:4, ncol=1)
+
+
+

OUTPUT +

+
     [,1]
+[1,]   30
+
+

For more on matrix algebra, see the Quick-R +reference guide

+
+
+
+
+
+ +
+
+

Challenge 3 +

+
+

Given the following matrix:

+
+

R +

+
+m <- matrix(1:12, nrow=3, ncol=4)
+m
+
+
+

OUTPUT +

+
     [,1] [,2] [,3] [,4]
+[1,]    1    4    7   10
+[2,]    2    5    8   11
+[3,]    3    6    9   12
+
+

Write down what you think will happen when you run:

+
  1. m ^ -1
  2. m * c(1, 0, -1)
  3. m > c(0, 20)
  4. m * c(1, 0, -1, 2)
+

Did you get the output you expected? If not, ask a helper!

+
+
+
+
+
+ +
+
+


+
  1. m ^ -1
+
+

OUTPUT +

+
          [,1]      [,2]      [,3]       [,4]
+[1,] 1.0000000 0.2500000 0.1428571 0.10000000
+[2,] 0.5000000 0.2000000 0.1250000 0.09090909
+[3,] 0.3333333 0.1666667 0.1111111 0.08333333
+
+
  2. m * c(1, 0, -1)
+
+

OUTPUT +

+
     [,1] [,2] [,3] [,4]
+[1,]    1    4    7   10
+[2,]    0    0    0    0
+[3,]   -3   -6   -9  -12
+
+
  3. m > c(0, 20)
+
+

OUTPUT +

+
      [,1]  [,2]  [,3]  [,4]
+[1,]  TRUE FALSE  TRUE FALSE
+[2,] FALSE  TRUE FALSE  TRUE
+[3,]  TRUE FALSE  TRUE FALSE
+
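The fourth expression from the challenge, `m * c(1, 0, -1, 2)`, can be worked through the same way. The sketch below assumes the same matrix `m`; the name `res` is only for illustration. R stores matrices column by column, so the length-4 vector is repeated down the columns (three times in total) rather than across the rows, and because 12 is a multiple of 4 no warning is issued.

```r
m <- matrix(1:12, nrow = 3, ncol = 4)

# The vector c(1, 0, -1, 2) is recycled over the 12 elements of m,
# taken in column-major order: 1, 2, 3, 4, 5, ...
res <- m * c(1, 0, -1, 2)
res
# Column 1:  1   0  -3
# Column 2:  8   5   0
# Column 3: -7  16   9
# Column 4:  0 -11  24
```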
+
+
+
+
+
+
+ +
+
+

Challenge 4 +

+
+

We’re interested in looking at the sum of the following sequence of +fractions:

+
+

R +

+
+ x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2)
+
+

This would be tedious to type out, and impossible for high values of +n. Use vectorisation to compute x when n=100. What is the sum when +n=10,000?

+
+
+
+
+
+ +
+
+

We’re interested in looking at the sum of the following sequence of +fractions:

+
+

R +

+
+ x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2)
+
+

This would be tedious to type out, and impossible for high values of +n. Can you use vectorisation to compute x, when n=100? How about when +n=10,000?

+
+

R +

+
+sum(1/(1:100)^2)
+
+
+

OUTPUT +

+
[1] 1.634984
+
+
+

R +

+
+sum(1/(1:1e04)^2)
+
+
+

OUTPUT +

+
[1] 1.644834
+
+
+

R +

+
+n <- 10000
+sum(1/(1:n)^2)
+
+
+

OUTPUT +

+
[1] 1.644834
+
+

We can also obtain the same results using a function:

+
+

R +

+
+inverse_sum_of_squares <- function(n) {
+  sum(1/(1:n)^2)
+}
+inverse_sum_of_squares(100)
+
+
+

OUTPUT +

+
[1] 1.634984
+
+
+

R +

+
+inverse_sum_of_squares(10000)
+
+
+

OUTPUT +

+
[1] 1.644834
+
+
+

R +

+
+n <- 10000
+inverse_sum_of_squares(n)
+
+
+

OUTPUT +

+
[1] 1.644834
+
+
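As a sanity check on these numbers, this series is known to converge to π²/6 (about 1.644934). The sketch below compares the partial sum with that limit; the names `partial`, `limit` and `gap` are only for illustration.

```r
n <- 10000
partial <- sum(1 / (1:n)^2)   # vectorised partial sum
limit <- pi^2 / 6             # known limit of the infinite series

# The remaining gap shrinks roughly like 1/n
gap <- limit - partial
gap
```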
+
+
+
+
+
+ +
+
+

Tip: Operations on vectors of unequal +length +

+
+

Operations can also be performed on vectors of unequal length, through a process known as recycling. This process automatically repeats the shorter vector until it matches the length of the longer vector. R will issue a warning if the length of the longer vector is not a multiple of the length of the shorter vector.

+
+

R +

+
+x <- c(1, 2, 3)
+y <- c(1, 2, 3, 4, 5, 6, 7)
+x + y
+
+
+

WARNING +

+
Warning in x + y: longer object length is not a multiple of shorter object
+length
+
+
+

OUTPUT +

+
[1] 2 4 6 5 7 9 8
+
+

Vector x was recycled to match the length of vector y.

+
+

R +

+
x:  1  2  3  1  2  3  1
+    +  +  +  +  +  +  +
+y:  1  2  3  4  5  6  7
+-----------------------
+    2  4  6  5  7  9  8
+
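For contrast, here is a small sketch of recycling where the longer length is a multiple of the shorter one, so R stays silent. The vector names are hypothetical.

```r
short_vec <- c(10, 20)
long_vec <- 1:4

# short_vec is repeated twice: 10, 20, 10, 20
recycled_sum <- long_vec + short_vec   # 11 22 13 24

# Multiplying by a single number is the same mechanism with a length-1 vector
doubled <- long_vec * 2                # 2 4 6 8
```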
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
  • Use vectorized operations instead of loops.
+
+
+

Content from Functions Explained

+
+

Last updated on 2023-10-26

+

Estimated time 60 minutes

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
  • How can I write a new function in R?
+
+
+
+
+
+
+

Objectives

+
  • Define a function that takes arguments.
  • Return a value from a function.
  • Check argument conditions with stopifnot() in functions.
  • Test a function.
  • Set default values for function arguments.
  • Explain why we should divide programs into small, single-purpose functions.
+
+
+
+
+
+

If we only had one data set to analyze, it would probably be faster to load the file into a spreadsheet and use that to plot simple statistics. However, the gapminder data is updated periodically, and we may want to pull in that new information later and re-run our analysis. We may also obtain similar data from a different source in the future.

+

In this lesson, we’ll learn how to write a function so that we can +repeat several operations with a single command.

+
+
+ +
+
+

What is a function? +

+
+

Functions gather a sequence of operations into a whole, preserving it +for ongoing use. Functions provide:

+
  • a name we can remember and invoke it by
  • relief from the need to remember the individual operations
  • a defined set of inputs and expected outputs
  • rich connections to the larger programming environment
+

As the basic building block of most programming languages, +user-defined functions constitute “programming” as much as any single +abstraction can. If you have written a function, you are a computer +programmer.

+
+
+
+

Defining a function +

+
+

Let’s open a new R script file in the functions/ +directory and call it functions-lesson.R.

+

The general structure of a function is:

+
+

R +

+
+my_function <- function(parameters) {
+  # perform action
+  # return value
+}
+
+

Let’s define a function fahr_to_kelvin() that converts +temperatures from Fahrenheit to Kelvin:

+
+

R +

+
+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+

We define fahr_to_kelvin() by assigning it to the output of function. The list of argument names is contained within parentheses. Next, the body of the function (the statements that are executed when it runs) is contained within curly braces ({}). The statements in the body are indented by two spaces. This makes the code easier to read but does not affect how the code operates.

+

It is useful to think of creating functions like writing a recipe in a cookbook. First you define the “ingredients” that your function needs. In this case, we only need one ingredient to use our function: “temp”. After we list our ingredients, we then say what we will do with them. In this case, we are taking our ingredient and applying a set of mathematical operators to it.

+

When we call the function, the values we pass to it as arguments are +assigned to those variables so that we can use them inside the function. +Inside the function, we use a return statement to send a +result back to whoever asked for it.

+
+
+ +
+
+

Tip +

+
+

In R, the return statement is not required. R automatically returns the value of the last expression evaluated in the body of the function. But for clarity, we will explicitly define the return statement.

+
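To illustrate, here is a sketch of the same conversion written without return(). The name `fahr_to_kelvin_implicit` is made up for this example so that it does not overwrite the lesson's function.

```r
fahr_to_kelvin_implicit <- function(temp) {
  # The value of this last expression becomes the function's result
  ((temp - 32) * (5 / 9)) + 273.15
}

freezing <- fahr_to_kelvin_implicit(32)
freezing
```

Note that if the last line were an assignment such as `kelvin <- ...`, the value would still be returned, but invisibly, so nothing would be printed at the console. This is one reason an explicit return() can be clearer.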
+
+
+

Let’s try running our function. Calling our own function is no +different from calling any other function:

+
+

R +

+
+# freezing point of water
+fahr_to_kelvin(32)
+
+
+

OUTPUT +

+
[1] 273.15
+
+
+

R +

+
+# boiling point of water
+fahr_to_kelvin(212)
+
+
+

OUTPUT +

+
[1] 373.15
+
+
+
+ +
+
+

Challenge 1 +

+
+

Write a function called kelvin_to_celsius() that takes a +temperature in Kelvin and returns that temperature in Celsius.

+

Hint: To convert from Kelvin to Celsius you subtract 273.15.

+
+
+
+
+
+ +
+
+

Write a function called kelvin_to_celsius() that takes a temperature in Kelvin and returns that temperature in Celsius.

+
+

R +

+
+kelvin_to_celsius <- function(temp) {
+  celsius <- temp - 273.15
+  return(celsius)
+}
+
+
+
+
+
+

Combining functions +

+
+

The real power of functions comes from mixing, matching and combining +them into ever-larger chunks to get the effect we want.

+

Let’s define two functions that will convert temperature from +Fahrenheit to Kelvin, and Kelvin to Celsius:

+
+

R +

+
+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+kelvin_to_celsius <- function(temp) {
+  celsius <- temp - 273.15
+  return(celsius)
+}
+
+
+
+ +
+
+

Challenge 2 +

+
+

Define the function to convert directly from Fahrenheit to Celsius, +by reusing the two functions above (or using your own functions if you +prefer).

+
+
+
+
+
+ +
+
+

Define the function to convert directly from Fahrenheit to Celsius, by reusing the two functions above.

+
+

R +

+
+fahr_to_celsius <- function(temp) {
+  temp_k <- fahr_to_kelvin(temp)
+  result <- kelvin_to_celsius(temp_k)
+  return(result)
+}
+
+
+
+
+
+

Interlude: Defensive Programming +

+
+

Now that we’ve begun to appreciate how writing functions provides an +efficient way to make R code re-usable and modular, we should note that +it is important to ensure that functions only work in their intended +use-cases. Checking function parameters is related to the concept of +defensive programming. Defensive programming encourages us to +frequently check conditions and throw an error if something is wrong. +These checks are referred to as assertion statements because we want to +assert some condition is TRUE before proceeding. They make +it easier to debug because they give us a better idea of where the +errors originate.

+
+

Checking conditions with stopifnot() + +

+

Let’s start by re-examining fahr_to_kelvin(), our +function for converting temperatures from Fahrenheit to Kelvin. It was +defined like so:

+
+

R +

+
+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+

For this function to work as intended, the argument temp +must be a numeric value; otherwise, the mathematical +procedure for converting between the two temperature scales will not +work. To create an error, we can use the function stop(). +For example, since the argument temp must be a +numeric vector, we could check for this condition with an +if statement and throw an error if the condition was +violated. We could augment our function above like so:

+
+

R +

+
+fahr_to_kelvin <- function(temp) {
+  if (!is.numeric(temp)) {
+    stop("temp must be a numeric vector.")
+  }
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+

If we had multiple conditions or arguments to check, it would take many lines of code to check all of them. Luckily R provides the convenience function stopifnot(). We can list as many requirements as we like, each of which should evaluate to TRUE; stopifnot() throws an error if it finds one that is FALSE. Listing these conditions also serves a secondary purpose as extra documentation for the function.

+

Let’s try out defensive programming with stopifnot() by +adding assertions to check the input to our function +fahr_to_kelvin().

+

We want to assert the following: temp is a numeric +vector. We may do that like so:

+
+

R +

+
+fahr_to_kelvin <- function(temp) {
+  stopifnot(is.numeric(temp))
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+

It still works when given proper input.

+
+

R +

+
+# freezing point of water
+fahr_to_kelvin(temp = 32)
+
+
+

OUTPUT +

+
[1] 273.15
+
+

But fails instantly if given improper input.

+
+

R +

+
+# temp is a factor instead of numeric
+fahr_to_kelvin(temp = as.factor(32))
+
+
+

ERROR +

+
Error in fahr_to_kelvin(temp = as.factor(32)): is.numeric(temp) is not TRUE
+
+
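stopifnot() also accepts several conditions at once and stops at the first one that is not TRUE. The sketch below is only an illustration: the function name `fahr_to_kelvin_checked` and the two extra requirements (no missing values, nothing below absolute zero at -459.67 °F) are assumptions added for this example and are not part of the lesson's function.

```r
fahr_to_kelvin_checked <- function(temp) {
  stopifnot(
    is.numeric(temp),        # the lesson's original requirement
    !anyNA(temp),            # illustrative extra check: no missing values
    all(temp >= -459.67)     # illustrative extra check: not below absolute zero
  )
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

# Valid input passes every check
ok <- fahr_to_kelvin_checked(c(32, 212))

# Invalid input is caught; tryCatch() lets us inspect the error without stopping
too_cold <- tryCatch(fahr_to_kelvin_checked(-500), error = function(e) "error")
not_numeric <- tryCatch(fahr_to_kelvin_checked("32"), error = function(e) "error")
```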
+
+ +
+
+

Challenge 3 +

+
+

Use defensive programming to ensure that our +fahr_to_celsius() function throws an error immediately if +the argument temp is specified inappropriately.

+
+
+
+
+
+ +
+
+

Extend our previous definition of the function by adding in an +explicit call to stopifnot(). Since +fahr_to_celsius() is a composition of two other functions, +checking inside here makes adding checks to the two component functions +redundant.

+
+

R +

+
+fahr_to_celsius <- function(temp) {
+  stopifnot(is.numeric(temp))
+  temp_k <- fahr_to_kelvin(temp)
+  result <- kelvin_to_celsius(temp_k)
+  return(result)
+}
+
+
+
+
+
+
+

More on combining functions +

+
+

Now, we’re going to define a function that calculates the Gross +Domestic Product of a nation from the data available in our dataset:

+
+

R +

+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat) {
+  gdp <- dat$pop * dat$gdpPercap
+  return(gdp)
+}
+
+

We define calcGDP() by assigning it to the output of function. The list of argument names is contained within parentheses. Next, the body of the function (the statements executed when you call the function) is contained within curly braces ({}).

+

We’ve indented the statements in the body by two spaces. This makes +the code easier to read but does not affect how it operates.

+

When we call the function, the values we pass to it are assigned to +the arguments, which become variables inside the body of the +function.

+

Inside the function, we use the return() function to +send back the result. This return() function is optional: R +will automatically return the results of whatever command is executed on +the last line of the function.

+
+

R +

+
+calcGDP(head(gapminder))
+
+
+

OUTPUT +

+
[1]  6567086330  7585448670  8758855797  9648014150  9678553274 11697659231
+
+

That’s not very informative. Let’s add some more arguments so we can +extract that per year and country.

+
+

R +

+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year=NULL, country=NULL) {
+  if (!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country, ]
+  }
+  gdp <- dat$pop * dat$gdpPercap
+
+  new <- cbind(dat, gdp=gdp)
+  return(new)
+}
+
+

If you’ve been writing these functions down into a separate R script (a good idea!), you can load the functions into your R session by using the source() function:

+
+

R +

+
+source("functions/functions-lesson.R")
+
+

Ok, so there’s a lot going on in this function now. In plain English, +the function now subsets the provided data by year if the year argument +isn’t empty, then subsets the result by country if the country argument +isn’t empty. Then it calculates the GDP for whatever subset emerges from +the previous two steps. The function then adds the GDP as a new column +to the subsetted data and returns this as the final result. You can see +that the output is much more informative than a vector of numbers.

+

Let’s take a look at what happens when we specify the year:

+
+

R +

+
+head(calcGDP(gapminder, year=2007))
+
+
+

OUTPUT +

+
       country year      pop continent lifeExp  gdpPercap          gdp
+12 Afghanistan 2007 31889923      Asia  43.828   974.5803  31079291949
+24     Albania 2007  3600523    Europe  76.423  5937.0295  21376411360
+36     Algeria 2007 33333216    Africa  72.301  6223.3675 207444851958
+48      Angola 2007 12420476    Africa  42.731  4797.2313  59583895818
+60   Argentina 2007 40301927  Americas  75.320 12779.3796 515033625357
+72   Australia 2007 20434176   Oceania  81.235 34435.3674 703658358894
+
+

Or for a specific country:

+
+

R +

+
+calcGDP(gapminder, country="Australia")
+
+
+

OUTPUT +

+
     country year      pop continent lifeExp gdpPercap          gdp
+61 Australia 1952  8691212   Oceania  69.120  10039.60  87256254102
+62 Australia 1957  9712569   Oceania  70.330  10949.65 106349227169
+63 Australia 1962 10794968   Oceania  70.930  12217.23 131884573002
+64 Australia 1967 11872264   Oceania  71.100  14526.12 172457986742
+65 Australia 1972 13177000   Oceania  71.930  16788.63 221223770658
+66 Australia 1977 14074100   Oceania  73.490  18334.20 258037329175
+67 Australia 1982 15184200   Oceania  74.740  19477.01 295742804309
+68 Australia 1987 16257249   Oceania  76.320  21888.89 355853119294
+69 Australia 1992 17481977   Oceania  77.560  23424.77 409511234952
+70 Australia 1997 18565243   Oceania  78.830  26997.94 501223252921
+71 Australia 2002 19546792   Oceania  80.370  30687.75 599847158654
+72 Australia 2007 20434176   Oceania  81.235  34435.37 703658358894
+
+

Or both:

+
+

R +

+
+calcGDP(gapminder, year=2007, country="Australia")
+
+
+

OUTPUT +

+
     country year      pop continent lifeExp gdpPercap          gdp
+72 Australia 2007 20434176   Oceania  81.235  34435.37 703658358894
+
+

Let’s walk through the body of the function:

+
+

R +

+
calcGDP <- function(dat, year=NULL, country=NULL) {
+
+

Here we’ve added two arguments, year and country. We’ve set the default value of both to NULL using the = operator in the function definition. This means that those arguments will take on those values unless the user specifies otherwise.

+
+

R +

+
+  if (!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country, ]
+  }
+
+

Here, we check whether each additional argument is NULL, and whenever it is not NULL we overwrite the dataset stored in dat with the subset selected by that argument.

+

Building these conditionals into the function makes it more flexible +for later. Now, we can use it to calculate the GDP for:

+
  • The whole dataset;
  • A single year;
  • A single country;
  • A single combination of year and country.
+

By using %in% instead of ==, we can also give multiple years or countries to those arguments.

+
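The difference between `%in%` and `==` matters as soon as more than one value is requested. A minimal sketch with made-up vectors (`years` and `wanted` are hypothetical names): `%in%` asks, for each element, whether it appears anywhere in the set, whereas `==` compares position by position and recycles the shorter vector.

```r
years <- c(1952, 1957, 1962, 1967)
wanted <- c(1957, 1952)

# Membership test: TRUE for every year that appears in `wanted`
by_membership <- years %in% wanted   #  TRUE  TRUE FALSE FALSE

# Position-wise comparison against the recycled vector 1957, 1952, 1957, 1952
by_equality <- years == wanted       # FALSE FALSE FALSE FALSE
```

With `==` the rows for 1952 and 1957 would be silently dropped here, which is why the function uses `%in%`.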
+
+ +
+
+

Tip: Pass by value +

+
+

Functions in R almost always make copies of the data to operate on +inside of a function body. When we modify dat inside the +function we are modifying the copy of the gapminder dataset stored in +dat, not the original variable we gave as the first +argument.

+

This is called “pass-by-value” and it makes writing code much safer: +you can always be sure that whatever changes you make within the body of +the function, stay inside the body of the function.

+
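A small sketch of this behaviour, using a toy data frame and a hypothetical function `add_doubled()` that are not part of the lesson:

```r
add_doubled <- function(dat) {
  # This changes only the function's own copy of the data frame
  dat$doubled <- dat$value * 2
  return(dat)
}

original <- data.frame(value = 1:3)
modified <- add_doubled(original)

names(original)   # "value": the original is untouched
names(modified)   # "value" "doubled"
```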
+
+
+
+
+ +
+
+

Tip: Function scope +

+
+

Another important concept is scoping: any variables (or functions!) +you create or modify inside the body of a function only exist for the +lifetime of the function’s execution. When we call +calcGDP(), the variables dat, gdp +and new only exist inside the body of the function. Even if +we have variables of the same name in our interactive R session, they +are not modified in any way when executing a function.

+
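The following sketch illustrates scoping with a hypothetical function `compute_local()`. A variable called `gdp` exists in the session and another one with the same name is created inside the function; the two do not interfere.

```r
gdp <- "defined in the session"

compute_local <- function() {
  gdp <- 42        # local variable; exists only while the function runs
  return(gdp)
}

inside_value <- compute_local()   # 42
gdp                               # still "defined in the session"
```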
+
+
+
+

R +

+
  gdp <- dat$pop * dat$gdpPercap
+  new <- cbind(dat, gdp=gdp)
+  return(new)
+}
+
+

Finally, we calculated the GDP on our new subset, and created a new +data frame with that column added. This means when we call the function +later we can see the context for the returned GDP values, which is much +better than in our first attempt where we got a vector of numbers.

+
+
+ +
+
+

Challenge 4 +

+
+

Test out your GDP function by calculating the GDP for New Zealand in +1987. How does this differ from New Zealand’s GDP in 1952?

+
+
+
+
+
+ +
+
+
+

R +

+
+  calcGDP(gapminder, year = c(1952, 1987), country = "New Zealand")
+
+

GDP for New Zealand in 1987: 63050008703

+

GDP for New Zealand in 1952: 21058193787

+
+
+
+
+
+
+ +
+
+

Challenge 5 +

+
+

The paste() function can be used to combine text together, e.g.:

+
+

R +

+
+best_practice <- c("Write", "programs", "for", "people", "not", "computers")
+paste(best_practice, collapse=" ")
+
+
+

OUTPUT +

+
[1] "Write programs for people not computers"
+
+

Write a function called fence() that takes two vectors +as arguments, called text and wrapper, and +prints out the text wrapped with the wrapper:

+
+

R +

+
+fence(text=best_practice, wrapper="***")
+
+

Note: the paste() function has an argument called sep, which specifies the separator between text. The default is a space: " ". The default for paste0() is no space: "".

+
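To make the difference between `sep` and `collapse` concrete, here is a short sketch with a made-up vector `words`. `sep` goes between the arguments being pasted together, while `collapse` joins the elements of the resulting vector into one string.

```r
words <- c("Write", "programs")

with_sep <- paste(words, "!", sep = "")       # "Write!" "programs!"
with_paste0 <- paste0(words, "!")             # same result as sep = ""
default_sep <- paste("a", "b")                # "a b"
collapsed <- paste(words, collapse = " ")     # "Write programs"
```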
+
+
+
+
+ +
+
+


+
+

R +

+
+fence <- function(text, wrapper){
+  text <- c(wrapper, text, wrapper)
+  result <- paste(text, collapse = " ")
+  return(result)
+}
+best_practice <- c("Write", "programs", "for", "people", "not", "computers")
+fence(text=best_practice, wrapper="***")
+
+
+

OUTPUT +

+
[1] "*** Write programs for people not computers ***"
+
+
+
+
+
+
+
+ +
+
+

Tip +

+
+

R has some unique aspects that can be exploited when performing more +complicated operations. We will not be writing anything that requires +knowledge of these more advanced concepts. In the future when you are +comfortable writing functions in R, you can learn more by reading the R +Language Manual or this chapter from Advanced R Programming by Hadley +Wickham.

+
+
+
+
+
+ +
+
+

Tip: Testing and documenting +

+
+

It’s important to both test functions and document them: documentation helps you, and others, understand what the purpose of your function is and how to use it, and testing makes sure that your function actually does what you think.

+

When you first start out, your workflow will probably look a lot like +this:

+
  1. Write a function
  2. Comment parts of the function to document its behaviour
  3. Load in the source file
  4. Experiment with it in the console to make sure it behaves as you expect
  5. Make any necessary bug fixes
  6. Rinse and repeat.
+

Formal documentation for functions, written in separate .Rd files, gets turned into the documentation you see in help files. The roxygen2 package allows R coders to write documentation alongside the function code and then process it into the appropriate .Rd files. You will want to switch to this more formal method of writing documentation when you start writing more complicated R projects. In fact, packages are, in essence, bundles of functions with this formal documentation. Loading your own functions through source("functions.R") is similar to loading someone else’s functions (or your own one day!) through library("package").

+

Formal automated tests can be written using the testthat package.

+
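As a rough sketch of what this can look like, the function below carries roxygen2-style comments (`#'` lines with the `@param`, `@return` and `@examples` tags) and is followed by two informal checks written in base R. The wording of the documentation is illustrative only; in a real package the same checks would typically be written with testthat's `test_that()` and `expect_equal()`.

```r
#' Convert Fahrenheit to Kelvin
#'
#' @param temp Numeric vector of temperatures in degrees Fahrenheit.
#' @return Numeric vector of temperatures in Kelvin.
#' @examples
#' fahr_to_kelvin(32)
fahr_to_kelvin <- function(temp) {
  stopifnot(is.numeric(temp))
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

# Informal tests against two known reference points
# (freezing and boiling point of water)
stopifnot(isTRUE(all.equal(fahr_to_kelvin(32), 273.15)))
stopifnot(isTRUE(all.equal(fahr_to_kelvin(212), 373.15)))
```

all.equal() is used rather than == because floating-point results can differ from the expected value by a tiny rounding error.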
+
+
+
+
+ +
+
+

Keypoints +

+
+
  • Use function to define a new function in R.
  • Use parameters to pass values into functions.
  • Use stopifnot() to flexibly check function arguments in R.
  • Load functions into programs using source().
+
+
+
+

Content from Writing Data

+
+

Last updated on 2023-10-26

+

Estimated time 20 minutes

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
  • How can I save plots and data created in R?
+
+
+
+
+
+
+

Objectives

+
  • To be able to write out plots and data from R.
+
+
+
+
+
+

Saving plots +

+
+

You have already seen how to save the most recent plot you create in +ggplot2, using the command ggsave. As a +refresher:

+
+

R +

+
+ggsave("My_most_recent_plot.pdf")
+
+

You can save a plot from within RStudio using the ‘Export’ button in +the ‘Plot’ window. This will give you the option of saving as a .pdf or +as .png, .jpg or other image formats.

+

Sometimes you will want to save plots without creating them in the +‘Plot’ window first. Perhaps you want to make a pdf document with +multiple pages: each one a different plot, for example. Or perhaps +you’re looping through multiple subsets of a file, plotting data from +each subset, and you want to save each plot, but obviously can’t stop +the loop to click ‘Export’ for each one.

+

In this case you can use a more flexible approach. The function pdf creates a new pdf device. You can control the page size using the width and height arguments to this function.

+
+

R +

+
+pdf("Life_Exp_vs_time.pdf", width=12, height=4)
+ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=country)) +
+  geom_line() +
+  theme(legend.position = "none")
+
+# You then have to make sure to turn off the pdf device!
+
+dev.off()
+
+

Open up this document and have a look.

+
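The looping case mentioned earlier can be sketched as follows. To keep the example self-contained it uses a small made-up data frame (`toy`), base graphics instead of ggplot2, and a temporary file; all of these are assumptions for illustration. Each call to plot() starts a new page in the open pdf device.

```r
# Made-up data: two groups observed in three years
toy <- data.frame(
  group = rep(c("A", "B"), each = 3),
  year  = rep(c(2000, 2005, 2010), times = 2),
  value = c(1, 2, 3, 2, 4, 8)
)

out_file <- file.path(tempdir(), "plots_by_group.pdf")

pdf(out_file, width = 6, height = 4)
for (g in unique(toy$group)) {
  group_rows <- toy[toy$group == g, ]
  # One page per group
  plot(group_rows$year, group_rows$value, type = "b",
       main = paste("Group", g), xlab = "year", ylab = "value")
}
# Close the device so the file is written completely
invisible(dev.off())

file.exists(out_file)
```

When the plots are ggplot2 objects, remember that they are not printed automatically inside a loop; wrap each one in print() so that it is drawn on the device.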
+
+ +
+
+

Challenge 1 +

+
+

Rewrite your ‘pdf’ command to print a second page in the pdf, showing +a facet plot (hint: use facet_grid) of the same data with +one panel per continent.

+
+
+
+
+
+ +
+
+
+

R +

+
+pdf("Life_Exp_vs_time.pdf", width = 12, height = 4)
+p <- ggplot(data = gapminder, aes(x = year, y = lifeExp, colour = country)) +
+  geom_line() +
+  theme(legend.position = "none")
+p
+p + facet_grid(~continent)
+dev.off()
+
+
+
+
+
+

The commands jpeg, png etc. are used +similarly to produce documents in different formats.

+

Writing data +

+
+

At some point, you’ll also want to write out data from R.

+

We can use the write.table function for this, which is +very similar to read.table from before.

+

Let’s create a data-cleaning script. For this analysis, we only want to focus on the gapminder data for Australia:

+
+

R +

+
+aust_subset <- gapminder[gapminder$country == "Australia",]
+
+write.table(aust_subset,
+  file="cleaned-data/gapminder-aus.csv",
+  sep=","
+)
+
+

Let’s switch back to the shell to take a look at the data to make +sure it looks OK:

+
+

BASH +

+
head cleaned-data/gapminder-aus.csv
+
+
+

OUTPUT +

+
"country","year","pop","continent","lifeExp","gdpPercap"
+"61","Australia",1952,8691212,"Oceania",69.12,10039.59564
+"62","Australia",1957,9712569,"Oceania",70.33,10949.64959
+"63","Australia",1962,10794968,"Oceania",70.93,12217.22686
+"64","Australia",1967,11872264,"Oceania",71.1,14526.12465
+"65","Australia",1972,13177000,"Oceania",71.93,16788.62948
+"66","Australia",1977,14074100,"Oceania",73.49,18334.19751
+"67","Australia",1982,15184200,"Oceania",74.74,19477.00928
+"68","Australia",1987,16257249,"Oceania",76.32,21888.88903
+"69","Australia",1992,17481977,"Oceania",77.56,23424.76683
+
+

Hmm, that’s not quite what we wanted. Where did all these quotation +marks come from? Also the row numbers are meaningless.

+

Let’s look at the help file to work out how to change this +behaviour.

+
+

R +

+
+?write.table
+
+

By default R will wrap character vectors with quotation marks when +writing out to file. It will also write out the row and column +names.

+

Let’s fix this:

+
+

R +

+
+write.table(
+  gapminder[gapminder$country == "Australia",],
+  file="cleaned-data/gapminder-aus.csv",
+  sep=",", quote=FALSE, row.names=FALSE
+)
+
+

Now let’s look at the data again using our shell skills:

+
+

BASH +

+
head cleaned-data/gapminder-aus.csv
+
+
+

OUTPUT +

+
country,year,pop,continent,lifeExp,gdpPercap
+Australia,1952,8691212,Oceania,69.12,10039.59564
+Australia,1957,9712569,Oceania,70.33,10949.64959
+Australia,1962,10794968,Oceania,70.93,12217.22686
+Australia,1967,11872264,Oceania,71.1,14526.12465
+Australia,1972,13177000,Oceania,71.93,16788.62948
+Australia,1977,14074100,Oceania,73.49,18334.19751
+Australia,1982,15184200,Oceania,74.74,19477.00928
+Australia,1987,16257249,Oceania,76.32,21888.88903
+Australia,1992,17481977,Oceania,77.56,23424.76683
+
+

That looks better!

+
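The output can also be checked from within R by reading the file back in. The sketch below uses a two-row toy data frame (with values copied from the output above) and a temporary file rather than the lesson's `cleaned-data/` directory. It also uses write.csv(), a wrapper around write.table() that sets the comma separator for us.

```r
toy_aus <- data.frame(
  country = c("Australia", "Australia"),
  year = c(1952, 1957),
  pop = c(8691212, 9712569)
)

out_file <- file.path(tempdir(), "toy-aus.csv")

# Same options as before: no quotation marks, no row names
write.csv(toy_aus, file = out_file, quote = FALSE, row.names = FALSE)

# Inspect the raw lines, then read the file back into a data frame
raw_lines <- readLines(out_file)
round_trip <- read.csv(out_file)
raw_lines
```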
+
+ +
+
+

Challenge 2 +

+
+

Write a data-cleaning script file that subsets the gapminder data to +include only data points collected since 1990.

+

Use this script to write out the new subset to a file in the +cleaned-data/ directory.

+
+
+
+
+
+ +
+
+
+

R +

+
+write.table(
+  gapminder[gapminder$year > 1990, ],
+  file = "cleaned-data/gapminder-after1990.csv",
+  sep = ",", quote = FALSE, row.names = FALSE
+)
+
+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
  • Save plots from RStudio using the ‘Export’ button.
  • Use write.table to save tabular data.
+
+
+
+

Content from Splitting and Combining Data Frames with plyr

+
+

Last updated on 2023-10-26

+

Estimated time 60 minutes

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
  • How can I do different calculations on different sets of data?
+
+
+
+
+
+
+

Objectives

+
  • To be able to use the split-apply-combine strategy for data analysis.
+
+
+
+
+
+

Previously we looked at how you can use functions to simplify your code. We defined the calcGDP function, which takes the gapminder dataset and multiplies the population and GDP per capita columns. We also defined additional arguments so we could filter by year and country:

+
+

R +

+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year=NULL, country=NULL) {
+  if (!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country, ]
+  }
+  gdp <- dat$pop * dat$gdpPercap
+
+  new <- cbind(dat, gdp=gdp)
+  return(new)
+}
+
+

A common task you’ll encounter when working with data is running calculations on different groups within the data. In the above, we were calculating the GDP by multiplying two columns together. But what if we wanted to calculate the mean GDP per continent?

+

We could run calcGDP and then take the mean of each +continent:

+
+

R +

+
+withGDP <- calcGDP(gapminder)
+mean(withGDP[withGDP$continent == "Africa", "gdp"])
+
+
+

OUTPUT +

+
[1] 20904782844
+
+
+

R +

+
+mean(withGDP[withGDP$continent == "Americas", "gdp"])
+
+
+

OUTPUT +

+
[1] 379262350210
+
+
+

R +

+
+mean(withGDP[withGDP$continent == "Asia", "gdp"])
+
+
+

OUTPUT +

+
[1] 227233738153
+
+

But this isn’t very nice. Yes, by using a function, you have +reduced a substantial amount of repetition. That is +nice. But there is still repetition. Repeating yourself will cost you +time, both now and later, and potentially introduce some nasty bugs.

+

We could write a new function that is flexible like +calcGDP, but this also takes a substantial amount of effort +and testing to get right.

+

The abstract problem we’re encountering here is known as +“split-apply-combine”:

+
Split apply combine

We want to split our data into groups, in this case +continents, apply some calculations on that group, then +optionally combine the results together afterwards.
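+

As a minimal sketch of the three steps in base R, here is the pattern applied to a small made-up data frame instead of the gapminder data. The names toy, pieces, means and result are only for illustration:

```r
# Toy data standing in for gapminder (values are made up for illustration)
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia", "Europe"),
  gdp       = c(10, 30, 100, 300, 50)
)

# Split: one vector of gdp values per continent
pieces <- split(toy$gdp, toy$continent)

# Apply: compute the mean of each piece
means <- lapply(pieces, mean)

# Combine: collapse the list of results into a single named vector
result <- unlist(means)
result
```

Each step is a separate line of code here. The functions introduced next bundle all three steps into a single call.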

+

The plyr package +

+
+

For those of you who have used R before, you might be familiar with +the apply family of functions. While R’s built in functions +do work, we’re going to introduce you to another method for solving the +“split-apply-combine” problem. The plyr package provides a set of +functions that we find more user friendly for solving this problem.

+

We installed this package in an earlier challenge. Let us load it +now:

+
+

R +

+
+library("plyr")
+
+

Plyr has functions for operating on lists, +data.frames and arrays (matrices, or +n-dimensional vectors). Each function performs:

+
    +
  1. A splitting operation
  2. +
  3. +Apply a function on each split in turn.
  4. +
  5. Recombine output data as a single data object.
  6. +
+

The functions are named based on the data structure they expect as +input, and the data structure you want returned as output: [a]rray, +[l]ist, or [d]ata.frame. The first letter corresponds to the input data +structure, the second letter to the output data structure, and then the +rest of the function is named “ply”.

+

This gives us 9 core functions **ply. There are an additional three +functions which will only perform the split and apply steps, and not any +combine step. They’re named by their input data type and represent null +output by a _ (see table)

+

Note here that plyr’s use of “array” is different to R’s: an array in +plyr can include a vector or matrix.

+
Full apply suite
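+

The naming scheme can be spelled out with a few lines of base R. This sketch only builds the twelve function names from the input and output letters described above; it does not call plyr itself, and the variable names are our own:

```r
# Input types: [a]rray, [d]ata.frame, [l]ist
inputs <- c("a", "d", "l")
# Output types: the same three, plus "_" for "discard the output"
outputs <- c("a", "d", "l", "_")

# Build every combination of <input><output>ply
ply_names <- outer(inputs, outputs, function(i, o) paste0(i, o, "ply"))
dimnames(ply_names) <- list(input = inputs, output = outputs)
ply_names
```

Reading across the row for input d gives the data.frame functions, including ddply, which we use below.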

Each of the xxply functions (daply, ddply, +llply, laply, …) has the same structure and +4 key features:

+
+

R +

+
+xxply(.data, .variables, .fun)
+
+
    +
  • The first letter of the function name gives the input type and the +second gives the output type.
  • +
  • .data - gives the data object to be processed
  • +
  • .variables - identifies the splitting variables
  • +
  • .fun - gives the function to be called on each piece
  • +
+

Now we can quickly calculate the mean GDP per continent:

+
+

R +

+
+ddply(
+ .data = calcGDP(gapminder),
+ .variables = "continent",
+ .fun = function(x) mean(x$gdp)
+)
+
+
+

OUTPUT +

+
  continent           V1
+1    Africa  20904782844
+2  Americas 379262350210
+3      Asia 227233738153
+4    Europe 269442085301
+5   Oceania 188187105354
+
+

Let us walk through the previous code:

+
    +
  • The ddply function feeds in a data.frame +(function starts with d) and returns another +data.frame (2nd letter is a d)
  • +
  • The first argument we gave was the data.frame we wanted to operate +on: in this case the gapminder data. We called calcGDP on +it first so that it would have the additional gdp column +added to it.
  • +
  • The second argument indicated our split criteria: in this case the +“continent” column. Note that we gave the name of the column, not the +values of the column like we had done previously with subsetting. Plyr +takes care of these implementation details for you.
  • +
  • The third argument is the function we want to apply to each grouping +of the data. We had to define our own short function here: each subset +of the data gets stored in x, the first argument of our +function. This is an anonymous function: we haven’t defined it +elsewhere, and it has no name. It only exists in the scope of our call +to ddply.
  • +
+
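
To make the idea of an anonymous function more concrete, the sketch below runs ddply twice on a small made-up data frame: once with a named function and once with the same function written inline. It assumes plyr is installed; toy and mean_gdp are illustrative names that do not appear elsewhere in the lesson:

```r
library("plyr")  # assumes plyr is installed

# Toy data standing in for calcGDP(gapminder); values are made up
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  gdp       = c(10, 30, 100, 300)
)

# A named function that does the same job as the anonymous one
mean_gdp <- function(x) mean(x$gdp)

by_named <- ddply(.data = toy, .variables = "continent", .fun = mean_gdp)
by_anon  <- ddply(.data = toy, .variables = "continent",
                  .fun = function(x) mean(x$gdp))

# Both calls should give the same result
all.equal(by_named, by_anon)
```

A named function is worth defining when you need the same calculation in several places; for a one-off, the anonymous form keeps the code close to where it is used.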
+
+ +
+
+

Challenge 1 +

+
+

Calculate the average life expectancy per continent. Which has the +longest? Which has the shortest?

+
+
+
+
+
+ +
+
+
+

R +

+
+ddply(
+ .data = gapminder,
+ .variables = "continent",
+ .fun = function(x) mean(x$lifeExp)
+)
+
+

Oceania has the longest and Africa the shortest.

+
+
+
+
+

What if we want a different type of output data structure?

+
+

R +

+
+dlply(
+ .data = calcGDP(gapminder),
+ .variables = "continent",
+ .fun = function(x) mean(x$gdp)
+)
+
+
+

OUTPUT +

+
$Africa
+[1] 20904782844
+
+$Americas
+[1] 379262350210
+
+$Asia
+[1] 227233738153
+
+$Europe
+[1] 269442085301
+
+$Oceania
+[1] 188187105354
+
+attr(,"split_type")
+[1] "data.frame"
+attr(,"split_labels")
+  continent
+1    Africa
+2  Americas
+3      Asia
+4    Europe
+5   Oceania
+
+

We called the same function again, but changed the second letter to +an l, so the output was returned as a list.

+

We can specify multiple columns to group by:

+
+

R +

+
+ddply(
+ .data = calcGDP(gapminder),
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$gdp)
+)
+
+
+

OUTPUT +

+
   continent year           V1
+1     Africa 1952   5992294608
+2     Africa 1957   7359188796
+3     Africa 1962   8784876958
+4     Africa 1967  11443994101
+5     Africa 1972  15072241974
+6     Africa 1977  18694898732
+7     Africa 1982  22040401045
+8     Africa 1987  24107264108
+9     Africa 1992  26256977719
+10    Africa 1997  30023173824
+11    Africa 2002  35303511424
+12    Africa 2007  45778570846
+13  Americas 1952 117738997171
+14  Americas 1957 140817061264
+15  Americas 1962 169153069442
+16  Americas 1967 217867530844
+17  Americas 1972 268159178814
+18  Americas 1977 324085389022
+19  Americas 1982 363314008350
+20  Americas 1987 439447790357
+21  Americas 1992 489899820623
+22  Americas 1997 582693307146
+23  Americas 2002 661248623419
+24  Americas 2007 776723426068
+25      Asia 1952  34095762661
+26      Asia 1957  47267432088
+27      Asia 1962  60136869012
+28      Asia 1967  84648519224
+29      Asia 1972 124385747313
+30      Asia 1977 159802590186
+31      Asia 1982 194429049919
+32      Asia 1987 241784763369
+33      Asia 1992 307100497486
+34      Asia 1997 387597655323
+35      Asia 2002 458042336179
+36      Asia 2007 627513635079
+37    Europe 1952  84971341466
+38    Europe 1957 109989505140
+39    Europe 1962 138984693095
+40    Europe 1967 173366641137
+41    Europe 1972 218691462733
+42    Europe 1977 255367522034
+43    Europe 1982 279484077072
+44    Europe 1987 316507473546
+45    Europe 1992 342703247405
+46    Europe 1997 383606933833
+47    Europe 2002 436448815097
+48    Europe 2007 493183311052
+49   Oceania 1952  54157223944
+50   Oceania 1957  66826828013
+51   Oceania 1962  82336453245
+52   Oceania 1967 105958863585
+53   Oceania 1972 134112109227
+54   Oceania 1977 154707711162
+55   Oceania 1982 176177151380
+56   Oceania 1987 209451563998
+57   Oceania 1992 236319179826
+58   Oceania 1997 289304255183
+59   Oceania 2002 345236880176
+60   Oceania 2007 403657044512
+
+
+

R +

+
+daply(
+ .data = calcGDP(gapminder),
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$gdp)
+)
+
+
+

OUTPUT +

+
          year
+continent          1952         1957         1962         1967         1972
+  Africa     5992294608   7359188796   8784876958  11443994101  15072241974
+  Americas 117738997171 140817061264 169153069442 217867530844 268159178814
+  Asia      34095762661  47267432088  60136869012  84648519224 124385747313
+  Europe    84971341466 109989505140 138984693095 173366641137 218691462733
+  Oceania   54157223944  66826828013  82336453245 105958863585 134112109227
+          year
+continent          1977         1982         1987         1992         1997
+  Africa    18694898732  22040401045  24107264108  26256977719  30023173824
+  Americas 324085389022 363314008350 439447790357 489899820623 582693307146
+  Asia     159802590186 194429049919 241784763369 307100497486 387597655323
+  Europe   255367522034 279484077072 316507473546 342703247405 383606933833
+  Oceania  154707711162 176177151380 209451563998 236319179826 289304255183
+          year
+continent          2002         2007
+  Africa    35303511424  45778570846
+  Americas 661248623419 776723426068
+  Asia     458042336179 627513635079
+  Europe   436448815097 493183311052
+  Oceania  345236880176 403657044512
+
+

You can use these functions in place of for loops (and +it is usually faster to do so). To replace a for loop, put the code that +was in the body of the for loop inside an anonymous +function.

+
+

R +

+
+d_ply(
+  .data=gapminder,
+  .variables = "continent",
+  .fun = function(x) {
+    meanGDPperCap <- mean(x$gdpPercap)
+    print(paste(
+      "The mean GDP per capita for", unique(x$continent),
+      "is", format(meanGDPperCap, big.mark=",")
+   ))
+  }
+)
+
+
+

OUTPUT +

+
[1] "The mean GDP per capita for Africa is 2,193.755"
+[1] "The mean GDP per capita for Americas is 7,136.11"
+[1] "The mean GDP per capita for Asia is 7,902.15"
+[1] "The mean GDP per capita for Europe is 14,469.48"
+[1] "The mean GDP per capita for Oceania is 18,621.61"
+
+
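
For comparison, here is a sketch of the kind of for loop that the d_ply call above replaces. It is written in base R against a small made-up data frame (toy), and it also stores each message in a vector (messages) so the results can be inspected afterwards; both names are illustrative:

```r
# Toy data standing in for gapminder; values are made up
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  gdpPercap = c(1000, 3000, 5000, 7000)
)

# Collect the messages so we can inspect them afterwards
messages <- character(0)

for (cont in unique(toy$continent)) {
  # Subset by hand: this is the "split" step that plyr does for us
  x <- toy[toy$continent == cont, ]
  meanGDPperCap <- mean(x$gdpPercap)
  msg <- paste(
    "The mean GDP per capita for", cont,
    "is", format(meanGDPperCap, big.mark = ",")
  )
  print(msg)
  messages <- c(messages, msg)
}
```

The body of the loop is the same code that went inside the anonymous function; only the subsetting has to be written out by hand.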
+
+ +
+
+

Tip: printing numbers +

+
+

The format function can be used to make numeric values +“pretty” for printing out in messages.

+
+
+
+
+
+ +
+
+

Challenge 2 +

+
+

Calculate the average life expectancy per continent and year. Which +had the longest and shortest in 2007? Which had the greatest change +between 1952 and 2007?

+
+
+
+
+
+ +
+
+
+

R +

+
+solution <- ddply(
+ .data = gapminder,
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$lifeExp)
+)
+solution_2007 <- solution[solution$year == 2007, ]
+solution_2007
+
+

Oceania had the longest average life expectancy in 2007 and Africa +the lowest.

+
+

R +

+
+solution_1952_2007 <- cbind(solution[solution$year == 1952, ], solution_2007)
+difference_1952_2007 <- data.frame(continent = solution_1952_2007$continent,
+                                   year_1952 = solution_1952_2007[[3]],
+                                   year_2007 = solution_1952_2007[[6]],
+                                   difference = solution_1952_2007[[6]] - solution_1952_2007[[3]])
+difference_1952_2007
+
+

Asia had the greatest difference, and Oceania the least.

+
+
+
+
+
+
+ +
+
+

Alternate Challenge +

+
+

Without running them, which of the following will calculate the +average life expectancy per continent:

+
  1. +
+
+

R +

+
+ddply(
+  .data = gapminder,
+  .variables = gapminder$continent,
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)
+
+
  2. +
+
+

R +

+
+ddply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = mean(dataGroup$lifeExp)
+)
+
+
  3. +
+
+

R +

+
+ddply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)
+
+
  4. +
+
+

R +

+
+adply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)
+
+
+
+
+
+
+ +
+
+

Answer 3 will calculate the average life expectancy per +continent.

+
+
+
+
+
+
+ +
+
+

Keypoints +

+
+
    +
  • Use the plyr package to split data, apply functions to +subsets, and combine the results.
  • +
+
+
+
+

Content from Data Frame Manipulation with dplyr

+
+

Last updated on 2023-10-26

+

Estimated time 55 minutes

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I manipulate data frames without repeating myself?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • To be able to use the six main data frame manipulation ‘verbs’ with +pipes in dplyr.
  • +
  • To understand how group_by() and +summarize() can be combined to summarize datasets.
  • +
  • Be able to analyze a subset of data using logical filtering.
  • +
+
+
+
+
+
+

Manipulation of data frames means many things to many researchers: we +often select certain observations (rows) or variables (columns), we +often group the data by a certain variable(s), or we even calculate +summary statistics. We can do these operations using the normal base R +operations:

+
+

R +

+
+mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
+
+
+

OUTPUT +

+
[1] 2193.755
+
+
+

R +

+
+mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])
+
+
+

OUTPUT +

+
[1] 7136.11
+
+
+

R +

+
+mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])
+
+
+

OUTPUT +

+
[1] 7902.15
+
+

But this isn’t very nice because there is a fair bit of +repetition. Repeating yourself will cost you time, both now and later, +and potentially introduce some nasty bugs.

+

The dplyr package +

+
+

Luckily, the dplyr +package provides a number of very useful functions for manipulating data +frames in a way that will reduce the above repetition, reduce the +probability of making errors, and probably even save you some typing. As +an added bonus, you might even find the dplyr grammar +easier to read.

+
+
+ +
+
+

Tip: Tidyverse +

+
+

The dplyr package belongs to a broader family of opinionated +R packages designed for data science called the “Tidyverse”. These +packages are specifically designed to work harmoniously together. Some +of these packages will be covered in this course, but you can find +more complete information here: https://www.tidyverse.org/.

+
+
+
+

Here we’re going to cover 5 of the most commonly used functions as +well as using pipes (%>%) to combine them.

+
    +
  1. select()
  2. +
  3. filter()
  4. +
  5. group_by()
  6. +
  7. summarize()
  8. +
  9. mutate()
  10. +
+

If you have not installed this package earlier, please do +so:

+
+

R +

+
+install.packages('dplyr')
+
+

Now let’s load the package:

+
+

R +

+
+library("dplyr")
+
+

Using select() +

+
+

If, for example, we wanted to move forward with only a few of the +variables in our data frame we could use the select() +function. This will keep only the variables you select.

+
+

R +

+
+year_country_gdp <- select(gapminder, year, country, gdpPercap)
+
+

Diagram illustrating use of select function to select two columns of a data frame +We can also remove a single column from the gapminder +data, for example the continent column:

+
+

R +

+
+smaller_gapminder_data <- select(gapminder, -continent)
+
+

If we open up year_country_gdp we’ll see that it only +contains the year, country and gdpPercap. Above we used ‘normal’ +grammar, but the strengths of dplyr lie in combining +several functions using pipes. Since the pipes grammar is unlike +anything we’ve seen in R before, let’s repeat what we’ve done above +using pipes.

+
+

R +

+
+year_country_gdp <- gapminder %>% select(year, country, gdpPercap)
+
+

To help you understand why we wrote that in that way, let’s walk +through it step by step. First we summon the gapminder data frame and +pass it on, using the pipe symbol %>%, to the next step, +which is the select() function. In this case we don’t +specify which data object we use in the select() function +since it gets that from the previous pipe. Fun Fact: +There is a good chance you have encountered pipes before in the shell. +In R, a pipe symbol is %>% while in the shell it is +| but the concept is the same!
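+

To see that the pipe only changes how the call is written, the sketch below runs the same select() both ways on a small made-up data frame and checks that the results match. It assumes dplyr is installed; toy, nested and piped are illustrative names:

```r
library("dplyr")  # assumes dplyr is installed

# Toy data standing in for gapminder; values are made up
toy <- data.frame(
  year    = c(1952, 1957),
  country = c("A", "B"),
  pop     = c(1, 2)
)

# 'Normal' grammar: the data frame is the first argument
nested <- select(toy, year, country)

# Pipe grammar: the data frame is passed in from the left
piped <- toy %>% select(year, country)

identical(nested, piped)
```

In other words, x %>% f(y) is another way of writing f(x, y).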

+
+
+ +
+
+

Tip: Renaming data frame columns in dplyr +

+
+

In Chapter 4 we covered how you can rename columns with base R by +assigning a value to the output of the names() function. +Just like select, this is a bit cumbersome, but thankfully dplyr has a +rename() function.

+

Within a pipeline, the syntax is +rename(new_name = old_name). For example, we may want to +rename the gdpPercap column name from our select() +statement above.

+
+

R +

+
+tidy_gdp <- year_country_gdp %>% rename(gdp_per_capita = gdpPercap)
+
+head(tidy_gdp)
+
+
+

OUTPUT +

+
  year     country gdp_per_capita
+1 1952 Afghanistan       779.4453
+2 1957 Afghanistan       820.8530
+3 1962 Afghanistan       853.1007
+4 1967 Afghanistan       836.1971
+5 1972 Afghanistan       739.9811
+6 1977 Afghanistan       786.1134
+
+
+
+
+

Using filter() +

+
+

If we now want to move forward with the above, but only with European +countries, we can combine select and +filter:

+
+

R +

+
+year_country_gdp_euro <- gapminder %>%
+    filter(continent == "Europe") %>%
+    select(year, country, gdpPercap)
+
+

If we now want to show life expectancy of European countries but only +for a specific year (e.g., 2007), we can do as below.

+
+

R +

+
+europe_lifeExp_2007 <- gapminder %>%
+  filter(continent == "Europe", year == 2007) %>%
+  select(country, lifeExp)
+
+
+
+ +
+
+

Challenge 1 +

+
+

Write a single command (which can span multiple lines and includes +pipes) that will produce a data frame that has the African values for +lifeExp, country and year, but +not for other Continents. How many rows does your data frame have and +why?

+
+
+
+
+
+ +
+
+
+

R +

+
+year_country_lifeExp_Africa <- gapminder %>%
+                           filter(continent == "Africa") %>%
+                           select(year, country, lifeExp)
+
+
+
+
+
+

As with last time, first we pass the gapminder data frame to the +filter() function, then we pass the filtered version of the +gapminder data frame to the select() function. +Note: The order of operations is very important in this +case. If we used ‘select’ first, filter would not be able to find the +variable continent since we would have removed it in the previous +step.
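+

The effect of the order can be checked on a small made-up data frame. In this sketch, which assumes dplyr is installed, the second pipeline is wrapped in tryCatch() so that the error is caught and replaced by a short message of our own; toy, africa_ok and wrong_order are illustrative names:

```r
library("dplyr")  # assumes dplyr is installed

# Toy data standing in for gapminder; values are made up
toy <- data.frame(
  continent = c("Africa", "Europe"),
  country   = c("Kenya", "France"),
  lifeExp   = c(50, 80)
)

# filter() first: the continent column is still available
africa_ok <- toy %>%
  filter(continent == "Africa") %>%
  select(country, lifeExp)

# select() first: continent has been dropped, so filter() fails
wrong_order <- tryCatch(
  toy %>%
    select(country, lifeExp) %>%
    filter(continent == "Africa"),
  error = function(e) "filter() could not find the continent column"
)

africa_ok
wrong_order
```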

+

Using group_by() +

+
+

Now, we were supposed to be reducing the error prone repetitiveness +of what can be done with base R, but up to now we haven’t done that +since we would have to repeat the above for each continent. Instead of +filter(), which will only pass observations that meet your +criteria (in the above: continent=="Europe"), we can use +group_by(), which will essentially use every unique +criterion that you could have used in filter.

+
+

R +

+
+str(gapminder)
+
+
+

OUTPUT +

+
'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...
+
+
+

R +

+
+str(gapminder %>% group_by(continent))
+
+
+

OUTPUT +

+
gropd_df [1,704 × 6] (S3: grouped_df/tbl_df/tbl/data.frame)
+ $ country  : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num [1:1704] 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr [1:1704] "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
+ - attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
+  ..$ continent: chr [1:5] "Africa" "Americas" "Asia" "Europe" ...
+  ..$ .rows    : list<int> [1:5] 
+  .. ..$ : int [1:624] 25 26 27 28 29 30 31 32 33 34 ...
+  .. ..$ : int [1:300] 49 50 51 52 53 54 55 56 57 58 ...
+  .. ..$ : int [1:396] 1 2 3 4 5 6 7 8 9 10 ...
+  .. ..$ : int [1:360] 13 14 15 16 17 18 19 20 21 22 ...
+  .. ..$ : int [1:24] 61 62 63 64 65 66 67 68 69 70 ...
+  .. ..@ ptype: int(0) 
+  ..- attr(*, ".drop")= logi TRUE
+
+

You will notice that the structure of the data frame where we used +group_by() (grouped_df) is not the same as the +original gapminder (data.frame). A +grouped_df can be thought of as a list where +each item in the list is a data.frame which +contains only the rows that correspond to a particular value of +continent (at least in the example above).

+
Diagram illustrating how the group by function organizes a data frame into groups
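+

If the str() output is hard to read, dplyr has a few helper functions that report the grouping directly. The sketch below uses a small made-up data frame and assumes dplyr is installed; toy and grouped are illustrative names:

```r
library("dplyr")  # assumes dplyr is installed

# Toy data standing in for gapminder; values are made up
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Europe"),
  lifeExp   = c(50, 54, 70, 80)
)

grouped <- toy %>% group_by(continent)

group_vars(grouped)  # which column(s) define the groups
n_groups(grouped)    # how many groups there are
group_size(grouped)  # how many rows fall in each group

# Grouping does not add or remove any rows
nrow(grouped) == nrow(toy)
```

Note that the rows themselves are untouched: group_by() only records which rows belong together, ready for the next step.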

Using summarize() +

+
+

The above was a bit on the uneventful side but +group_by() is much more exciting in conjunction with +summarize(). This will allow us to create new variable(s) +by using functions that repeat for each of the continent-specific data +frames. That is to say, using the group_by() function, we +split our original data frame into multiple pieces, then we can run +functions (e.g. mean() or sd()) within +summarize().

+
+

R +

+
+gdp_bycontinents <- gapminder %>%
+    group_by(continent) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap))
+
+
Diagram illustrating the use of group by and summarize together to create a new variable
+

OUTPUT +

+
continent mean_gdpPercap
+     <fctr>          <dbl>
+1    Africa       2193.755
+2  Americas       7136.110
+3      Asia       7902.150
+4    Europe      14469.476
+5   Oceania      18621.609
+
+

That allowed us to calculate the mean gdpPercap for each continent, +but it gets even better.

+
+
+ +
+
+

Challenge 2 +

+
+

Calculate the average life expectancy per country. Which has the +longest average life expectancy and which has the shortest average life +expectancy?

+
+
+
+
+
+ +
+
+
+

R +

+
+lifeExp_bycountry <- gapminder %>%
+   group_by(country) %>%
+   summarize(mean_lifeExp = mean(lifeExp))
+lifeExp_bycountry %>%
+   filter(mean_lifeExp == min(mean_lifeExp) | mean_lifeExp == max(mean_lifeExp))
+
+
+

OUTPUT +

+
# A tibble: 2 × 2
+  country      mean_lifeExp
+  <chr>               <dbl>
+1 Iceland              76.5
+2 Sierra Leone         36.8
+
+

Another way to do this is to use the dplyr function +arrange(), which arranges the rows in a data frame +according to the order of one or more variables from the data frame. It +has similar syntax to other functions from the dplyr +package. You can use desc() inside arrange() +to sort in descending order.

+
+

R +

+
+lifeExp_bycountry %>%
+   arrange(mean_lifeExp) %>%
+   head(1)
+
+
+

OUTPUT +

+
# A tibble: 1 × 2
+  country      mean_lifeExp
+  <chr>               <dbl>
+1 Sierra Leone         36.8
+
+
+

R +

+
+lifeExp_bycountry %>%
+   arrange(desc(mean_lifeExp)) %>%
+   head(1)
+
+
+

OUTPUT +

+
# A tibble: 1 × 2
+  country mean_lifeExp
+  <chr>          <dbl>
+1 Iceland         76.5
+
+

Alphabetical order works too

+
+

R +

+
+lifeExp_bycountry %>%
+   arrange(desc(country)) %>%
+   head(1)
+
+
+

OUTPUT +

+
# A tibble: 1 × 2
+  country  mean_lifeExp
+  <chr>           <dbl>
+1 Zimbabwe         52.7
+
+
+
+
+
+

The function group_by() allows us to group by multiple +variables. Let’s group by year and +continent.

+
+

R +

+
+gdp_bycontinents_byyear <- gapminder %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
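
The message above is informational rather than an error: after summarizing by two variables, the result is still grouped by the first one. The sketch below, on a small made-up data frame, shows the default behaviour next to .groups = "drop", which returns an ungrouped result. It assumes dplyr 1.0.0 or later is installed; toy, still_grouped and ungrouped are illustrative names:

```r
library("dplyr")  # assumes dplyr >= 1.0.0 is installed

# Toy data standing in for gapminder; values are made up
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  year      = c(1952, 2007, 1952, 2007),
  gdpPercap = c(1000, 2000, 3000, 6000)
)

# Default: the last grouping variable (year) is dropped, continent is kept
still_grouped <- toy %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap))

# .groups = "drop" removes all grouping from the result
ungrouped <- toy %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap), .groups = "drop")

group_vars(still_grouped)
group_vars(ungrouped)
```

+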

That is already quite powerful, but it gets even better! You’re not +limited to defining 1 new variable in summarize().

+
+

R +

+
+gdp_pop_bycontinents_byyear <- gapminder %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+

count() and n() +

+
+

A very common operation is to count the number of observations for +each group. The dplyr package comes with two related +functions that help with this.

+

For instance, if we want to check the number of countries included +in the dataset for the year 2002, we can use the count() +function. It takes the name of one or more columns that contain the +groups we are interested in, and we can optionally sort the results in +descending order by adding sort=TRUE:

+
+

R +

+
+gapminder %>%
+    filter(year == 2002) %>%
+    count(continent, sort = TRUE)
+
+
+

OUTPUT +

+
  continent  n
+1    Africa 52
+2      Asia 33
+3    Europe 30
+4  Americas 25
+5   Oceania  2
+
+

If we need to use the number of observations in calculations, the +n() function is useful. It will return the total number of +observations in the current group rather than counting the number of +observations in each group within a specific column. For instance, if we +wanted to get the standard error of the life expectancy per +continent:

+
+

R +

+
+gapminder %>%
+    group_by(continent) %>%
+    summarize(se_le = sd(lifeExp)/sqrt(n()))
+
+
+

OUTPUT +

+
# A tibble: 5 × 2
+  continent se_le
+  <chr>     <dbl>
+1 Africa    0.366
+2 Americas  0.540
+3 Asia      0.596
+4 Europe    0.286
+5 Oceania   0.775
+
+
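
To see how the two functions relate, the sketch below counts rows per continent in a small made-up data frame, once with count() and once with group_by() followed by summarize() and n(). It assumes dplyr is installed; toy, by_count and by_n are illustrative names:

```r
library("dplyr")  # assumes dplyr is installed

# Toy data standing in for gapminder; values are made up
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Europe", "Europe", "Europe"),
  lifeExp   = c(50, 54, 70, 78, 80, 82)
)

# count() groups and counts in one step
by_count <- toy %>% count(continent)

# n() gives the same numbers inside summarize()
by_n <- toy %>%
  group_by(continent) %>%
  summarize(n = n())

by_count
by_n
```

count() is the shorter choice when the count is all you need; n() is the one to reach for when the count feeds into another calculation, as in the standard error above.

+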

You can also chain together several summary operations; in this case +calculating the minimum, maximum, +mean and se of each continent’s per-country +life-expectancy:

+
+

R +

+
+gapminder %>%
+    group_by(continent) %>%
+    summarize(
+      mean_le = mean(lifeExp),
+      min_le = min(lifeExp),
+      max_le = max(lifeExp),
+      se_le = sd(lifeExp)/sqrt(n()))
+
+
+

OUTPUT +

+
# A tibble: 5 × 5
+  continent mean_le min_le max_le se_le
+  <chr>       <dbl>  <dbl>  <dbl> <dbl>
+1 Africa       48.9   23.6   76.4 0.366
+2 Americas     64.7   37.6   80.7 0.540
+3 Asia         60.1   28.8   82.6 0.596
+4 Europe       71.9   43.6   81.8 0.286
+5 Oceania      74.3   69.1   81.2 0.775
+
+

Using mutate() +

+
+

We can also create new variables prior to (or even after) summarizing +information using mutate().

+
+

R +

+
+gdp_pop_bycontinents_byyear <- gapminder %>%
+    mutate(gdp_billion = gdpPercap*pop/10^9) %>%
+    group_by(continent,year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop),
+              mean_gdp_billion = mean(gdp_billion),
+              sd_gdp_billion = sd(gdp_billion))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+

Connect mutate with logical filtering: ifelse +

+
+

When creating new variables, we can combine this with a logical +condition. A simple combination of mutate() and +ifelse() facilitates filtering right where it is needed: in +the moment of creating something new. This easy-to-read statement is a +fast and powerful way of discarding certain data (even though the +overall dimension of the data frame will not change) or for updating +values depending on this given condition.

+
+

R +

+
+## keeping all data but "filtering" after a certain condition
+# calculate GDP only for observations with a life expectancy above 25
+gdp_pop_bycontinents_byyear_above25 <- gapminder %>%
+    mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop),
+              mean_gdp_billion = mean(gdp_billion),
+              sd_gdp_billion = sd(gdp_billion))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
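
One thing to watch for when ifelse() fills in NA: summary functions such as mean() return NA for any group that contains a missing value, unless na.rm = TRUE is set. The sketch below shows both behaviours on a small made-up data frame. It assumes dplyr is installed; toy, gdp_kept and summary_toy are illustrative names:

```r
library("dplyr")  # assumes dplyr is installed

# Toy data standing in for gapminder; values are made up
toy <- data.frame(
  continent   = c("Africa", "Africa", "Asia"),
  lifeExp     = c(24, 50, 70),
  gdp_billion = c(1, 3, 10)
)

summary_toy <- toy %>%
  # Keep gdp_billion only where lifeExp is above 25, otherwise NA
  mutate(gdp_kept = ifelse(lifeExp > 25, gdp_billion, NA)) %>%
  group_by(continent) %>%
  summarize(
    mean_default = mean(gdp_kept),                # NA if any value is NA
    mean_na_rm   = mean(gdp_kept, na.rm = TRUE)   # ignores the NA values
  )

summary_toy
```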
+

R +

+
+## updating only if certain condition is fulfilled
+# for life expectancies above 40 years, the gdp to be expected in the future is scaled
+gdp_future_bycontinents_byyear_high_lifeExp <- gapminder %>%
+    mutate(gdp_futureExpectation = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              mean_gdpPercap_expected = mean(gdp_futureExpectation))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+

Combining dplyr and ggplot2 +

+
+

First install and load ggplot2:

+
+

R +

+
+install.packages('ggplot2')
+
+
+

R +

+
+library("ggplot2")
+
+

In the plotting lesson we looked at how to make a multi-panel figure +by adding a layer of facet panels using ggplot2. Here is +the code we used (with some extra comments):

+
+

R +

+
+# Filter countries located in the Americas
+americas <- gapminder[gapminder$continent == "Americas", ]
+# Make the plot
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))
+
+

This code makes the right plot but it also creates an intermediate +variable (americas) that we might not have any other uses +for. Just as we used %>% to pipe data along a chain of +dplyr functions we can use it to pass data to +ggplot(). Because %>% replaces the first +argument in a function we don’t need to specify the data = +argument in the ggplot() function. By combining +dplyr and ggplot2 functions we can make the +same figure without creating any new variables or modifying the +data.

+
+

R +

+
+gapminder %>%
+  # Filter countries located in the Americas
+  filter(continent == "Americas") %>%
+  # Make the plot
+  ggplot(mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))
+
+

More examples of using the function mutate() and the +ggplot2 package.

+
+

R +

+
+gapminder %>%
+  # extract first letter of country name into new column
+  mutate(startsWith = substr(country, 1, 1)) %>%
+  # only keep countries starting with A or Z
+  filter(startsWith %in% c("A", "Z")) %>%
+  # plot lifeExp into facets
+  ggplot(aes(x = year, y = lifeExp, colour = continent)) +
+  geom_line() +
+  facet_wrap(vars(country)) +
+  theme_minimal()
+
+
+
+ +
+
+

Advanced Challenge +

+
+

Calculate the average life expectancy in 2002 of 2 randomly selected +countries for each continent. Then arrange the continent names in +reverse order. Hint: Use the dplyr +functions arrange() and sample_n(), they have +similar syntax to other dplyr functions.

+
+
+
+
+
+ +
+
+
+

R +

+
+lifeExp_2countries_bycontinents <- gapminder %>%
+   filter(year==2002) %>%
+   group_by(continent) %>%
+   sample_n(2) %>%
+   summarize(mean_lifeExp=mean(lifeExp)) %>%
+   arrange(desc(continent))
+
+
+
+
+
+

Other great resources +

+
+ +
+
+ +
+
+

Keypoints +

+
+
    +
  • Use the dplyr package to manipulate data frames.
  • +
  • Use select() to choose variables from a data +frame.
  • +
  • Use filter() to choose data based on values.
  • +
  • Use group_by() and summarize() to work +with subsets of data.
  • +
  • Use mutate() to create new variables.
  • +
+
+
+
+

Content from Data Frame Manipulation with tidyr

+
+

Last updated on 2023-10-26

+

Estimated time 45 minutes

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I change the layout of a data frame?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • To understand the concepts of ‘longer’ and ‘wider’ data frame +formats and be able to convert between them with +tidyr.
  • +
+
+
+
+
+
+

Researchers often want to reshape their data frames from ‘wide’ to +‘longer’ layouts, or vice-versa. The ‘long’ layout or format is +where:

+
    +
  • each column is a variable
  • +
  • each row is an observation
  • +
+

In the purely ‘long’ (or ‘longest’) format, you usually have 1 column +for the observed variable and the other columns are ID variables.

+

For the ‘wide’ format each row is often a site/subject/patient and +you have multiple observation variables containing the same type of +data. These can be either repeated observations over time, or +observation of multiple variables (or a mix of both). You may find data +input may be simpler or some other applications may prefer the ‘wide’ +format. However, many of R’s functions have been designed +assuming you have ‘longer’ formatted data. This tutorial will help you +efficiently transform your data shape regardless of original format.

+
Diagram illustrating the difference between a wide versus long layout of a data frame

Long and wide data frame layouts mainly affect readability. For +humans, the wide format is often more intuitive since we can often see +more of the data on the screen due to its shape. However, the long +format is more machine readable and is closer to the formatting of +databases. The ID variables in our data frames are similar to the fields +in a database and observed variables are like the database values.

+

Getting started +

+
+

First install the packages if you haven’t already done so (you +probably installed dplyr in the previous lesson):

+
+

R +

+
+#install.packages("tidyr")
+#install.packages("dplyr")
+
+

Load the packages

+
+

R +

+
+library("tidyr")
+library("dplyr")
+
+

First, let’s look at the structure of our original gapminder data frame:

+
+

R +

+
+str(gapminder)
+
+
+

OUTPUT +

+
'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...
+
+
+
+ +
+
+

Challenge 1 +

+
+

Is gapminder a purely long, purely wide, or some intermediate format?

+
+
+
+
+
+ +
+
+

The original gapminder data.frame is in an intermediate format. It is not purely long since it has multiple observation variables (pop, lifeExp, gdpPercap).

+
+
+
+
+

Sometimes, as with the gapminder dataset, we have multiple types of observed data. It is somewhere in between the purely ‘long’ and ‘wide’ data formats. We have 3 “ID variables” (continent, country, year) and 3 “Observation variables” (pop, lifeExp, gdpPercap). This intermediate format can be preferred despite not having ALL observations in 1 column, given that all 3 observation variables have different units. There are few operations that would need us to make this data frame any longer (i.e. 4 ID variables and 1 observation variable).

+

Many of the functions in R are vector based, and you usually do not want to do mathematical operations on values with different units. For example, using the purely long format, a single mean for all of the values of population, life expectancy, and GDP would not be meaningful since it would return the mean of values with 3 incompatible units. The solution is to first manipulate the data, either by grouping (see the lesson on dplyr) or by changing the structure of the data frame. Note: Some plotting functions in R actually work better with wide format data.

+

From wide to long format with pivot_longer() +

+
+

Until now, we’ve been using the nicely formatted original gapminder dataset, but ‘real’ data (i.e. our own research data) will never be so well organized. Here let’s start with the wide formatted version of the gapminder dataset.

+
+

Download the wide version of the gapminder data from here and save it in your data folder.

+
+

We’ll load the data file and look at it. Note: we don’t want our continent and country columns to be factors, so we set the stringsAsFactors argument of read.csv() to FALSE to disable that. (Since R 4.0.0, FALSE is already the default.)

+
+

R +

+
+gap_wide <- read.csv("data/gapminder_wide.csv", stringsAsFactors = FALSE)
+str(gap_wide)
+
+
+

OUTPUT +

+
'data.frame':	142 obs. of  38 variables:
+ $ continent     : chr  "Africa" "Africa" "Africa" "Africa" ...
+ $ country       : chr  "Algeria" "Angola" "Benin" "Botswana" ...
+ $ gdpPercap_1952: num  2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num  3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num  2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num  3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num  4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num  4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num  5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num  5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num  5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num  4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num  5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num  6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num  43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num  45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num  48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num  51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num  54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num  58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num  61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num  65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num  67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num  69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num  71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num  72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num  9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num  10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num  11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num  12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num  14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num  17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num  20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num  23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num  26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num  29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : int  31287142 10866106 7026113 1630347 12251209 7021078 15929988 4048013 8835739 614382 ...
+ $ pop_2007      : int  33333216 12420476 8078314 1639131 14326203 8390505 17696293 4369038 10238807 710960 ...
+
+
Diagram illustrating the wide format of the gapminder data frame

To change this very wide data frame layout back to our nice, intermediate (or longer) layout, we will use one of the two available pivot functions from the tidyr package. To convert from wide to a longer format, we will use the pivot_longer() function. pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns, or ‘lengthening’ your observation variables into a single variable.

+
Diagram illustrating how pivot longer reorganizes a data frame from a wide to long format
+

R +

+
+gap_long <- gap_wide %>%
+  pivot_longer(
+    cols = c(starts_with('pop'), starts_with('lifeExp'), starts_with('gdpPercap')),
+    names_to = "obstype_year", values_to = "obs_values"
+  )
+str(gap_long)
+
+
+

OUTPUT +

+
tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ continent   : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
+ $ country     : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
+ $ obstype_year: chr [1:5112] "pop_1952" "pop_1957" "pop_1962" "pop_1967" ...
+ $ obs_values  : num [1:5112] 9279525 10270856 11000948 12760499 14760787 ...
+
+

Here we have used piping syntax, which is similar to what we were doing in the previous lesson with dplyr. In fact, the two packages are compatible, and you can use a mix of tidyr and dplyr functions by piping them together.

+

We first provide to pivot_longer() a vector of column names that will be pivoted into longer format. We could type out all the observation variables, but as in the select() function (see dplyr lesson), we can use the starts_with() helper to select all variables that start with the desired character string. pivot_longer() also allows the alternative syntax of using the - symbol to identify which variables are not to be pivoted (i.e. ID variables).

+

The next arguments to pivot_longer() are names_to for naming the column that will contain the new ID variable (obstype_year) and values_to for naming the new amalgamated observation variable (obs_values). We supply these new column names as strings.

+
Diagram illustrating the long format of the gapminder data
+

R +

+
+gap_long <- gap_wide %>%
+  pivot_longer(
+    cols = c(-continent, -country),
+    names_to = "obstype_year", values_to = "obs_values"
+  )
+str(gap_long)
+
+
+

OUTPUT +

+
tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ continent   : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
+ $ country     : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
+ $ obstype_year: chr [1:5112] "gdpPercap_1952" "gdpPercap_1957" "gdpPercap_1962" "gdpPercap_1967" ...
+ $ obs_values  : num [1:5112] 2449 3014 2551 3247 4183 ...
+
+

That may seem trivial with this particular data frame, but sometimes you have 1 ID variable and 40 observation variables with irregular variable names. The flexibility is a huge time saver!

+

Now obstype_year actually contains 2 pieces of information: the observation type (pop, lifeExp, or gdpPercap) and the year. We can use the separate() function to split the character strings into multiple variables:

+
+

R +

+
+gap_long <- gap_long %>% separate(obstype_year, into = c('obs_type', 'year'), sep = "_")
+gap_long$year <- as.integer(gap_long$year)
+
+
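As an aside, the pivot and the split can also be done in a single call. The sketch below is an alternative to the lesson's `separate()` step, not a replacement for it: it uses the `names_sep` and `names_transform` arguments of `pivot_longer()` (the latter assumes tidyr 1.1.0 or later) on a small hypothetical excerpt, `toy_wide`, laid out like `gap_wide`.

```r
library(tidyr)

# Hypothetical two-country excerpt in the same wide layout as gap_wide
toy_wide <- data.frame(
  country      = c("Algeria", "Angola"),
  pop_1952     = c(9279525, 4232095),
  pop_1957     = c(10270856, 4561361),
  lifeExp_1952 = c(43.1, 30.0),
  lifeExp_1957 = c(45.7, 32.0)
)

toy_long <- pivot_longer(
  toy_wide,
  cols = -country,
  names_to = c("obs_type", "year"),          # two new ID columns
  names_sep = "_",                           # split column names at the underscore
  names_transform = list(year = as.integer), # store year as an integer
  values_to = "obs_values"
)
str(toy_long)
```

This only works cleanly when every pivoted column name contains the separator exactly once, which is the case for the gapminder column names shown above.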
+
+ +
+
+

Challenge 2 +

+
+

Using gap_long, calculate the mean life expectancy, population, and gdpPercap for each continent. Hint: use the group_by() and summarize() functions we learned in the dplyr lesson.

+
+
+
+
+
+ +
+
+
+

R +

+
+gap_long %>% group_by(continent, obs_type) %>%
+   summarize(means=mean(obs_values))
+
+
+

OUTPUT +

+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
+

OUTPUT +

+
# A tibble: 15 × 3
+# Groups:   continent [5]
+   continent obs_type       means
+   <chr>     <chr>          <dbl>
+ 1 Africa    gdpPercap     2194. 
+ 2 Africa    lifeExp         48.9
+ 3 Africa    pop        9916003. 
+ 4 Americas  gdpPercap     7136. 
+ 5 Americas  lifeExp         64.7
+ 6 Americas  pop       24504795. 
+ 7 Asia      gdpPercap     7902. 
+ 8 Asia      lifeExp         60.1
+ 9 Asia      pop       77038722. 
+10 Europe    gdpPercap    14469. 
+11 Europe    lifeExp         71.9
+12 Europe    pop       17169765. 
+13 Oceania   gdpPercap    18622. 
+14 Oceania   lifeExp         74.3
+15 Oceania   pop        8874672. 
+
+
+
+
+
+

From long to intermediate format with pivot_wider() +

+
+

It is always good to check our work. So, let’s use the second pivot function, pivot_wider(), to ‘widen’ our observation variables back out. pivot_wider() is the opposite of pivot_longer(), making a dataset wider by increasing the number of columns and decreasing the number of rows. We can use pivot_wider() to pivot or reshape our gap_long to the original intermediate format or the widest format. Let’s start with the intermediate format.

+

The pivot_wider() function takes names_from and values_from arguments.

+

To names_from we supply the column name whose contents will be pivoted into new output columns in the widened data frame. The corresponding values will be added from the column named in the values_from argument.

+
+

R +

+
+gap_normal <- gap_long %>%
+  pivot_wider(names_from = obs_type, values_from = obs_values)
+dim(gap_normal)
+
+
+

OUTPUT +

+
[1] 1704    6
+
+
+

R +

+
+dim(gapminder)
+
+
+

OUTPUT +

+
[1] 1704    6
+
+
+

R +

+
+names(gap_normal)
+
+
+

OUTPUT +

+
[1] "continent" "country"   "year"      "gdpPercap" "lifeExp"   "pop"      
+
+
+

R +

+
+names(gapminder)
+
+
+

OUTPUT +

+
[1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"
+
+

Now we’ve got an intermediate data frame gap_normal with the same dimensions as the original gapminder, but the order of the variables is different. Let’s fix that before checking whether they are all.equal().

+
+

R +

+
+gap_normal <- gap_normal[, names(gapminder)]
+all.equal(gap_normal, gapminder)
+
+
+

OUTPUT +

+
[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                
+[3] "Component \"country\": 1704 string mismatches"                                         
+[4] "Component \"pop\": Mean relative difference: 1.634504"                                 
+[5] "Component \"continent\": 1212 string mismatches"                                       
+[6] "Component \"lifeExp\": Mean relative difference: 0.203822"                             
+[7] "Component \"gdpPercap\": Mean relative difference: 1.162302"                           
+
+
+

R +

+
+head(gap_normal)
+
+
+

OUTPUT +

+
# A tibble: 6 × 6
+  country  year      pop continent lifeExp gdpPercap
+  <chr>   <int>    <dbl> <chr>       <dbl>     <dbl>
+1 Algeria  1952  9279525 Africa       43.1     2449.
+2 Algeria  1957 10270856 Africa       45.7     3014.
+3 Algeria  1962 11000948 Africa       48.3     2551.
+4 Algeria  1967 12760499 Africa       51.4     3247.
+5 Algeria  1972 14760787 Africa       54.5     4183.
+6 Algeria  1977 17152804 Africa       58.0     4910.
+
+
+

R +

+
+head(gapminder)
+
+
+

OUTPUT +

+
      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134
+
+

We’re almost there. The remaining mismatches arise because the original was sorted by country, then year, so let’s sort gap_normal the same way.

+
+

R +

+
+gap_normal <- gap_normal %>% arrange(country, year)
+all.equal(gap_normal, gapminder)
+
+
+

OUTPUT +

+
[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                
+
+

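That’s great! We’ve gone from the longest format back to the intermediate format without introducing any errors. The only remaining difference is the class attribute: gap_normal is a tibble, while gapminder is a plain data.frame.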
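If you want the comparison to come back clean, one option is to convert the tibble to a plain data frame before comparing. The sketch below shows the idea on a small hypothetical data frame `df` rather than on the gapminder objects; it assumes the column order and row order already match.

```r
library(tidyr)

# Hypothetical data frame, pivoted longer and then wider again
df <- data.frame(id = c("a", "b"), x_1 = c(1, 2), x_2 = c(3, 4))

round_trip <- df %>%
  pivot_longer(cols = -id, names_to = "name", values_to = "value") %>%
  pivot_wider(names_from = name, values_from = value)

class(round_trip)  # the round trip returns a tibble, not a plain data.frame

# Drop the tibble classes so that only the contents are compared
same <- isTRUE(all.equal(as.data.frame(round_trip), df))
same
```

Wrapping `all.equal()` in `isTRUE()` is the usual way to turn its result into a single logical value, since on a mismatch it returns a character vector describing the differences.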

+

Now let’s convert the long all the way back to the wide. In the wide format, we will keep country and continent as ID variables and pivot the observations across the 3 metrics (pop, lifeExp, gdpPercap) and time (year). First we need to create appropriate labels for all our new variables (time*metric combinations) and we also need to unify our ID variables to simplify the process of defining gap_wide.

+
+

R +

+
+gap_temp <- gap_long %>% unite(var_ID, continent, country, sep = "_")
+str(gap_temp)
+
+
+

OUTPUT +

+
tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ var_ID    : chr [1:5112] "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" ...
+ $ obs_type  : chr [1:5112] "gdpPercap" "gdpPercap" "gdpPercap" "gdpPercap" ...
+ $ year      : int [1:5112] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ obs_values: num [1:5112] 2449 3014 2551 3247 4183 ...
+
+
+

R +

+
+gap_temp <- gap_long %>%
+    unite(ID_var, continent, country, sep = "_") %>%
+    unite(var_names, obs_type, year, sep = "_")
+str(gap_temp)
+
+
+

OUTPUT +

+
tibble [5,112 × 3] (S3: tbl_df/tbl/data.frame)
+ $ ID_var    : chr [1:5112] "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" ...
+ $ var_names : chr [1:5112] "gdpPercap_1952" "gdpPercap_1957" "gdpPercap_1962" "gdpPercap_1967" ...
+ $ obs_values: num [1:5112] 2449 3014 2551 3247 4183 ...
+
+

Using unite() we now have a single ID variable, which is a combination of continent and country, and we have defined variable names. We’re now ready to pipe in pivot_wider():

+
+

R +

+
+gap_wide_new <- gap_long %>%
+  unite(ID_var, continent, country, sep = "_") %>%
+  unite(var_names, obs_type, year, sep = "_") %>%
+  pivot_wider(names_from = var_names, values_from = obs_values)
+str(gap_wide_new)
+
+
+

OUTPUT +

+
tibble [142 × 37] (S3: tbl_df/tbl/data.frame)
+ $ ID_var        : chr [1:142] "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
+ $ gdpPercap_1952: num [1:142] 2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num [1:142] 3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num [1:142] 2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num [1:142] 3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num [1:142] 4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num [1:142] 4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num [1:142] 5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num [1:142] 5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num [1:142] 5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num [1:142] 4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num [1:142] 5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num [1:142] 6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num [1:142] 43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num [1:142] 45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num [1:142] 48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num [1:142] 51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num [1:142] 54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num [1:142] 58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num [1:142] 61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num [1:142] 65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num [1:142] 67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num [1:142] 69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num [1:142] 71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num [1:142] 72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num [1:142] 9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num [1:142] 10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num [1:142] 11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num [1:142] 12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num [1:142] 14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num [1:142] 17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num [1:142] 20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num [1:142] 23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num [1:142] 26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num [1:142] 29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : num [1:142] 31287142 10866106 7026113 1630347 12251209 ...
+ $ pop_2007      : num [1:142] 33333216 12420476 8078314 1639131 14326203 ...
+
+
+
+ +
+
+

Challenge 3 +

+
+

Take this one step further and create a gap_ludicrously_wide format by pivoting over countries, year, and the 3 metrics. Hint: this new data frame should only have 5 rows.

+
+
+
+
+
+ +
+
+
+

R +

+
+gap_ludicrously_wide <- gap_long %>%
+   unite(var_names, obs_type, year, country, sep = "_") %>%
+   pivot_wider(names_from = var_names, values_from = obs_values)
+
+
+
+
+
+

Now we have a great ‘wide’ format data frame, but the ID_var could be more usable. Let’s separate it into 2 variables with separate():

+
+

R +

+
+# Either split the ID column of the wide data frame we already built...
+gap_wide_betterID <- separate(gap_wide_new, ID_var, c("continent", "country"), sep = "_")
+# ...or rebuild it from gap_long in one pipeline (this gives the same result)
+gap_wide_betterID <- gap_long %>%
+    unite(ID_var, continent, country, sep = "_") %>%
+    unite(var_names, obs_type, year, sep = "_") %>%
+    pivot_wider(names_from = var_names, values_from = obs_values) %>%
+    separate(ID_var, c("continent", "country"), sep = "_")
+str(gap_wide_betterID)
+
+
+

OUTPUT +

+
tibble [142 × 38] (S3: tbl_df/tbl/data.frame)
+ $ continent     : chr [1:142] "Africa" "Africa" "Africa" "Africa" ...
+ $ country       : chr [1:142] "Algeria" "Angola" "Benin" "Botswana" ...
+ $ gdpPercap_1952: num [1:142] 2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num [1:142] 3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num [1:142] 2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num [1:142] 3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num [1:142] 4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num [1:142] 4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num [1:142] 5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num [1:142] 5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num [1:142] 5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num [1:142] 4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num [1:142] 5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num [1:142] 6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num [1:142] 43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num [1:142] 45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num [1:142] 48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num [1:142] 51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num [1:142] 54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num [1:142] 58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num [1:142] 61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num [1:142] 65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num [1:142] 67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num [1:142] 69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num [1:142] 71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num [1:142] 72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num [1:142] 9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num [1:142] 10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num [1:142] 11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num [1:142] 12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num [1:142] 14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num [1:142] 17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num [1:142] 20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num [1:142] 23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num [1:142] 26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num [1:142] 29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : num [1:142] 31287142 10866106 7026113 1630347 12251209 ...
+ $ pop_2007      : num [1:142] 33333216 12420476 8078314 1639131 14326203 ...
+
+
+

R +

+
+all.equal(gap_wide, gap_wide_betterID)
+
+
+

OUTPUT +

+
[1] "Attributes: < Component \"class\": Lengths (1, 3) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                
+
+

There and back again!
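The unite-then-separate route above is useful for learning both functions, but `pivot_wider()` can also build the combined column names itself when `names_from` is given more than one column. The following is a minimal sketch on a hypothetical excerpt, `toy_long`, shaped like `gap_long`; the default separator between the name parts is an underscore.

```r
library(tidyr)

# Hypothetical excerpt with the same columns as gap_long
toy_long <- data.frame(
  continent  = "Africa",
  country    = rep(c("Algeria", "Angola"), each = 4),
  obs_type   = rep(c("pop", "pop", "lifeExp", "lifeExp"), times = 2),
  year       = rep(c(1952L, 1957L), times = 4),
  obs_values = c(9279525, 10270856, 43.1, 45.7,
                 4232095, 4561361, 30.0, 32.0)
)

# continent and country stay as ID columns; obs_type and year become column names
toy_wide <- pivot_wider(
  toy_long,
  names_from = c(obs_type, year),
  values_from = obs_values
)
names(toy_wide)
```

Because continent and country are never merged, no `separate()` step is needed afterwards.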

+

Other great resources +

+
+ +
+
+ +
+
+

Keypoints +

+
+
    +
  • Use the tidyr package to change the layout of data frames.
  • Use pivot_longer() to go from wide to longer layout.
  • Use pivot_wider() to go from long to wider layout.
+
+
+
+

Content from Producing Reports With knitr

+
+

Last updated on 2023-10-26

+

Estimated time 75 minutes

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I integrate software and reports?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • Understand the value of writing reproducible reports
  • Learn how to recognise and compile the basic components of an R Markdown file
  • Become familiar with R code chunks, and understand their purpose, structure and options
  • Demonstrate the use of inline chunks for weaving R outputs into text blocks, for example when discussing the results of some calculations
  • Be aware of alternative output formats to which an R Markdown file can be exported
+
+
+
+
+
+

Data analysis reports +

+
+

Data analysts tend to write a lot of reports, describing their analyses and results, for their collaborators or to document their work for future reference.

+

Many new users begin by first writing a single R script containing all of their work, and then share the analysis by emailing the script and various graphs as attachments. But this can be cumbersome, requiring a lengthy discussion to explain which attachment was which result.

+

Writing formal reports with Word or LaTeX can simplify this process by incorporating both the analysis report and output graphs into a single document. But tweaking formatting to make figures look correct and fixing obnoxious page breaks can be tedious and lead to a lengthy “whack-a-mole” game of fixing new mistakes resulting from a single formatting change.

+

Creating a report as a web page (which is an html file) using R Markdown makes things easier. The report can be one long stream, so tall figures that wouldn’t ordinarily fit on one page can be kept at full size and are easier to read, since the reader can simply keep scrolling. Additionally, the formatting of an R Markdown document is simple and easy to modify, allowing you to spend more time on your analyses instead of writing reports.

+

Literate programming +

+
+

Ideally, such analysis reports are reproducible documents: if an error is discovered, or if some additional subjects are added to the data, you can just re-compile the report and get the new or corrected results rather than having to reconstruct figures, paste them into a Word document, and hand-edit various detailed results.

+

The key R package here is knitr. It allows you to create a document that is a mixture of text and chunks of code. When the document is processed by knitr, chunks of code will be executed, and graphs or other results will be inserted into the final document.

+

This sort of idea has been called “literate programming”.

+

knitr allows you to mix basically any type of text with code from different programming languages, but we recommend that you use R Markdown, which mixes Markdown with R. Markdown is a light-weight mark-up language for creating web pages.

+

Creating an R Markdown file +

+
+

Within RStudio, click File → New File → R Markdown and you’ll get a dialog box like this:

+
Screenshot of the New R Markdown file dialogue box in RStudio

You can stick with the default (HTML output), but give it a title.

+

Basic components of R Markdown +

+
+

The initial chunk of text (the header) tells R what kind of document to create and which options to use. You can use the header to give your document a title, author, and date, and to tell it what type of output you want to produce. In this case, we’re creating an html document.

+
---
+title: "Initial R Markdown document"
+author: "Karl Broman"
+date: "April 23, 2015"
+output: html_document
+---
+

You can delete any of those fields if you don’t want them included. The double-quotes aren’t strictly necessary in this case. They’re mostly needed if you want to include a colon in the title.

+

RStudio creates the document with some example text to get you started. Note that it includes chunks like this:

+
+```{r}
+summary(cars)
+```
+
+

These are chunks of R code that will be executed by knitr and replaced by their results. More on this later.

+

Markdown +

+
+

Markdown is a system for writing web pages by marking up the text much as you would in an email rather than writing html code. The marked-up text gets converted to html, replacing the marks with the proper html code.

+

For now, let’s delete all of the stuff that’s there and write a bit of markdown.

+

You make things bold using two asterisks, like this: **bold**, and you make things italic by using underscores, like this: _italics_.

+

You can make a bulleted list by writing a list with hyphens or asterisks, with a blank line between the list and the preceding text, like this:

+
A list:
+
+* bold with double-asterisks
+* italics with underscores
+* code-type font with backticks
+

or like this:

+
A second list:
+
+- bold with double-asterisks
+- italics with underscores
+- code-type font with backticks
+

Each will appear as:

+
    +
  • bold with double-asterisks
  • italics with underscores
  • code-type font with backticks
+

You can use whatever method you prefer, but be consistent. This maintains the readability of your code.

+

You can make a numbered list by just using numbers. You can even use the same number over and over if you want:

+
1. bold with double-asterisks
+1. italics with underscores
+1. code-type font with backticks
+

This will appear as:

+
    +
  1. bold with double-asterisks
  2. italics with underscores
  3. code-type font with backticks
+

You can make section headers of different sizes by initiating a line with some number of # symbols:

+
# Title
+## Main section
+### Sub-section
+#### Sub-sub section
+

You compile the R Markdown document to an html webpage by clicking the “Knit” button in the upper-left.

+
+
+ +
+
+

Challenge 1 +

+
+

Create a new R Markdown document. Delete all of the R code chunks and write a bit of Markdown (some sections, some italicized text, and an itemized list).

+

Convert the document to a webpage.

+
+
+
+
+
+ +
+
+

In RStudio, select File > New file > R Markdown…

+

Delete the placeholder text and add the following:

+
# Introduction
+
+## Background on Data
+
+This report uses the *gapminder* dataset, which has columns that include:
+
+* country
+* continent
+* year
+* lifeExp
+* pop
+* gdpPercap
+
+## Background on Methods
+
+

Then click the ‘Knit’ button on the toolbar to generate an html document (webpage).

+
+
+
+
+

A bit more Markdown +

+
+

You can make a hyperlink like this: [Carpentries Home Page](https://carpentries.org/).

+

You can include an image file like this: ![The Carpentries Logo](https://carpentries.org/assets/img/TheCarpentries.svg)

+

You can do subscripts (e.g., F₂) with F~2~ and superscripts (e.g., F²) with F^2^.

+

If you know how to write equations in LaTeX, you can use $ $ and $$ $$ to insert math equations, like $E = mc^2$ and

+
$$y = \mu + \sum_{i=1}^p \beta_i x_i + \epsilon$$
+

You can review Markdown syntax by navigating to the “Markdown Quick Reference” under the “Help” field in the toolbar at the top of RStudio.

+

R code chunks +

+
+

The real power of Markdown comes from mixing markdown with chunks of code. This is R Markdown. When processed, the R code will be executed; if it produces figures, the figures will be inserted in the final document.

+

The main code chunks look like this:

+
+```{r load_data}
+gapminder
+```
+

That is, you place a chunk of R code between ```{r chunk_name} and ```. You should give each chunk a unique name, as the names will help you to fix errors and, if any graphs are produced, the file names are based on the name of the code chunk that produced them. You can create code chunks quickly in RStudio using the shortcuts Ctrl+Alt+I on Windows and Linux, or Cmd+Option+I on Mac.

+
+
+ +
+
+

Challenge 2 +

+
+

Add code chunks to:

+
    +
  • Load the ggplot2 package
  • +
  • Read the gapminder data
  • +
  • Create a plot
  • +
+
+
+
+
+
+ +
+
+
+```{r load-ggplot2}
+library("ggplot2")
+```
+
+
+```{r read-gapminder-data}
+gapminder
+```
+
+```{r make-plot}
+plot(lifeExp ~ year, data = gapminder)
+```
+
+
+
+
+
+
+

How things get compiled +

+
+

When you press the “Knit” button, the R Markdown document is processed by knitr and a plain Markdown document is produced (as well as, potentially, a set of figure files): the R code is executed and replaced by both the input and the output; if figures are produced, links to those figures are included.

+

The Markdown document and figure files are then processed by the tool pandoc, which converts the Markdown file into an html file, with the figures embedded.
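You can watch the first of these two steps in isolation by calling knitr directly from the console. The sketch below is only an illustration: the document content and the chunk name `add_numbers` are made up, and the source is kept in memory through the `text` argument of `knit()` rather than read from a file.

```r
library(knitr)

fence <- strrep("`", 3)  # three backticks, built here to keep this example readable

# A tiny, hypothetical R Markdown source held in memory
rmd_source <- c(
  "# Demo report",
  "",
  paste0(fence, "{r add_numbers}"),
  "1 + 1",
  fence
)

# Step 1 of the pipeline: knitr runs the chunk and returns plain Markdown
md_output <- knit(text = rmd_source, quiet = TRUE)
cat(md_output)
```

The printed Markdown contains both the code and its result; pandoc would then take over to turn that Markdown into html.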

+

Chunk options +

+
+

There are a variety of options to affect how the code chunks are +treated. Here are some examples:

+
    +
  • Use echo=FALSE to avoid having the code itself shown.
  • Use results="hide" to avoid having any results printed.
  • Use eval=FALSE to have the code shown but not evaluated.
  • Use warning=FALSE and message=FALSE to hide any warnings or messages produced.
  • Use fig.height and fig.width to control the size of the figures produced (in inches).
+

So you might write:

+
+```{r load_libraries, echo=FALSE, message=FALSE}
+library("dplyr")
+library("ggplot2")
+```
+
+

Often there will be particular options that you’ll want to use repeatedly; for this, you can set global chunk options, like so:

+
+```{r global_options, echo=FALSE}
+knitr::opts_chunk$set(fig.path="Figs/", message=FALSE, warning=FALSE,
+                      echo=FALSE, results="hide", fig.width=11)
+```
+
+

The fig.path option defines where the figures will be saved. The / here is really important; without it, the figures would be saved in the standard place but just with names that begin with Figs.

+

If you have multiple R Markdown files in a common directory, you might want to use fig.path to define separate prefixes for the figure file names, like fig.path="Figs/cleaning-" and fig.path="Figs/analysis-".
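Global options can also be inspected from R, which is handy when a chunk does not behave the way you expect. This is a small sketch run in an ordinary R session rather than inside a report; the option values are just examples.

```r
library(knitr)

# Set two global chunk options, as you would in a setup chunk
opts_chunk$set(echo = FALSE, fig.width = 11)

# Read them back to confirm what later chunks will inherit
opts_chunk$get("echo")
opts_chunk$get("fig.width")
```

Options given in an individual chunk header still take precedence over these global values for that chunk.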

+
+
+ +
+
+

Challenge 3 +

+
+

Use chunk options to control the size of a figure and to hide the code.

+
+
+
+
+
+ +
+
+
+```{r echo = FALSE, fig.width = 3}
+plot(faithful)
+```
+
+
+
+
+
+

You can review all of the R chunk options by navigating to the “R Markdown Cheat Sheet” under the “Cheatsheets” section of the “Help” field in the toolbar at the top of RStudio.

+

Inline R code +

+
+

You can make every number in your report reproducible. Use +`r and ` for an in-line code chunk, like so: +`r round(some_value, 2)`. The code will be executed and +replaced with the value of the result.

+

Don’t let these in-line chunks get split across lines.

+

Perhaps precede the paragraph with a larger code chunk that does calculations and defines variables, with include=FALSE for that larger chunk (which runs the code but, like echo=FALSE combined with results="hide", shows neither the code nor its output).

+

Rounding can produce differences in output in such situations. You +may want 2.0, but round(2.03, 1) will give +just 2.

+

The myround +function in the R/broman +package handles this.
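If you would rather not add a package for this, base R offers something similar. The snippet below is a minimal sketch of that alternative, not the broman implementation; the variable name `x` is just for illustration.

```r
x <- 2.03

# round() returns a number, so the trailing zero is dropped when printed
round(x, 1)

# sprintf() and formatC() return character strings that keep the zero
one_decimal <- sprintf("%.1f", x)
one_decimal_alt <- formatC(x, format = "f", digits = 1)
one_decimal
```

Because the result is text rather than a number, this is best done at the last moment, inside the in-line chunk itself.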

+
+
+ +
+
+

Challenge 4 +

+
+

Try out a bit of in-line R code.

+
+
+
+
+
+ +
+
+

Here’s some inline code to determine that 2 + 2 = 4.

+
+
+
+
+

Other output options +

+
+

You can also convert R Markdown to a PDF or a Word document. Click +the little triangle next to the “Knit” button to get a drop-down menu. +Or you could put pdf_document or word_document +in the initial header of the file.

+
+
+ +
+
+

Tip: Creating PDF documents +

+
+

Creating .pdf documents may require installation of some extra +software. The R package tinytex provides some tools to help +make this process easier for R users. With tinytex +installed, run tinytex::install_tinytex() to install the +required software (you’ll only need to do this once) and then when you +knit to pdf tinytex will automatically detect and install +any additional LaTeX packages that are needed to produce the pdf +document. Visit the tinytex +website for more information.

+
+
+
+
+
+ +
+
+

Tip: Visual markdown editing in RStudio +

+
+

RStudio versions 1.4 and later include visual markdown editing mode. +In visual editing mode, markdown expressions (like +**bold words**) are transformed to the formatted appearance +(bold words) as you type. This mode also includes a +toolbar at the top with basic formatting buttons, similar to what you +might see in common word processing software programs. You can turn +visual editing on and off by pressing the button in the top right corner of your +R Markdown document.

+
+
+
+

Resources +

+
+ +
+
+ +
+
+

Keypoints +

+
+
    +
  • Mix reporting written in R Markdown with software written in R.
  • +
  • Specify chunk options to control formatting.
  • +
  • Use knitr to convert these documents into PDF and other +formats.
  • +
+
+
+
+

Content from Writing Good Software

+
+

Last updated on 2023-10-26

+

Estimated time 15 minutes

+
+ +
+
+

Overview

+
+
+
+
+

Questions

+
    +
  • How can I write software that other people can use?
  • +
+
+
+
+
+
+
+

Objectives

+
    +
  • Describe best practices for writing R and explain the justification +for each.
  • +
+
+
+
+
+
+

Structure your project folder +

+
+

Keep your project folder structured, organized and tidy, by creating +subfolders for your code files, manuals, data, binaries, output plots, +etc. It can be done completely manually, or with the help of RStudio’s +New Project functionality, or a designated package, such as +ProjectTemplate.

+
+
+ +
+
+

Tip: ProjectTemplate - a possible +solution +

+
+

One way to automate the management of projects is to install the +third-party package, ProjectTemplate. This package will set +up an ideal directory structure for project management. This is very +useful as it enables you to have your analysis pipeline/workflow +organised and structured. Together with the default RStudio project +functionality and Git you will be able to keep track of your work as +well as be able to share your work with collaborators.

+
    +
  1. Install ProjectTemplate.
  2. +
  3. Load the library
  4. +
  5. Initialise the project:
  6. +
+
+

R +

+
+install.packages("ProjectTemplate")
+library("ProjectTemplate")
+create.project("../my_project_2", merge.strategy = "allow.non.conflict")
+
+

For more information on ProjectTemplate and its functionality visit +the home page ProjectTemplate

+
+
+
+

Make code readable +

+
+

The most important part of writing code is making it readable and understandable. You want someone else to be able to pick up your code and understand what it does: more often than not this someone will be you, six months down the line, who will otherwise be cursing your past self.

+

Documentation: tell us what and why, not how +

+
+

When you first start out, your comments will often describe what a +command does, since you’re still learning yourself and it can help to +clarify concepts and remind you later. However, these comments aren’t +particularly useful later on when you don’t remember what problem your +code is trying to solve. Try to also include comments that tell you +why you’re solving a problem, and what problem that +is. The how can come after that: it’s an implementation detail +you ideally shouldn’t have to worry about.
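The difference between the two kinds of comment might look like this. The values and variable names are made up purely for illustration.

```r
pop <- c(1.2e6, 3.4e7)   # made-up values for illustration

# A "how" comment only restates the code:
# divide pop by 1e6
pop_millions <- pop / 1e6

# A "what and why" comment records the purpose:
# Express population in millions so that axis labels stay short and
# countries of very different sizes can be compared on one plot.
pop_millions <- pop / 1e6
```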

+

Keep your code modular +

+
+

Our recommendation is that you should separate your functions from +your analysis scripts, and store them in a separate file that you +source when you open the R session in your project. This +approach is nice because it leaves you with an uncluttered analysis +script, and a repository of useful functions that can be loaded into any +analysis script in your project. It also lets you group related +functions together easily.
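A minimal sketch of this layout is shown below. To keep it self-contained, the functions file is written to a temporary location; in a real project it would be a file you maintain yourself (the function `fahr_to_kelvin` and the file layout are assumptions for this example).

```r
# Create a small functions file. tempfile() stands in for a path inside
# your project, which is a hypothetical layout.
functions_file <- tempfile(fileext = ".R")
writeLines(
  c("fahr_to_kelvin <- function(temp) {",
    "  (temp - 32) * (5 / 9) + 273.15",
    "}"),
  functions_file
)

# In the analysis script: load the functions first, then use them
source(functions_file)
freezing_kelvin <- fahr_to_kelvin(32)
freezing_kelvin
```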

+

Break down problems into bite-size pieces

+
+

When you first start out, problem solving and function writing can be +daunting tasks, and hard to separate from code inexperience. Try to +break down your problem into digestible chunks and worry about the +implementation details later: keep breaking down the problem into +smaller and smaller functions until you reach a point where you can code +a solution, and build back up from there.
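As one possible illustration, a temperature conversion can be split into two small steps that are each easy to write and check, and then combined. The function names are chosen for this sketch.

```r
# Step 1: Fahrenheit to Kelvin
fahr_to_kelvin <- function(temp) {
  (temp - 32) * (5 / 9) + 273.15
}

# Step 2: Kelvin to Celsius
kelvin_to_celsius <- function(temp) {
  temp - 273.15
}

# Build back up: the full conversion is just the two steps in sequence
fahr_to_celsius <- function(temp) {
  kelvin_to_celsius(fahr_to_kelvin(temp))
}

boiling_celsius <- fahr_to_celsius(212)
boiling_celsius
```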

+

Know that your code is doing the right thing +

+
+

Make sure to test your functions!
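One lightweight way to do this, assuming no testing package is installed, is a block of `stopifnot()` checks that can be re-run whenever the function changes. The function `center` is a hypothetical example.

```r
# Hypothetical function under test: shift a vector so its mean is zero
center <- function(x) {
  x - mean(x)
}

# Re-runnable checks: these stop with an error if any condition is FALSE
stopifnot(
  isTRUE(all.equal(mean(center(c(1, 2, 3))), 0)),  # result has mean zero
  length(center(1:10)) == 10                       # length is preserved
)
```

Dedicated packages exist for more thorough testing, but even a few checks like these catch many mistakes.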

+

Don’t repeat yourself +

+
+

Functions enable easy reuse within a project. If you see blocks of +similar lines of code through your project, those are usually candidates +for being moved into functions.

+

If your calculations are performed through a series of functions, then the project becomes more modular and easier to change. This is especially true when a particular input always gives a particular output.
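For instance, two near-identical lines can be replaced by one function. The name `rescale01` and the toy vectors are assumptions made for this sketch.

```r
a <- c(2, 4, 6)
b <- c(10, 20, 30)

# Repetitive version: the same formula is typed out twice
a_scaled <- (a - min(a)) / (max(a) - min(a))
b_scaled <- (b - min(b)) / (max(b) - min(b))

# Refactored version: the formula lives in one place
rescale01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

a_scaled_fn <- rescale01(a)
b_scaled_fn <- rescale01(b)
```

The same input always gives the same output here, so the function is easy to test and to reuse.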

+

Remember to be stylish +

+
+

Apply consistent style to your code.

+
+
+ +
+
+

Keypoints +

+
+
    +
  • Keep your project folder structured, organized and tidy.
  • +
  • Document what and why, not how.
  • +
  • Break programs into short single-purpose functions.
  • +
  • Write re-runnable tests.
  • +
  • Don’t repeat yourself.
  • +
  • Be consistent in naming, indentation, and other aspects of +style.
  • +
+
+
+
+
+
+
+
+ + +
+ + +
+ + + + + diff --git a/instructor/discuss.html b/instructor/discuss.html new file mode 100644 index 000000000..657bdfa46 --- /dev/null +++ b/instructor/discuss.html @@ -0,0 +1,445 @@ + +R for Reproducible Scientific Analysis: Discussion +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Discussion

+

Last updated on 2023-10-26

+ + + + + +
+ +
+ + +

Please see our other R +lesson for a different presentation of these concepts.

+ + +
+
+ + +
+
+ + + diff --git a/instructor/images.html b/instructor/images.html new file mode 100644 index 000000000..0608da5f0 --- /dev/null +++ b/instructor/images.html @@ -0,0 +1,643 @@ + + + + + +R for Reproducible Scientific Analysis: All Images + + + + + + + + + + + +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + + +
+
+ + +

Introduction to R and RStudio

+
+

Figure 1

+ +
RStudio layout

+

Figure 2

+ +
RStudio layout with .R file open

Project Management With RStudio

+
+

Figure 1

+ +
Screenshot of file manager demonstrating bad project organisation

Seeking Help

+

Data Structures

+

Exploring Data Frames

+

Subsetting Data

+
+

Figure 1

+ +
Inequality testing

+

Figure 2

+ +
Inequality testing: results of recycling

Control Flow

+

Creating Publication-Quality Graphics with ggplot2

+
+

Figure 1

+ +
Blank plot, before adding any mapping aesthetics to ggplot().

+

Figure 2

+ +
Plotting area with axes for a scatter plot of life expectancy vs GDP, with no data points visible.

+

Figure 3

+ +
Scatter plot of life expectancy vs GDP per capita, now showing the data points.

+

Figure 4

+ +
Binned scatterplot of life expectancy versus year showing how life expectancy has increased over time

+

Figure 5

+ +
Binned scatterplot of life expectancy vs year with color-coded continents showing value of 'aes' function

+

Figure 6

+

+

Figure 7

+

+

Figure 8

+

+

Figure 9

+

+

Figure 10

+ +
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency.

+

Figure 11

+

+

Figure 12

+ +
Scatterplot of GDP vs life expectancy showing logarithmic x-axis data spread

+

Figure 13

+ +
Scatter plot of life expectancy vs GDP per capita with a blue trend line summarising the relationship between variables, and gray shaded area indicating 95% confidence intervals for that trend line.

+

Figure 14

+ +
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The blue trend line is slightly thicker than in the previous figure.

+

Figure 15

+ +
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency.

+

Figure 16

+

+

Figure 17

+

+

Figure 18

+

+

Figure 19

+

Vectorization

+
+

Figure 1

+ +
Scatter plot showing populations in the millions against the year for China, India, and Indonesia, countries are not labeled.

+

Figure 2

+ +
Scatter plot showing populations in the millions against the year for China, India, and Indonesia, countries are not labeled.

Functions Explained

+

Writing Data

+

Splitting and Combining Data Frames with plyr

+
+

Figure 1

+ +
Split apply combine

+

Figure 2

+ +
Full apply suite

Data Frame Manipulation with dplyr

+
+

Figure 1

+ +

Diagram illustrating use of select function to select two columns of a data frame +If we want to remove one column only from the gapminder +data, for example, removing the continent column.

+
+

Figure 2

+ +
Diagram illustrating how the group by function organizes a data frame into groups

+

Figure 3

+ +
Diagram illustrating the use of group by and summarize together to create a new variable

+

Figure 4

+

+

Figure 5

+

+

Figure 6

+

Data Frame Manipulation with tidyr

+
+

Figure 1

+ +
Diagram illustrating the difference between a wide versus long layout of a data frame

+

Figure 2

+ +
Diagram illustrating the wide format of the gapminder data frame

+

Figure 3

+ +
Diagram illustrating how pivot longer reorganizes a data frame from a wide to long format

+

Figure 4

+ +
Diagram illustrating the long format of the gapminder data

Producing Reports With knitr

+
+

Figure 1

+ +
Screenshot of the New R Markdown file dialogue box in RStudio

+

Figure 2

+

+

Figure 3

+

RStudio versions 1.4 and later include visual markdown editing mode. +In visual editing mode, markdown expressions (like +**bold words**) are transformed to the formatted appearance +(bold words) as you type. This mode also includes a +toolbar at the top with basic formatting buttons, similar to what you +might see in common word processing software programs. You can turn +visual editing on and off by pressing the button in the top right corner of your +R Markdown document.

+

Writing Good Software

+
+
+
+
+ + +
+ + +
+ + + + + diff --git a/instructor/index.html b/instructor/index.html new file mode 100644 index 000000000..bff8efe7c --- /dev/null +++ b/instructor/index.html @@ -0,0 +1,624 @@ + +R for Reproducible Scientific Analysis: Summary and Schedule +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+

Summary and Schedule

+ + +

an introduction to R for non-programmers using gapminder +data

+

The goal of this lesson is to teach novice programmers to write modular code and to follow best practices when using R for data analysis. R is widely used in many scientific disciplines, both for statistical analysis and for its array of third-party packages. We find that many scientists who come to Software Carpentry workshops use R and want to learn more. The emphasis of these materials is to give attendees a strong foundation in the fundamentals of R, and to teach best practices for scientific computing: breaking down analyses into modular units, task automation, and encapsulation.

+

Note that this workshop will focus on teaching the fundamentals of +the programming language R, and will not teach statistical analysis.

+

The lesson contains more material than can be taught in a day. The instructor notes page has some +suggested lesson plans suitable for a one or half day workshop.

+

A variety of third party packages are used throughout this workshop. +These are not necessarily the best, nor are they comprehensive, but they +are packages we find useful, and have been chosen primarily for their +usability.

+
+
+ +
+
+

Prerequisites +

+
+

Understand that computers store data and instructions (programs, +scripts etc.) in files. Files are organised in directories (folders). +Know how to access files not in the working directory by specifying the +path.

+
+
+
+ + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

+ The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor. +

+

This lesson assumes you have R and RStudio installed on your +computer.

+
+ + +
+
+ + + diff --git a/instructor/instructor-notes.html b/instructor/instructor-notes.html new file mode 100644 index 000000000..ee9c835ad --- /dev/null +++ b/instructor/instructor-notes.html @@ -0,0 +1,641 @@ + + + + + +R for Reproducible Scientific Analysis: Instructor Notes + + + + + + + + + + + +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + + +
+
+

Instructor Notes

+ + +

Timing +

+
+

Leave about 30 minutes at the start of each workshop and another 15 +mins at the start of each session for technical difficulties like WiFi +and installing things (even if you asked students to install in advance, +longer if not).

+

Lesson Plans +

+
+

The lesson contains much more material than can be taught in a day. +Instructors will need to pick an appropriate subset of episodes to use +in a standard one day course.

+

Some suggested paths through the material are:

+

(suggested by @liz-is)

+
    +
  • 01 Introduction to R and RStudio
  • +
  • 04 Data Structures
  • +
  • 05 Exploring Data Frames (“Realistic example” section onwards)
  • +
  • 08 Creating Publication-Quality Graphics with ggplot2
  • +
  • 10 Functions Explained
  • +
  • 13 Dataframe Manipulation with dplyr
  • +
  • 15 Producing Reports With knitr
  • +
+

(suggested by @naupaka)

+
    +
  • 01 Introduction to R and RStudio
  • +
  • 02 Project Management With RStudio
  • +
  • 03 Seeking Help
  • +
  • 04 Data Structures
  • +
  • 05 Exploring Data Frames
  • +
  • 06 Subsetting Data
  • +
  • 09 Vectorization
  • +
  • 08 Creating Publication-Quality Graphics with ggplot2 OR 13 +Dataframe Manipulation with dplyr
  • +
  • 15 Producing Reports With knitr
  • +
+

A half day course could consist of (suggested by @karawoo):

+
    +
  • 01 Introduction to R and RStudio
  • +
  • 04 Data Structures (only creating vectors with +c())
  • +
  • 05 Exploring Data Frames (“Realistic example” section onwards)
  • +
  • 06 Subsetting Data (excluding factor, matrix and list +subsetting)
  • +
  • 08 Creating Publication-Quality Graphics with ggplot2
  • +

Setting up git in RStudio +

+
+

There can be difficulties linking git to RStudio depending on the +operating system and the version of the operating system. To make sure +Git is properly installed and configured, the learners should go to the +Options window in the RStudio application.

+
    +
  • +Mac OS X: +
      +
    • Go to RStudio -> Preferences… -> Git/SVN
    • +
    • Check and see whether there is a path to a file in the “Git +executable” window. If not, the next challenge is figuring out where Git +is located.
    • +
    • In the terminal enter which git and you will get a path +to the git executable. In the “Git executable” window you may have +difficulties finding the directory since OS X hides many of the +operating system files. While the file selection window is open, +pressing “Command-Shift-G” will pop up a text entry box where you will +be able to type or paste in the full path to your git executable: +e.g. /usr/bin/git or whatever else it might be.
    • +
    +
  • +
  • +Windows: +
      +
    • Go to Tools -> Global options… -> Git/SVN
    • +
    • If you use the Software Carpentry Installer, then ‘git.exe’ should +be installed at C:/Program Files/Git/bin/git.exe.
    • +
    +
  • +
+

To prevent the learners from having to re-enter their password each +time they push a commit to GitHub, this command (which can be run from a +bash prompt) will make it so they only have to enter their password +once:

+
+

BASH +

+
$ git config --global credential.helper 'cache --timeout=10000000'
+
+

Pulling in Data +

+
+

The easiest way to get the data used in this lesson during a workshop +is to have attendees download the raw data from gapminder-data and gapminder-data-wide.

+

Attendees can use the File - Save As dialog in their +browser to save the file.

+

Overall +

+
+

Make sure to emphasize good practices: put code in scripts, and make +sure they’re version controlled. Encourage students to create script +files for challenges.

+

If you’re working in a cloud environment, get them to upload the +gapminder data after the second lesson.

+

Make sure to emphasize that matrices are vectors underneath the hood +and data frames are lists underneath the hood: this will explain a lot +of the esoteric behaviour encountered in basic operations.
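A short demonstration along these lines may help when making this point at the console. The object names are arbitrary.

```r
# A matrix is an atomic vector with a dim attribute
m <- matrix(1:6, nrow = 2)
is.atomic(m)      # TRUE
length(m)         # 6, the number of elements
attributes(m)     # only $dim
m[5]              # single-index access works, in column-major order

# Removing the dim attribute leaves a plain vector
v <- m
dim(v) <- NULL
v

# A data frame is a list of equal-length columns
df <- data.frame(id = 1:3, name = c("a", "b", "c"))
is.list(df)       # TRUE
length(df)        # 2, the number of columns
df[["name"]]      # list-style extraction of one column
```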

+

Vector recycling and function stacks are probably best explained with +diagrams on a whiteboard.
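If a whiteboard is not available, a small example typed live can serve as a fallback for the recycling part. This is only a sketch; the values are arbitrary.

```r
long_vec  <- c(1, 2, 3, 4)
short_vec <- c(10, 20)

# The shorter vector is reused from the start: 10, 20, 10, 20
recycled_sum <- long_vec + short_vec
recycled_sum

# A single value is recycled across every element
long_vec * 2
```

When the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which is worth showing too.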

+

Be sure to actually go through examples of an R help page: help files +can be intimidating at first, but knowing how to read them is +tremendously useful.

+

Be sure to show the CRAN task views, look at one of the topics.

+

There’s a lot of content: move quickly through the earlier lessons. They are extensive mostly so that learners pick things up by osmosis: their memory will be triggered later when they encounter a problem or some esoteric behaviour.

+

Key lessons to take time on:

+
    +
  • Data subsetting - conceptually difficult for novices
  • +
  • Functions - learners especially struggle with this
  • +
  • Data structures - worth being thorough, but you can go through it +quickly.
  • +
+

Don’t worry about being correct or knowing the material +back-to-front. Use mistakes as teaching moments: the most vital skill +you can impart is how to debug and recover from unexpected errors.

+

Introduction to R and RStudio

+

Project Management With RStudio

+

Seeking Help

+

Data Structures

+

Exploring Data Frames

+

Subsetting Data

+

Control Flow

+

Creating Publication-Quality Graphics with ggplot2

+

Vectorization

+

Functions Explained

+

Writing Data

+

Splitting and Combining Data Frames with plyr

+

Data Frame Manipulation with dplyr

+

Data Frame Manipulation with tidyr

+

Producing Reports With knitr

+

Writing Good Software

+
+
+
+
+ + +
+ + +
+ + + + + diff --git a/instructor/key-points.html b/instructor/key-points.html new file mode 100644 index 000000000..fb94941a2 --- /dev/null +++ b/instructor/key-points.html @@ -0,0 +1,619 @@ + + + + + +R for Reproducible Scientific Analysis: Key Points + + + + + + + + + + + +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + + +
+
+ + +

Introduction to R and RStudio

+
+
    +
  • Use RStudio to write and run R programs.
  • +
  • R has the usual arithmetic operators and mathematical +functions.
  • +
  • Use <- to assign values to variables.
  • +
  • Use ls() to list the variables in a program.
  • +
  • Use rm() to delete objects in a program.
  • +
  • Use install.packages() to install packages +(libraries).
  • +

Project Management With RStudio

+
+
    +
  • Use RStudio to create and manage projects with consistent +layout.
  • +
  • Treat raw data as read-only.
  • +
  • Treat generated output as disposable.
  • +
  • Separate function definition and application.
  • +

Seeking Help

+
+
    +
  • Use help() to get online help in R.
  • +

Data Structures

+
+
    +
  • Use read.csv to read tabular data in R.
  • +
  • The basic data types in R are double, integer, complex, logical, and +character.
  • +
  • Data structures such as data frames or matrices are built on top of +lists and vectors, with some added attributes.
  • +

Exploring Data Frames

+
+
    +
  • Use cbind() to add a new column to a data frame.
  • +
  • Use rbind() to add a new row to a data frame.
  • +
  • Remove rows from a data frame.
  • +
  • Use str(), summary(), nrow(), +ncol(), dim(), colnames(), +rownames(), head(), and typeof() +to understand the structure of a data frame.
  • +
  • Read in a csv file using read.csv().
  • +
  • Understand what length() of a data frame +represents.
  • +

Subsetting Data

+
+
    +
  • Indexing in R starts at 1, not 0.
  • +
  • Access individual values by location using [].
  • +
  • Access slices of data using [low:high].
  • +
  • Access arbitrary sets of data using [c(...)].
  • +
  • Use logical operations and logical vectors to access subsets of +data.
  • +

Control Flow

+
+
    +
  • Use if and else to make choices.
  • +
  • Use for to repeat operations.
  • +

Creating Publication-Quality Graphics with ggplot2

+
+
    +
  • Use ggplot2 to create plots.
  • +
  • Think about graphics in layers: aesthetics, geometry, statistics, +scale transformation, and grouping.
  • +

Vectorization

+
+
    +
  • Use vectorized operations instead of loops.
  • +

Functions Explained

+
+
    +
  • Use function to define a new function in R.
  • +
  • Use parameters to pass values into functions.
  • +
  • Use stopifnot() to flexibly check function arguments in +R.
  • +
  • Load functions into programs using source().
  • +

Writing Data

+
+
    +
  • Save plots from RStudio using the ‘Export’ button.
  • +
  • Use write.table to save tabular data.
  • +

Splitting and Combining Data Frames with plyr

+
+
    +
  • Use the plyr package to split data, apply functions to +subsets, and combine the results.
  • +

Data Frame Manipulation with dplyr

+
+
    +
  • Use the dplyr package to manipulate data frames.
  • +
  • Use select() to choose variables from a data +frame.
  • +
  • Use filter() to choose data based on values.
  • +
  • Use group_by() and summarize() to work +with subsets of data.
  • +
  • Use mutate() to create new variables.
  • +

Data Frame Manipulation with tidyr

+
+
    +
  • Use the tidyr package to change the layout of data +frames.
  • +
  • Use pivot_longer() to go from wide to longer +layout.
  • +
  • Use pivot_wider() to go from long to wider layout.
  • +

Producing Reports With knitr

+
+
    +
  • Mix reporting written in R Markdown with software written in R.
  • +
  • Specify chunk options to control formatting.
  • +
  • Use knitr to convert these documents into PDF and other +formats.
  • +

Writing Good Software

+
+
    +
  • Keep your project folder structured, organized and tidy.
  • +
  • Document what and why, not how.
  • +
  • Break programs into short single-purpose functions.
  • +
  • Write re-runnable tests.
  • +
  • Don’t repeat yourself.
  • +
  • Be consistent in naming, indentation, and other aspects of +style.
  • +
+
+
+
+ + +
+ + +
+ + + + + diff --git a/instructor/profiles.html b/instructor/profiles.html new file mode 100644 index 000000000..82d9b3080 --- /dev/null +++ b/instructor/profiles.html @@ -0,0 +1,401 @@ + +R for Reproducible Scientific Analysis: Learner Profiles +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Learner Profiles

+ +

This is a placeholder file. Please add content here.

+ +
+
+ + +
+
+ + + diff --git a/instructor/reference.html b/instructor/reference.html new file mode 100644 index 000000000..d7417cfee --- /dev/null +++ b/instructor/reference.html @@ -0,0 +1,963 @@ + +R for Reproducible Scientific Analysis: Reference +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + +
+
+

Reference

+

Last updated on 2023-10-26

+ + + + + +
+ +
+ + + +

Reference +

+

+Introduction to R and +RStudio +

+
  • Use the escape key to cancel incomplete commands or running code (Ctrl+C if you’re using R from the shell).
  • +
  • Basic arithmetic operations follow standard order of precedence: +
    • Brackets: (, ) +
    • +
    • Exponents: ^ or ** +
    • +
    • Divide: / +
    • +
    • Multiply: * +
    • +
    • Add: + +
    • +
    • Subtract: - +
    • +
  • +
  • Scientific notation is available, e.g: 2e-3 +
  • +
  • Anything to the right of a # is a comment, R will +ignore this!
  • +
  • Functions are denoted by function_name(). Expressions +inside the brackets are evaluated before being passed to the function, +and functions can be nested.
  • +
  • Mathematical functions: exp, sin, +log, log10, log2 etc.
  • +
  • Comparison operators: <, <=, +>, >=, ==, +!= +
  • +
  • Use all.equal to compare numbers!
  • +
  • <- is the assignment operator. Anything to the right is evaluated, then stored in the variable named on the left.
  • +
  • +ls lists all variables and functions you’ve +created
  • +
  • +rm can be used to remove them
  • +
  • When assigning values to function arguments, you must use +=.
  • +
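A few of the points above, sketched in one place. The variable names are illustrative only.

```r
# Floating point: == can surprise you, all.equal() allows a small tolerance
naive_equal <- 0.1 + 0.2 == 0.3                  # FALSE
safe_equal  <- isTRUE(all.equal(0.1 + 0.2, 0.3)) # TRUE

# Scientific notation
small_number <- 2e-3

# Assignment with <-, then inspect and clean up the workspace
weight_kg <- 55
ls()             # the listing now includes "weight_kg"
rm(weight_kg)    # remove it again

# Function arguments are named with =, not <-
log_value <- log(100, base = 10)
log_value
```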

+Project management with +RStudio +

+
  • To create a new project, go to File -> New Project
  • +
  • Install the packrat package to create self-contained +projects
  • +
  • +install.packages to install packages from CRAN
  • +
  • +library to load a package into R
  • +
  • +packrat::status to check whether all packages +referenced in your scripts have been installed.
  • +

+Seeking help +

+
  • To access help for a function type ?function_name or +help(function_name) +
  • +
  • Use quotes for special operators e.g. ?"+" +
  • +
  • Use fuzzy search if you can’t remember a name ‘??search_term’
  • +
  • +CRAN task +views are a good starting point.
  • +
  • +Stack Overflow is a good +place to get help with your code. +
    • ?dput will dump the data you are working with so others can load it easily.
    • +
    • +sessionInfo() will give details of your setup that +others may need for debugging.
    • +
  • +
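When asking for help, the two functions above might be used roughly as follows. The toy data frame is made up for this sketch; the round trip through `parse()` only shows that the `dput()` text can be pasted back into R.

```r
# Toy data standing in for the object you need help with
small <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))

# dput() prints R code that recreates the object; paste it into your question
dput(small)

# Someone else can rebuild the object from that text
dput_text <- capture.output(dput(small))
rebuilt <- eval(parse(text = dput_text))

# Details of the R version, platform and loaded packages
sessionInfo()
```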

+Data structures +

+

Individual values in R must be one of 5 data types, +multiple values can be grouped in data structures.

+

Data types

+
  • typeof(object) gives information about an item’s data type.

  • +
  • +

    There are 5 main data types:

    +
    • +?numeric real (decimal) numbers
    • +
    • +?integer whole numbers only
    • +
    • +?character text
    • +
    • +?complex complex numbers
    • +
    • +?logical TRUE or FALSE values
    • +

    Special types:

    +
    • +?NA missing values
    • +
    • +?NaN “not a number” for undefined values +(e.g. 0/0).
    • +
    • +?Inf, -Inf infinity.
    • +
    • +?NULL a data structure that doesn’t exist
    • +

    NA can occur in any atomic vector. NaN and Inf can only occur in numeric (double) or complex vectors; integer vectors cannot hold them. Atomic vectors are the building blocks for all other data structures. A NULL value will occur in place of an entire data structure (but can occur as list elements).

    +
  • +
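The types and special values listed above can be checked at the console. Note that `typeof()` reports decimal numbers as `"double"`.

```r
# The five main data types
typeof(1.5)      # "double" (what ?numeric describes)
typeof(2L)       # "integer" (the L suffix asks for an integer)
typeof("text")   # "character"
typeof(1 + 2i)   # "complex"
typeof(TRUE)     # "logical"

# Special values
undefined <- 0 / 0          # NaN
infinite  <- 1 / 0          # Inf
is.na(c(1, NA, 3))          # FALSE TRUE FALSE
is.null(NULL)               # TRUE
```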

Basic data structures in R:

+
  • atomic ?vector (can only contain one type)
  • +
  • +?list (containers for other objects)
  • +
  • +?data.frame two dimensional objects whose columns can +contain different types of data
  • +
  • +?matrix two dimensional objects that can contain only +one type of data.
  • +
  • +?factor vectors that contain predefined categorical +data.
  • +
  • +?array multi-dimensional objects that can only contain +one type of data
  • +

Remember that matrices are really atomic vectors underneath the hood, +and that data.frames are really lists underneath the hood (this explains +some of the weirder behaviour of R).

+

Vectors

+
  • +?vector() All items in a vector must be the same +type.
  • +
  • Items can be converted from one type to another using +coercion.
  • +
  • The concatenate function ‘c()’ will append items to a vector.
  • +
  • +seq(from=0, to=1, by=1) will create a sequence of +numbers.
  • +
  • Items in a vector can be named using the names() +function.
  • +
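A brief sketch of the vector operations listed above, using arbitrary values:

```r
# c() builds a vector and can append to an existing one
x <- c(1, 2, 3)
x <- c(x, 4)

# Mixing types triggers coercion to the most flexible type
mixed <- c(1, "a", TRUE)   # all elements become character
mixed

# Explicit coercion
as_number <- as.integer("3")

# Sequences
quarters <- seq(from = 0, to = 1, by = 0.25)

# Naming elements, then accessing by name
names(x) <- c("a", "b", "c", "d")
x["b"]
```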

Factors

+
  • +?factor() Factors are a data structure designed to +store categorical data.
  • +
  • +levels() shows the valid values that can be stored in a +vector of type factor.
  • +
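For example, with made-up categories and an explicit level order (the object name `sizes` is arbitrary):

```r
sizes <- factor(
  c("small", "large", "small", "medium"),
  levels = c("small", "medium", "large")
)

# The valid values, in the order given
levels(sizes)

# Underneath, each element is stored as an integer index into the levels
as.integer(sizes)

# Counts per category
table(sizes)
```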

Lists

+
  • +?list() Lists are a data structure designed to store +data of different types.
  • +

Matrices

+
  • +?matrix() Matrices are a data structure designed to +store 2-dimensional data.
  • +

Data +Frames

+
  • +?data.frame is a key data structure. It is a +list of vectors.
  • +
  • +cbind() will add a column (vector) to a +data.frame.
  • +
  • +rbind() will add a row (list) to a data.frame.
  • +
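A small sketch of growing a data frame by column and by row. The data and column names are invented for illustration.

```r
df <- data.frame(id = 1:2, name = c("a", "b"), stringsAsFactors = FALSE)

# Add a column: the vector needs one value per existing row
df <- cbind(df, score = c(10, 20))

# Add a row: a list, because each column can have a different type
df <- rbind(df, list(id = 3L, name = "c", score = 30))

df
```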

Useful functions for querying data structures:

+
  • +?str structure, prints out a summary of the whole data +structure
  • +
  • +?typeof tells you the type inside an atomic vector
  • +
  • +?class what is the data structure?
  • +
  • +?head print the first n elements (rows for +two-dimensional objects)
  • +
  • +?tail print the last n elements (rows for +two-dimensional objects)
  • +
  • +?rownames, ?colnames, +?dimnames retrieve or modify the row names and column names +of an object.
  • +
  • +?names retrieve or modify the names of an atomic vector +or list (or columns of a data.frame).
  • +
  • +?length get the number of elements in an atomic +vector
  • +
  • ?nrow, ?ncol, ?dim get the dimensions of an n-dimensional object (they return NULL for plain atomic vectors and lists).
  • +
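The querying functions above, applied to a toy data frame (values are arbitrary):

```r
df <- data.frame(id = 1:5, value = c(2.5, 3.5, 1, 4, 6))

str(df)            # compact summary of the whole structure
class(df)          # "data.frame"
typeof(df)         # "list", the underlying storage
typeof(df$value)   # "double", the type inside one column
head(df, 2)        # first two rows
tail(df, 2)        # last two rows
names(df)          # column names
length(df)         # 2, the number of columns
length(df$id)      # 5, the number of elements in one column
dim(df)            # 5 rows, 2 columns
nrow(df)
ncol(df)

# dim() on a plain vector returns NULL
dim(1:5)
```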

+Exploring Data +Frames +

+
  • +read.csv to read in data in a regular structure +
    • +sep argument to specify the separator +
      • “,” for comma separated
      • +
      • “\t” for tab separated
      • +
    • +
    • Other arguments: +
      • +header=TRUE if there is a header row
      • +
    • +
  • +
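To try the `sep` and `header` arguments without needing the workshop data, the sketch below writes two tiny files to a temporary location first. The file contents are made up.

```r
# Comma-separated file with a header row
csv_file <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,2.5", "2,3.5"), csv_file)
comma_data <- read.csv(csv_file, header = TRUE, sep = ",")

# The same data, tab-separated
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c("id\tvalue", "1\t2.5", "2\t3.5"), tsv_file)
tab_data <- read.csv(tsv_file, header = TRUE, sep = "\t")

comma_data
```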

+Subsetting data +

+
  • +

    Elements can be accessed by:

    +
    • Index
    • +
    • Name
    • +
    • Logical vectors
    • +
  • +
  • +

    [ single square brackets:

    +
    • +extract single elements or subset vectors
    • +
    • e.g. x[1] extracts the first item from vector x.
    • +
    • +extract single elements of a list. The returned value will +be another list().
    • +
    • +extract columns from a data.frame
    • +
  • +
  • +

    [ with two arguments to:

    +
    • +extract rows and/or columns of +
      • matrices
      • +
      • data.frames
      • +
      • e.g. x[1,2] will extract the value in row 1, column +2.
      • +
      • e.g. x[, 2] will extract the entire second column of values.
      • +
    • +
  • +
  • [[ double square brackets to extract items from +lists.

  • +
  • $ to access columns or list elements by +name

  • +
  • negative indices skip elements

  • +
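
The subsetting operators above can be compared side by side. This is a minimal sketch on made-up objects (`x`, `lst` and `m` are illustrative names only).

```r
# A named numeric vector
x <- c(a = 5.4, b = 6.2, c = 7.1)
x[1]        # by index
x["b"]      # by name
x[x > 6]    # by logical vector
x[-1]       # negative index skips the first element

# A list: [ returns a list, [[ and $ return the element itself
lst <- list(id = 1L, tags = c("x", "y"))
class(lst[1])   # "list"
lst[[2]]        # the character vector stored in the second slot
lst$id          # access by name

# A matrix: [row, column]
m <- matrix(1:6, nrow = 2)
m[1, 2]     # 3: row 1, column 2
m[2, ]      # 2 4 6: the entire second row
m[, 2]      # 3 4: the entire second column
```

Leaving one side of the comma empty selects everything along that dimension.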

+Control flow +

+
  • Use if condition to start a conditional statement, +else if condition to provide additional tests, and +else to provide a default
  • +
  • The bodies of the branches of conditional statements should be +enclosed in curly braces; indentation is optional in R but aids readability.
  • +
  • Use == to test for equality.
  • +
  • +%in% returns TRUE or FALSE for each element on its left-hand side, +indicating whether that element matches any value in the vector on the right.
  • +
  • +X && Y is only true if both X and Y are +TRUE.
  • +
  • +X || Y is true if either X or Y, or both, are +TRUE.
  • +
  • Zero is considered FALSE; all other numbers are +considered TRUE +
  • +
  • Nest loops to operate on multi-dimensional data.
  • +
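
The points above can be sketched in a few lines. The function name `classify` and the matrix `m` are hypothetical, chosen only for this illustration.

```r
# if / else if / else, with each branch body wrapped in curly braces
classify <- function(x) {
  if (x > 0) {
    "positive"
  } else if (x == 0) {
    "zero"
  } else {
    "negative"
  }
}
classify(-3)   # "negative"

# %in% tests each left-hand element against the right-hand vector
c(1, 5) %in% c(1, 2, 3)   # TRUE FALSE

# && and || combine single logical values
TRUE && FALSE   # FALSE
TRUE || FALSE   # TRUE

# Nested loops fill a two-dimensional object cell by cell
m <- matrix(0, nrow = 2, ncol = 3)
for (i in seq_len(nrow(m))) {
  for (j in seq_len(ncol(m))) {
    m[i, j] <- i * j
  }
}
m
```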

+Creating publication quality +graphics +

+
  • figures can be created with the grammar of graphics: +
    • library(ggplot2)
    • +
    • +ggplot to create the base figure
    • +
    • +aesthetics specify the data axes, shape, color, and +data size
    • +
    • +geometry functions specify the type of plot, +e.g. point, line, density, +box +
    • +
    • +geometry functions also add statistical transforms, +e.g. geom_smooth +
    • +
    • +scale functions change the mapping from data to +aesthetics
    • +
    • +facet functions stratify the figure into panels
    • +
    • +aesthetics apply to individual layers, or can be set +for the whole plot inside ggplot.
    • +
    • +theme functions change the overall look of the +plot
    • +
    • order of layers matters!
    • +
    • +ggsave to save a figure.
    • +
  • +

+Vectorization +

+
  • Most functions and operations apply to each element of a vector
  • +
  • +* applies element-wise to matrices
  • +
  • +%*% for true matrix multiplication
  • +
  • +any() will return TRUE if any element of a +vector is TRUE +
  • +
  • +all() will return TRUE if all +elements of a vector are TRUE +
  • +
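
A short sketch of the difference between element-wise and matrix operations, using small made-up values:

```r
x <- 1:4
x * 2            # 2 4 6 8: applied to every element

A <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)
A * A            # element-wise product
A %*% A          # true matrix product

any(x > 3)       # TRUE: at least one element passes
all(x > 0)       # TRUE: every element passes
```

The two products differ: `A * A` squares each entry, while `A %*% A` multiplies rows by columns.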

+Functions explained +

+
  • ?"function"
  • +
  • Put code whose parameters change frequently in a function, then call +it with different parameter values to customize its behavior.
  • +
  • The last line of a function is returned, or you can use +return explicitly
  • +
  • Any code written in the body of the function will look first +for variables defined inside the function, and only then in the enclosing environment.
  • +
  • Document Why, then What, then lastly How (if the code isn’t +self-explanatory)
  • +
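
As a sketch of these points, the example below defines two small functions. The names `fahr_to_kelvin` and `add_y` are chosen for illustration; the first shows an implicit return value, the second shows that a variable defined inside a function does not overwrite one of the same name outside it.

```r
# Why: readings arrive in Fahrenheit but the analysis uses Kelvin.
# What: convert a numeric vector from Fahrenheit to Kelvin.
fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  kelvin   # the last evaluated expression is returned
}
fahr_to_kelvin(32)   # 273.15

# Variables assigned inside a function are local to that call.
y <- 100
add_y <- function(x) {
  y <- 1           # local y, looked up before the outer y
  return(x + y)    # explicit return
}
add_y(1)   # 2
y          # still 100
```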

+Writing data +

+
  • +write.table to write out objects in regular format
  • +
  • set quote=FALSE so that text isn’t wrapped in +" marks
  • +
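
A minimal sketch of `write.table` with `quote=FALSE`. It writes to a temporary file so nothing in your project is touched; the `sep` and `row.names` settings are choices made for this example rather than requirements.

```r
# Made-up data for illustration
df <- data.frame(name = c("a", "b"), value = c(1.5, 2.5))

# tempfile() gives a throwaway path; use a real path in a project
out <- tempfile(fileext = ".csv")

write.table(df, file = out, sep = ",", quote = FALSE, row.names = FALSE)

# Inspect what was written
readLines(out)
```

Without `quote = FALSE`, the header and the `name` column would be wrapped in double quotes.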

+Split-apply-combine +

+
  • Use the xxply family of functions to apply functions to +groups within some data.
  • +
  • The first letter (a for array, d for data.frame, l for +list) denotes the input data structure
  • +
  • the second letter denotes the output data structure
  • +
  • Anonymous functions (those not assigned a name) are used inside the +plyr family of functions on groups within data.
  • +

+Dataframe manipulation with dplyr +

+
  • library(dplyr)
  • +
  • +?select to extract variables by name.
  • +
  • +?filter return rows with matching conditions.
  • +
  • +?group_by group data by one or more variables.
  • +
  • +?summarize summarize multiple values to a single +value.
  • +
  • +?mutate add new variables to a data.frame.
  • +
  • Combine operations using the ?"%>%" pipe +operator.
  • +

+Dataframe manipulation with tidyr +

+
  • library(tidyr)
  • +
  • +?pivot_longer convert data from wide to +long format.
  • +
  • +?pivot_wider convert data from long to +wide format.
  • +
  • +?separate split a single value into multiple +values.
  • +
  • +?unite merge multiple values into a single value.
  • +

+Producing reports with +knitr +

+
  • Value of reproducible reports
  • +
  • Basics of Markdown
  • +
  • R code chunks
  • +
  • Chunk options
  • +
  • Inline R code
  • +
  • Other output formats
  • +

+Best practices for writing good +code +

+
  • Program defensively, i.e., assume that errors are going to arise, +and write code to detect them when they do.
  • +
  • Write tests before writing code in order to help determine exactly +what that code is supposed to do.
  • +
  • Know what code is supposed to do before trying to debug it.
  • +
  • Make it fail every time.
  • +
  • Make it fail fast.
  • +
  • Change one thing at a time, and for a reason.
  • +
  • Keep track of what you’ve done.
  • +
  • Be humble
  • +
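
One way to read "program defensively" and "make it fail fast" is sketched below. The function `mean_positive` is hypothetical; it checks its input up front so that a bad value stops the code immediately instead of producing a misleading result later.

```r
# Hypothetical helper: the checks run before any computation.
mean_positive <- function(x) {
  stopifnot(is.numeric(x), length(x) > 0, all(x > 0))
  mean(x)
}

# A tiny re-runnable test of the expected behaviour
stopifnot(mean_positive(c(1, 3)) == 2)

# Invalid input fails at the check rather than returning a number
result <- tryCatch(
  mean_positive(c(1, -3)),
  error = function(e) "failed fast"
)
result
```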

Glossary +

+
argument
+
+A value given to a function or program when it runs. The term is often +used interchangeably (and inconsistently) with parameter. +
+
assign
+
+To give a value a name by associating a variable with it. +
+
body
+
+(of a function): the statements that are executed when a function runs. +
+
comment
+
+A remark in a program that is intended to help human readers understand +what is going on, but is ignored by the computer. Comments in Python, R, +and the Unix shell start with a # character and run to the +end of the line; comments in SQL start with --, and other +languages have other conventions. +
+
comma-separated values
+
+(CSV) A common textual representation for tables in which the values in +each row are separated by commas. +
+
delimiter
+
+A character or characters used to separate individual values, such as +the commas between columns in a CSV file. +
+
documentation
+
+Human-language text written to explain what software does, how it works, +or how to use it. +
+
floating-point number
+
+A number containing a fractional part and an exponent. See also: integer. +
+
for loop
+
+A loop that is executed once for each value in some kind of set, list, +or range. See also: while loop. +
+
index
+
+A subscript that specifies the location of a single value in a +collection, such as a single pixel in an image. +
+
integer
+
+A whole number, such as -12343. See also: floating-point number. +
+
library
+
+In R, the directory(ies) where packages are +stored. +
+
package
+
+A collection of R functions, data and compiled code in a well-defined +format. Packages are stored in a library and +loaded using the library() function. +
+
parameter
+
+A variable named in the function’s declaration that is used to hold a +value passed into the call. The term is often used interchangeably (and +inconsistently) with argument. +
+
return statement
+
+A statement that causes a function to stop executing and return a value +to its caller immediately. +
+
sequence
+
+A collection of information that is presented in a specific order. +
+
shape
+
+An array’s dimensions, represented as a vector. For example, a 5×3 +array’s shape is (5,3). +
+
string
+
+Short for “character string”, a sequence of zero +or more characters. +
+
syntax error
+
+A programming error that occurs when statements are in an order or +contain characters not expected by the programming language. +
+
type
+
+The classification of something in a program (for example, the contents +of a variable) as a kind of number (e.g. floating-point, integer), string, or something else. In R the command typeof() +is used to query a variable’s type. +
+
while loop
+
+A loop that keeps executing as long as some condition is true. See also: +for loop. +
+
+
+ + +
+
+ + + diff --git a/key-points.html b/key-points.html new file mode 100644 index 000000000..bcbd8388e --- /dev/null +++ b/key-points.html @@ -0,0 +1,623 @@ + + + + + +R for Reproducible Scientific Analysis: Key Points + + + + + + + + + + + +
+ R for Reproducible Scientific Analysis +
+ +
+
+ + + + + + +
+
+ + +

Introduction to R and RStudio

+
+
    +
  • Use RStudio to write and run R programs.
  • +
  • R has the usual arithmetic operators and mathematical +functions.
  • +
  • Use <- to assign values to variables.
  • +
  • Use ls() to list the variables in a program.
  • +
  • Use rm() to delete objects in a program.
  • +
  • Use install.packages() to install packages +(libraries).
  • +

Project Management With RStudio

+
+
    +
  • Use RStudio to create and manage projects with consistent +layout.
  • +
  • Treat raw data as read-only.
  • +
  • Treat generated output as disposable.
  • +
  • Separate function definition and application.
  • +

Seeking Help

+
+
    +
  • Use help() to get online help in R.
  • +

Data Structures

+
+
    +
  • Use read.csv to read tabular data in R.
  • +
  • The basic data types in R are double, integer, complex, logical, and +character.
  • +
  • Data structures such as data frames or matrices are built on top of +lists and vectors, with some added attributes.
  • +

Exploring Data Frames

+
+
    +
  • Use cbind() to add a new column to a data frame.
  • +
  • Use rbind() to add a new row to a data frame.
  • +
  • Remove rows from a data frame.
  • +
  • Use str(), summary(), nrow(), +ncol(), dim(), colnames(), +rownames(), head(), and typeof() +to understand the structure of a data frame.
  • +
  • Read in a csv file using read.csv().
  • +
  • Understand what length() of a data frame +represents.
  • +

Subsetting Data

+
+
    +
  • Indexing in R starts at 1, not 0.
  • +
  • Access individual values by location using [].
  • +
  • Access slices of data using [low:high].
  • +
  • Access arbitrary sets of data using [c(...)].
  • +
  • Use logical operations and logical vectors to access subsets of +data.
  • +
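
The indexing forms above, sketched on a made-up vector `v`:

```r
v <- c(10, 20, 30, 40, 50)

v[1]          # 10: indexing starts at 1
v[2:4]        # 20 30 40: a slice from low to high
v[c(1, 5)]    # 10 50: an arbitrary set of positions
v[v > 25]     # 30 40 50: a logical vector selects matching elements
```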

Control Flow

+
+
    +
  • Use if and else to make choices.
  • +
  • Use for to repeat operations.
  • +

Creating Publication-Quality Graphics with ggplot2

+
+
    +
  • Use ggplot2 to create plots.
  • +
  • Think about graphics in layers: aesthetics, geometry, statistics, +scale transformation, and grouping.
  • +

Vectorization

+
+
    +
  • Use vectorized operations instead of loops.
  • +

Functions Explained

+
+
    +
  • Use function to define a new function in R.
  • +
  • Use parameters to pass values into functions.
  • +
  • Use stopifnot() to flexibly check function arguments in +R.
  • +
  • Load functions into programs using source().
  • +
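
A sketch of loading a function with `source()`. To keep the example self-contained, the script is written to a temporary file first; in a project it would normally be an existing `.R` file. The function name `double_it` is hypothetical.

```r
# Write a one-function script to a throwaway location
script <- tempfile(fileext = ".R")
writeLines(
  "double_it <- function(x) { stopifnot(is.numeric(x)); x * 2 }",
  script
)

# source() runs the file, which defines double_it in the workspace
source(script)

double_it(21)   # 42
```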

Writing Data

+
+
    +
  • Save plots from RStudio using the ‘Export’ button.
  • +
  • Use write.table to save tabular data.
  • +

Splitting and Combining Data Frames with plyr

+
+
    +
  • Use the plyr package to split data, apply functions to +subsets, and combine the results.
  • +

Data Frame Manipulation with dplyr

+
+
    +
  • Use the dplyr package to manipulate data frames.
  • +
  • Use select() to choose variables from a data +frame.
  • +
  • Use filter() to choose data based on values.
  • +
  • Use group_by() and summarize() to work +with subsets of data.
  • +
  • Use mutate() to create new variables.
  • +

Data Frame Manipulation with tidyr

+
+
    +
  • Use the tidyr package to change the layout of data +frames.
  • +
  • Use pivot_longer() to go from wide to longer +layout.
  • +
  • Use pivot_wider() to go from long to wider layout.
  • +

Producing Reports With knitr

+
+
    +
  • Mix reporting written in R Markdown with software written in R.
  • +
  • Specify chunk options to control formatting.
  • +
  • Use knitr to convert these documents into PDF and other +formats.
  • +

Writing Good Software

+
+
    +
  • Keep your project folder structured, organized and tidy.
  • +
  • Document what and why, not how.
  • +
  • Break programs into short single-purpose functions.
  • +
  • Write re-runnable tests.
  • +
  • Don’t repeat yourself.
  • +
  • Be consistent in naming, indentation, and other aspects of +style.
  • +
+
+
+
+ + +
+ + +
+ + + + + diff --git a/link.svg b/link.svg new file mode 100644 index 000000000..88ad82769 --- /dev/null +++ b/link.svg @@ -0,0 +1,12 @@ + + + + + + diff --git a/md5sum.txt b/md5sum.txt new file mode 100644 index 000000000..8f9f0c33a --- /dev/null +++ b/md5sum.txt @@ -0,0 +1,27 @@ +"file" "checksum" "built" "date" +"CODE_OF_CONDUCT.md" "c93c83c630db2fe2462240bf72552548" "site/built/CODE_OF_CONDUCT.md" "2023-10-26" +"LICENSE.md" "b24ebbb41b14ca25cf6b8216dda83e5f" "site/built/LICENSE.md" "2023-10-26" +"config.yaml" "810028d39c377c82aef9239cb1ec0dd3" "site/built/config.yaml" "2023-10-26" +"index.md" "86c8fb559b13d1695d55b52dd6cbf574" "site/built/index.md" "2023-10-26" +"episodes/01-rstudio-intro.Rmd" "5e73c9f0c60d736ea458abe379ecef68" "site/built/01-rstudio-intro.md" "2023-10-26" +"episodes/02-project-intro.Rmd" "94e7911ebdd59fbc30de86ed1d84d4df" "site/built/02-project-intro.md" "2023-10-26" +"episodes/03-seeking-help.Rmd" "d24c310b8f36930e70379458f3c93461" "site/built/03-seeking-help.md" "2023-10-26" +"episodes/04-data-structures-part1.Rmd" "5ec938f71a9cec633cef9329d214c3a0" "site/built/04-data-structures-part1.md" "2023-10-26" +"episodes/05-data-structures-part2.Rmd" "de6c6ee224fa7201674d87844c9ede02" "site/built/05-data-structures-part2.md" "2023-10-26" +"episodes/06-data-subsetting.Rmd" "5d4ce8731ab37ddea81874d63ae1ce86" "site/built/06-data-subsetting.md" "2023-10-26" +"episodes/07-control-flow.Rmd" "6a8691c8668737e4202f49b52aeb8ac6" "site/built/07-control-flow.md" "2023-10-26" +"episodes/08-plot-ggplot2.Rmd" "775bc2b258e11b4af447c7286bca2dd4" "site/built/08-plot-ggplot2.md" "2023-10-26" +"episodes/09-vectorization.Rmd" "e229eb061b3f072a132c4b31bbc2fdb0" "site/built/09-vectorization.md" "2023-10-26" +"episodes/10-functions.Rmd" "14edd4cf50edb8fefeb987a17d740e1a" "site/built/10-functions.md" "2023-10-26" +"episodes/11-writing-data.Rmd" "8b26e062dddd2394d00c6847ff0b7505" "site/built/11-writing-data.md" "2023-10-26" +"episodes/12-plyr.Rmd" 
"909597e71c188c682b5039036b4e95cf" "site/built/12-plyr.md" "2023-10-26" +"episodes/13-dplyr.Rmd" "3ad3687a1c860ddcf30ddcbb375153fb" "site/built/13-dplyr.md" "2023-10-26" +"episodes/14-tidyr.Rmd" "6ceb2a517a291c565cfbc0f76e2fb567" "site/built/14-tidyr.md" "2023-10-26" +"episodes/15-knitr-markdown.Rmd" "65188e4a8eaf3d04c6284db65c48c83e" "site/built/15-knitr-markdown.md" "2023-10-26" +"episodes/16-wrap-up.Rmd" "c5ce0d34a37b7a99624ad1d6ac482256" "site/built/16-wrap-up.md" "2023-10-26" +"instructors/instructor-notes.md" "5ce85301c3e8d78b4b8682ae8e6bb7ff" "site/built/instructor-notes.md" "2023-10-26" +"learners/discuss.md" "42ad66ab1907e030914dbb2a94376a47" "site/built/discuss.md" "2023-10-26" +"learners/reference.md" "b606f57847b81651e8102925ff3d19c1" "site/built/reference.md" "2023-10-26" +"learners/setup.md" "f888f8a54b071715c0cf56896e650c00" "site/built/setup.md" "2023-10-26" +"profiles/learner-profiles.md" "60b93493cf1da06dfd63255d73854461" "site/built/learner-profiles.md" "2023-10-26" +"renv/profiles/lesson-requirements/renv.lock" "d0863f3009013edce68caa0b832b8754" "site/built/renv.lock" "2023-10-26" diff --git a/mstile-150x150.png b/mstile-150x150.png new file mode 100644 index 000000000..8136f75e7 Binary files /dev/null and b/mstile-150x150.png differ diff --git a/pkgdown.css b/pkgdown.css new file mode 100644 index 000000000..80ea5b838 --- /dev/null +++ b/pkgdown.css @@ -0,0 +1,384 @@ +/* Sticky footer */ + +/** + * Basic idea: https://philipwalton.github.io/solved-by-flexbox/demos/sticky-footer/ + * Details: https://github.com/philipwalton/solved-by-flexbox/blob/master/assets/css/components/site.css + * + * .Site -> body > .container + * .Site-content -> body > .container .row + * .footer -> footer + * + * Key idea seems to be to ensure that .container and __all its parents__ + * have height set to 100% + * + */ + +html, body { + height: 100%; +} + +body { + position: relative; +} + +body > .container { + display: flex; + height: 100%; + flex-direction: column; 
+} + +body > .container .row { + flex: 1 0 auto; +} + +footer { + margin-top: 45px; + padding: 35px 0 36px; + border-top: 1px solid #e5e5e5; + color: #666; + display: flex; + flex-shrink: 0; +} +footer p { + margin-bottom: 0; +} +footer div { + flex: 1; +} +footer .pkgdown { + text-align: right; +} +footer p { + margin-bottom: 0; +} + +img.icon { + float: right; +} + +/* Ensure in-page images don't run outside their container */ +.contents img { + max-width: 100%; + height: auto; +} + +/* Fix bug in bootstrap (only seen in firefox) */ +summary { + display: list-item; +} + +/* Typographic tweaking ---------------------------------*/ + +.contents .page-header { + margin-top: calc(-60px + 1em); +} + +dd { + margin-left: 3em; +} + +/* Section anchors ---------------------------------*/ + +a.anchor { + display: none; + margin-left: 5px; + width: 20px; + height: 20px; + + background-image: url(./link.svg); + background-repeat: no-repeat; + background-size: 20px 20px; + background-position: center center; +} + +h1:hover .anchor, +h2:hover .anchor, +h3:hover .anchor, +h4:hover .anchor, +h5:hover .anchor, +h6:hover .anchor { + display: inline-block; +} + +/* Fixes for fixed navbar --------------------------*/ + +.contents h1, .contents h2, .contents h3, .contents h4 { + padding-top: 60px; + margin-top: -40px; +} + +/* Navbar submenu --------------------------*/ + +.dropdown-submenu { + position: relative; +} + +.dropdown-submenu>.dropdown-menu { + top: 0; + left: 100%; + margin-top: -6px; + margin-left: -1px; + border-radius: 0 6px 6px 6px; +} + +.dropdown-submenu:hover>.dropdown-menu { + display: block; +} + +.dropdown-submenu>a:after { + display: block; + content: " "; + float: right; + width: 0; + height: 0; + border-color: transparent; + border-style: solid; + border-width: 5px 0 5px 5px; + border-left-color: #cccccc; + margin-top: 5px; + margin-right: -10px; +} + +.dropdown-submenu:hover>a:after { + border-left-color: #ffffff; +} + +.dropdown-submenu.pull-left { + 
float: none; +} + +.dropdown-submenu.pull-left>.dropdown-menu { + left: -100%; + margin-left: 10px; + border-radius: 6px 0 6px 6px; +} + +/* Sidebar --------------------------*/ + +#pkgdown-sidebar { + margin-top: 30px; + position: -webkit-sticky; + position: sticky; + top: 70px; +} + +#pkgdown-sidebar h2 { + font-size: 1.5em; + margin-top: 1em; +} + +#pkgdown-sidebar h2:first-child { + margin-top: 0; +} + +#pkgdown-sidebar .list-unstyled li { + margin-bottom: 0.5em; +} + +/* bootstrap-toc tweaks ------------------------------------------------------*/ + +/* All levels of nav */ + +nav[data-toggle='toc'] .nav > li > a { + padding: 4px 20px 4px 6px; + font-size: 1.5rem; + font-weight: 400; + color: inherit; +} + +nav[data-toggle='toc'] .nav > li > a:hover, +nav[data-toggle='toc'] .nav > li > a:focus { + padding-left: 5px; + color: inherit; + border-left: 1px solid #878787; +} + +nav[data-toggle='toc'] .nav > .active > a, +nav[data-toggle='toc'] .nav > .active:hover > a, +nav[data-toggle='toc'] .nav > .active:focus > a { + padding-left: 5px; + font-size: 1.5rem; + font-weight: 400; + color: inherit; + border-left: 2px solid #878787; +} + +/* Nav: second level (shown on .active) */ + +nav[data-toggle='toc'] .nav .nav { + display: none; /* Hide by default, but at >768px, show it */ + padding-bottom: 10px; +} + +nav[data-toggle='toc'] .nav .nav > li > a { + padding-left: 16px; + font-size: 1.35rem; +} + +nav[data-toggle='toc'] .nav .nav > li > a:hover, +nav[data-toggle='toc'] .nav .nav > li > a:focus { + padding-left: 15px; +} + +nav[data-toggle='toc'] .nav .nav > .active > a, +nav[data-toggle='toc'] .nav .nav > .active:hover > a, +nav[data-toggle='toc'] .nav .nav > .active:focus > a { + padding-left: 15px; + font-weight: 500; + font-size: 1.35rem; +} + +/* orcid ------------------------------------------------------------------- */ + +.orcid { + font-size: 16px; + color: #A6CE39; + /* margins are required by official ORCID trademark and display guidelines */ + 
margin-left:4px; + margin-right:4px; + vertical-align: middle; +} + +/* Reference index & topics ----------------------------------------------- */ + +.ref-index th {font-weight: normal;} + +.ref-index td {vertical-align: top; min-width: 100px} +.ref-index .icon {width: 40px;} +.ref-index .alias {width: 40%;} +.ref-index-icons .alias {width: calc(40% - 40px);} +.ref-index .title {width: 60%;} + +.ref-arguments th {text-align: right; padding-right: 10px;} +.ref-arguments th, .ref-arguments td {vertical-align: top; min-width: 100px} +.ref-arguments .name {width: 20%;} +.ref-arguments .desc {width: 80%;} + +/* Nice scrolling for wide elements --------------------------------------- */ + +table { + display: block; + overflow: auto; +} + +/* Syntax highlighting ---------------------------------------------------- */ + +pre, code, pre code { + background-color: #f8f8f8; + color: #333; +} +pre, pre code { + white-space: pre-wrap; + word-break: break-all; + overflow-wrap: break-word; +} + +pre { + border: 1px solid #eee; +} + +pre .img, pre .r-plt { + margin: 5px 0; +} + +pre .img img, pre .r-plt img { + background-color: #fff; +} + +code a, pre a { + color: #375f84; +} + +a.sourceLine:hover { + text-decoration: none; +} + +.fl {color: #1514b5;} +.fu {color: #000000;} /* function */ +.ch,.st {color: #036a07;} /* string */ +.kw {color: #264D66;} /* keyword */ +.co {color: #888888;} /* comment */ + +.error {font-weight: bolder;} +.warning {font-weight: bolder;} + +/* Clipboard --------------------------*/ + +.hasCopyButton { + position: relative; +} + +.btn-copy-ex { + position: absolute; + right: 0; + top: 0; + visibility: hidden; +} + +.hasCopyButton:hover button.btn-copy-ex { + visibility: visible; +} + +/* headroom.js ------------------------ */ + +.headroom { + will-change: transform; + transition: transform 200ms linear; +} +.headroom--pinned { + transform: translateY(0%); +} +.headroom--unpinned { + transform: translateY(-100%); +} + +/* mark.js 
----------------------------*/ + +mark { + background-color: rgba(255, 255, 51, 0.5); + border-bottom: 2px solid rgba(255, 153, 51, 0.3); + padding: 1px; +} + +/* vertical spacing after htmlwidgets */ +.html-widget { + margin-bottom: 10px; +} + +/* fontawesome ------------------------ */ + +.fab { + font-family: "Font Awesome 5 Brands" !important; +} + +/* don't display links in code chunks when printing */ +/* source: https://stackoverflow.com/a/10781533 */ +@media print { + code a:link:after, code a:visited:after { + content: ""; + } +} + +/* Section anchors --------------------------------- + Added in pandoc 2.11: https://github.com/jgm/pandoc-templates/commit/9904bf71 +*/ + +div.csl-bib-body { } +div.csl-entry { + clear: both; +} +.hanging-indent div.csl-entry { + margin-left:2em; + text-indent:-2em; +} +div.csl-left-margin { + min-width:2em; + float:left; +} +div.csl-right-inline { + margin-left:2em; + padding-left:1em; +} +div.csl-indent { + margin-left: 2em; +} diff --git a/pkgdown.js b/pkgdown.js new file mode 100644 index 000000000..6f0eee40b --- /dev/null +++ b/pkgdown.js @@ -0,0 +1,108 @@ +/* http://gregfranko.com/blog/jquery-best-practices/ */ +(function($) { + $(function() { + + $('.navbar-fixed-top').headroom(); + + $('body').css('padding-top', $('.navbar').height() + 10); + $(window).resize(function(){ + $('body').css('padding-top', $('.navbar').height() + 10); + }); + + $('[data-toggle="tooltip"]').tooltip(); + + var cur_path = paths(location.pathname); + var links = $("#navbar ul li a"); + var max_length = -1; + var pos = -1; + for (var i = 0; i < links.length; i++) { + if (links[i].getAttribute("href") === "#") + continue; + // Ignore external links + if (links[i].host !== location.host) + continue; + + var nav_path = paths(links[i].pathname); + + var length = prefix_length(nav_path, cur_path); + if (length > max_length) { + max_length = length; + pos = i; + } + } + + // Add class to parent
  • , and enclosing
  • if in dropdown + if (pos >= 0) { + var menu_anchor = $(links[pos]); + menu_anchor.parent().addClass("active"); + menu_anchor.closest("li.dropdown").addClass("active"); + } + }); + + function paths(pathname) { + var pieces = pathname.split("/"); + pieces.shift(); // always starts with / + + var end = pieces[pieces.length - 1]; + if (end === "index.html" || end === "") + pieces.pop(); + return(pieces); + } + + // Returns -1 if not found + function prefix_length(needle, haystack) { + if (needle.length > haystack.length) + return(-1); + + // Special case for length-0 haystack, since for loop won't run + if (haystack.length === 0) { + return(needle.length === 0 ? 0 : -1); + } + + for (var i = 0; i < haystack.length; i++) { + if (needle[i] != haystack[i]) + return(i); + } + + return(haystack.length); + } + + /* Clipboard --------------------------*/ + + function changeTooltipMessage(element, msg) { + var tooltipOriginalTitle=element.getAttribute('data-original-title'); + element.setAttribute('data-original-title', msg); + $(element).tooltip('show'); + element.setAttribute('data-original-title', tooltipOriginalTitle); + } + + if(ClipboardJS.isSupported()) { + $(document).ready(function() { + var copyButton = ""; + + $("div.sourceCode").addClass("hasCopyButton"); + + // Insert copy buttons: + $(copyButton).prependTo(".hasCopyButton"); + + // Initialize tooltips: + $('.btn-copy-ex').tooltip({container: 'body'}); + + // Initialize clipboard: + var clipboardBtnCopies = new ClipboardJS('[data-clipboard-copy]', { + text: function(trigger) { + return trigger.parentNode.textContent.replace(/\n#>[^\n]*/g, ""); + } + }); + + clipboardBtnCopies.on('success', function(e) { + changeTooltipMessage(e.trigger, 'Copied!'); + e.clearSelection(); + }); + + clipboardBtnCopies.on('error', function() { + changeTooltipMessage(e.trigger,'Press Ctrl+C or Command+C to copy'); + }); + }); + } +})(window.jQuery || window.$) diff --git a/pkgdown.yml b/pkgdown.yml new file mode 100644 index 
000000000..7c519fc7c --- /dev/null +++ b/pkgdown.yml @@ -0,0 +1,6 @@ +pandoc: 2.19.2 +pkgdown: 2.0.7 +pkgdown_sha: ~ +articles: {} +last_built: 2023-10-26T09:56Z + diff --git a/profiles.html b/profiles.html new file mode 100644 index 000000000..0c36fda1b --- /dev/null +++ b/profiles.html @@ -0,0 +1,402 @@ + +R for Reproducible Scientific Analysis: Learner Profiles +
    + R for Reproducible Scientific Analysis +
    + +
    +
    + + + + + +
    +
    +

    Learner Profiles

    + +

    This is a placeholder file. Please add content here.

    + +
    +
    + + +
    +
    + + + diff --git a/reference.html b/reference.html new file mode 100644 index 000000000..3dff36c9e --- /dev/null +++ b/reference.html @@ -0,0 +1,962 @@ + +R for Reproducible Scientific Analysis: Reference +
    + R for Reproducible Scientific Analysis +
    + +
    +
    + + + + + +
    +
    +

    Reference

    +

    Last updated on 2023-10-26

    + + + +
    + +
    + + + +

    Reference +

    +

    +Introduction to R and +RStudio +

    +
    • Use the escape key to cancel incomplete commands or running code +(or Ctrl+C if you’re using R from the shell).
    • +
    • Basic arithmetic operations follow standard order of precedence: +
      • Brackets: (, ) +
      • +
      • Exponents: ^ or ** +
      • +
      • Divide: / +
      • +
      • Multiply: * +
      • +
      • Add: + +
      • +
      • Subtract: - +
      • +
    • +
    • Scientific notation is available, e.g. 2e-3 +
    • +
    • Anything to the right of a # is a comment; R will +ignore it.
    • +
    • Functions are denoted by function_name(). Expressions +inside the brackets are evaluated before being passed to the function, +and functions can be nested.
    • +
    • Mathematical functions: exp, sin, +log, log10, log2 etc.
    • +
    • Comparison operators: <, <=, +>, >=, ==, +!= +
    • +
    • Use all.equal to compare numbers!
    • +
    • +<- is the assignment operator. Anything to the right +is evaluated, then stored in the variable named on the left.
    • +
    • +ls lists all variables and functions you’ve +created
    • +
    • +rm can be used to remove them
    • +
    • When assigning values to function arguments, you must use +=.
    • +
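
The basics above can be checked at the console. This is a minimal sketch; the variable name `tmp_value` is made up.

```r
1 + 2 * 3^2      # 19: exponent, then multiply, then add
2e-3             # 0.002 in scientific notation
log10(1000)      # 3

# Floating-point comparison: == can surprise, all.equal() tolerates
# tiny rounding differences
0.1 + 0.2 == 0.3                    # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE

tmp_value <- 1 / 40   # <- assigns to a variable
ls()                  # lists objects in the workspace
rm(tmp_value)         # removes it again

round(3.14159, digits = 2)   # = is used for function arguments
```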

    +Project management with +RStudio +

    +
    • To create a new project, go to File -> New Project
    • +
    • Install the packrat package to create self-contained +projects
    • +
    • +install.packages to install packages from CRAN
    • +
    • +library to load a package into R
    • +
    • +packrat::status to check whether all packages +referenced in your scripts have been installed.
    • +

    +Seeking help +

    +
    • To access help for a function type ?function_name or +help(function_name) +
    • +
    • Use quotes for special operators e.g. ?"+" +
    • +
    • Use fuzzy search if you can’t remember a name: ??search_term
    • +
    • +CRAN task +views are a good starting point.
    • +
    • +Stack Overflow is a good +place to get help with your code. +
      • +?dput will dump data you are working from so others can +load it easily.
      • +
      • +sessionInfo() will give details of your setup that +others may need for debugging.
      • +
    • +
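
As a sketch of why `dput` is handy when asking for help: its output is R code that rebuilds the object, so someone else can paste it into their own session. The data frame `small` is made up for illustration.

```r
small <- data.frame(id = 1:2, label = c("a", "b"))

# Prints R code that recreates the object
dput(small)

# Round trip: capture that code as text, then evaluate it
txt <- paste(capture.output(dput(small)), collapse = "\n")
restored <- eval(parse(text = txt))
all.equal(small, restored)

# Details of the R version, platform and loaded packages
sessionInfo()
```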

    +Data structures +

    +

    Individual values in R must be one of 5 data types; +multiple values can be grouped in data structures.

    +

    Data types

    +
    • typeof(object) gives information about an item’s data +type.

    • +
    • +

      There are 5 main data types:

      +
      • +?numeric real (decimal) numbers
      • +
      • +?integer whole numbers only
      • +
      • +?character text
      • +
      • +?complex complex numbers
      • +
      • +?logical TRUE or FALSE values
      • +

      Special types:

      +
      • +?NA missing values
      • +
      • +?NaN “not a number” for undefined values +(e.g. 0/0).
      • +
      • +?Inf, -Inf infinity.
      • +
      • +?NULL a data structure that doesn’t exist
      • +

      NA can occur in any atomic vector. NaN and +Inf can only occur in numeric (double) or complex +vectors; integer vectors cannot hold them. Atomic vectors are the building blocks for all other data +structures. A NULL value will occur in place of an entire +data structure (but can occur as list elements).

      +
    • +
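
The data types and special values above can be inspected with `typeof()` and the `is.*` helpers, as in this short sketch:

```r
typeof(1.5)      # "double" (what the list above calls numeric)
typeof(1L)       # "integer": the L suffix marks a whole number
typeof("a")      # "character"
typeof(1 + 2i)   # "complex"
typeof(TRUE)     # "logical"

is.na(NA)        # TRUE: missing value
is.nan(0 / 0)    # TRUE: undefined result
1 / 0            # Inf
-1 / 0           # -Inf
is.null(NULL)    # TRUE
length(NULL)     # 0: NULL stands in for an absent structure
```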

    Basic data structures in R:

    +
    • atomic ?vector (can only contain one type)
    • +
    • +?list (containers for other objects)
    • +
    • +?data.frame two dimensional objects whose columns can +contain different types of data
    • +
    • +?matrix two dimensional objects that can contain only +one type of data.
    • +
    • +?factor vectors that contain predefined categorical +data.
    • +
    • +?array multi-dimensional objects that can only contain +one type of data
    • +

    Remember that matrices are really atomic vectors underneath the hood, +and that data.frames are really lists underneath the hood (this explains +some of the weirder behaviour of R).

    +

    Vectors

    +
    • +?vector() All items in a vector must be the same +type.
    • +
    • Items can be converted from one type to another using +coercion.
    • +
    • The concatenate function ‘c()’ will append items to a vector.
    • +
    • +seq(from=0, to=1, by=1) will create a sequence of +numbers.
    • +
    • Items in a vector can be named using the names() +function.
    • +
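
A brief sketch of the vector operations above; the values and names are made up.

```r
v <- c(1, 2, 3)
v <- c(v, 4)          # c() appends an item

# Mixing types forces coercion to the most general type
typeof(c(1, "a"))     # "character"
as.integer("3")       # explicit coercion back to a number

seq(from = 0, to = 1, by = 0.25)   # 0.00 0.25 0.50 0.75 1.00

names(v) <- c("a", "b", "c", "d")  # name the items
v["c"]                             # then access by name
```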

    Factors

    +
    • +?factor() Factors are a data structure designed to +store categorical data.
    • +
    • +levels() shows the valid values that can be stored in a +vector of type factor.
    • +
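
A minimal sketch of a factor and its levels, using a made-up `coats` vector:

```r
coats <- factor(c("tabby", "black", "tabby"))

levels(coats)       # "black" "tabby": levels are sorted alphabetically
as.integer(coats)   # 2 1 2: each value is stored as a level index
```

The integer codes are why factors can behave unexpectedly when converted directly to numbers.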

    Lists

    +
    • +?list() Lists are a data structure designed to store +data of different types.
    • +

    Matrices

    +
    • +?matrix() Matrices are a data structure designed to +store 2-dimensional data.
    • +

    Data +Frames

    +
    • +?data.frame is a key data structure. It is a +list of vectors.
    • +
    • +cbind() will add a column (vector) to a +data.frame.
    • +
    • +rbind() will add a row (list) to a data.frame.
    • +
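
The two functions can be sketched on a small made-up data frame. A new column is given as a vector; a new row is given as a list, because a row can mix types.

```r
df <- data.frame(id = 1:2, name = c("a", "b"))

# Add a column: one value per existing row
df <- cbind(df, score = c(9.5, 7.0))

# Add a row: a list with one value per column, in column order
df <- rbind(df, list(3L, "c", 8.2))

df
dim(df)   # 3 3
```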

    Useful functions for querying data structures:

    +
    • +?str structure, prints out a summary of the whole data +structure
    • +
    • +?typeof tells you the type inside an atomic vector
    • +
    • +?class what is the data structure?
    • +
    • +?head print the first n elements (rows for +two-dimensional objects)
    • +
    • +?tail print the last n elements (rows for +two-dimensional objects)
    • +
    • +?rownames, ?colnames, +?dimnames retrieve or modify the row names and column names +of an object.
    • +
    • +?names retrieve or modify the names of an atomic vector +or list (or columns of a data.frame).
    • +
    • +?length get the number of elements in an atomic +vector
    • +
    • +?nrow, ?ncol, ?dim get the +dimensions of an n-dimensional object (won’t work on atomic vectors or +lists).
    • +
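
A quick sketch of these query functions applied to one toy data.frame (the data is made up for illustration):

```r
df <- data.frame(x = 1:6, y = letters[1:6])

str(df)           # summary of the whole structure
typeof(df$x)      # "integer": the type inside the atomic vector
class(df)         # "data.frame"
head(df, n = 2)   # first two rows
tail(df, n = 2)   # last two rows
names(df)         # column names
dim(df)           # rows and columns
length(df)        # number of columns, because a data.frame is a list
nrow(df$x)        # NULL: a plain atomic vector has no dimensions
```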

    +Exploring Data +Frames +

    +
    • +read.csv to read in data in a regular structure +
      • +sep argument to specify the separator +
        • “,” for comma separated
        • +
        • “\t” for tab separated
        • +
      • +
      • Other arguments: +
        • +header=TRUE if there is a header row
        • +
      • +
    • +
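
To illustrate the `sep` and `header` arguments, this sketch writes two tiny temporary files and reads them back. The file contents are placeholder values, not real data.

```r
# Comma-separated file with a header row.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("group,year,value", "A,2000,10", "B,2001,20"), csv_path)
dat <- read.csv(csv_path, header = TRUE, sep = ",")

# Tab-separated file: same reader, different separator.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("group\tyear\tvalue", "A\t2000\t10"), tsv_path)
dat_tab <- read.csv(tsv_path, header = TRUE, sep = "\t")

str(dat)
```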

    +Subsetting data +

    +
    • +

      Elements can be accessed by:

      +
      • Index
      • +
      • Name
      • +
      • Logical vectors
      • +
    • +
    • +

      [ single square brackets:

      +
      • +extract single elements or subset vectors
      • +
      • e.g. x[1] extracts the first item from vector x.
      • +
      • +extract single elements of a list. The returned value will +be another list().
      • +
      • +extract columns from a data.frame
      • +
    • +
    • +

      [ with two arguments to:

      +
      • +extract rows and/or columns of +
        • matrices
        • +
        • data.frames
        • +
        • e.g. x[1,2] will extract the value in row 1, column +2.
        • +
        • e.g. x[, 2] will extract the entire second column of +values.
        • +
      • +
    • +
    • [[ double square brackets to extract items from +lists.

    • +
    • $ to access columns or list elements by +name

    • +
    • negative indices skip elements

    • +
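
The subsetting forms listed above can be compared side by side. This is a sketch on toy objects whose names (`x`, `lst`, `m`, `df`) are chosen only for illustration.

```r
x <- c(a = 10, b = 20, c = 30)
x[1]          # by index
x["b"]        # by name
x[x > 15]     # by logical vector
x[-1]         # negative index skips the first element

lst <- list(n = 1, s = "text")
lst[1]        # single brackets return a list
lst[[1]]      # double brackets return the item itself
lst$s         # $ accesses an element by name

m <- matrix(1:6, nrow = 2)   # filled column by column
m[1, 2]       # row 1, column 2
m[2, ]        # entire second row
m[, 2]        # entire second column

df <- data.frame(id = 1:3, score = c(5, 6, 7))
df["score"]   # single bracket keeps a data.frame
df$score      # $ returns the column as a vector
```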

    +Control flow +

    +
    • Use if condition to start a conditional statement, +else if condition to provide additional tests, and +else to provide a default
    • +
    • The bodies of the branches of conditional statements are delimited by +curly braces; indenting them is a readability convention, not a requirement +in R.
    • +
    • Use == to test for equality.
    • +
    • +%in% will return a TRUE/FALSE +indicating if there is a match between an element and a vector.
    • +
    • +X && Y is only true if both X and Y are +TRUE.
    • +
    • +X || Y is true if either X or Y, or both, are +TRUE.
    • +
    • Zero is considered FALSE; all other numbers are +considered TRUE +
    • +
    • Nest loops to operate on multi-dimensional data.
    • +
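
A compact sketch of these control-flow pieces follows. The function name `classify` is hypothetical and used only to keep the example self-contained.

```r
# if / else if / else, with branch bodies in curly braces.
classify <- function(n) {
  if (n < 0) {
    "negative"
  } else if (n == 0) {
    "zero"
  } else {
    "positive"
  }
}
classify(-3)

3 %in% c(1, 2, 3)    # is the element found in the vector?
TRUE && FALSE        # both must be TRUE
TRUE || FALSE        # at least one must be TRUE

# Zero counts as FALSE in a condition.
zero_branch <- if (0) "taken" else "skipped"

# Nested loops walk over a two-dimensional object.
m <- matrix(0, nrow = 2, ncol = 3)
for (i in seq_len(nrow(m))) {
  for (j in seq_len(ncol(m))) {
    m[i, j] <- i * j
  }
}
m
```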

    +Creating publication quality +graphics +

    +
    • figures can be created with the grammar of graphics: +
      • library(ggplot2)
      • +
      • +ggplot to create the base figure
      • +
      • +aesthetics specify the data axes, shape, color, and +data size
      • +
      • +geometry functions specify the type of plot, +e.g. point, line, density, +box +
      • +
      • +geometry functions also add statistical transforms, +e.g. geom_smooth +
      • +
      • +scale functions change the mapping from data to +aesthetics
      • +
      • +facet functions stratify the figure into panels
      • +
      • +aesthetics apply to individual layers, or can be set +for the whole plot inside ggplot.
      • +
      • +theme functions change the overall look of the +plot
      • +
      • order of layers matters!
      • +
      • +ggsave to save a figure.
      • +
    • +
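
The layers described above can be combined as in the sketch below. It assumes the ggplot2 package is installed; the data frame `toy` and the output file name are invented for illustration.

```r
library(ggplot2)

toy <- data.frame(
  x = rep(1:5, times = 2),
  y = c(1, 3, 2, 5, 4, 2, 4, 3, 6, 5),
  group = rep(c("a", "b"), each = 5)
)

p <- ggplot(toy, aes(x = x, y = y, color = group)) +  # base figure and plot-wide aesthetics
  geom_point() +                 # geometry layers, drawn in the order added
  geom_line() +
  scale_y_log10() +              # scale: changes the mapping from data to the y axis
  facet_wrap(~ group) +          # facet: one panel per group
  theme_bw()                     # theme: overall look

# Save the figure to a temporary location.
ggsave(file.path(tempdir(), "toy.png"), plot = p, width = 4, height = 3)
```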

    +Vectorization +

    +
    • Most functions and operations apply to each element of a vector
    • +
    • +* applies element-wise to matrices
    • +
    • +%*% for true matrix multiplication
    • +
    • +any() will return TRUE if any element of a +vector is TRUE +
    • +
    • +all() will return TRUE if all +elements of a vector are TRUE +
    • +
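
A short sketch contrasting element-wise and matrix operations, using small made-up inputs:

```r
x <- 1:4
x * 2            # the operation applies to each element
x > 2            # comparisons are vectorized too

any(x > 2)       # TRUE: at least one element passes
all(x > 2)       # FALSE: not every element passes

m <- matrix(1:4, nrow = 2)
m * m            # element-wise product
m %*% m          # true matrix multiplication
```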

    +Functions explained +

    +
    • ?"function"
    • +
    • Put code whose parameters change frequently in a function, then call +it with different parameter values to customize its behavior.
    • +
    • The last line of a function is returned, or you can use +return explicitly
    • +
    • Any code written in the body of the function will first look +for variables defined inside the function.
    • +
    • Document Why, then What, then lastly How (if the code isn’t self +explanatory)
    • +
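
As an illustration of these points, here is a small function with a why/what comment, an implicit return value, and a scoping check. The function and variable names are assumptions made for this sketch.

```r
# Why: readings arrive in Fahrenheit but the analysis uses Kelvin.
# What: converts a numeric vector of Fahrenheit values to Kelvin.
fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  kelvin                      # the last expression is returned
}

fahr_to_kelvin(32)            # freezing point of water
fahr_to_kelvin(c(32, 212))    # same code, different parameter values

# Variables assigned inside a function do not leak out.
y <- 10
shadow <- function() {
  y <- 1                      # local y, found before the global one
  return(y)                   # explicit return
}
shadow()
y                             # still 10
```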

    +Writing data +

    +
    • +write.table to write out objects in regular format
    • +
    • set quote=FALSE so that text isn’t wrapped in +" marks
    • +
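
A minimal sketch of `write.table` with `quote=FALSE`, writing a toy data.frame to a temporary file. The `sep` and `row.names` settings are choices made for this example.

```r
df <- data.frame(name = c("a", "b"), value = c(1, 2))
out <- tempfile(fileext = ".csv")

write.table(df, file = out, sep = ",", quote = FALSE, row.names = FALSE)

written <- readLines(out)
written    # text columns appear without surrounding " marks
```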

    +Split-apply-combine +

    +
    • Use the xxply family of functions to apply functions to +groups within some data.
    • +
    • the first letter (a for array, d for data.frame, or +l for list) corresponds to the input data
    • +
    • the second letter denotes the output data structure
    • +
    • Anonymous functions (those not assigned a name) are used inside the +plyr family of functions on groups within data.
    • +
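
The naming scheme can be seen in a short sketch, assuming the plyr package is installed. The data frame `toy` is made up for illustration.

```r
library(plyr)

toy <- data.frame(group = c("a", "a", "b"), value = c(1, 3, 10))

# ddply: data.frame in, data.frame out. The anonymous function runs once per group.
means_df <- ddply(toy, .(group), function(piece) {
  data.frame(mean_value = mean(piece$value))
})

# dlply: data.frame in, list out, using the same grouping.
means_list <- dlply(toy, .(group), function(piece) mean(piece$value))

means_df
```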

    +Dataframe manipulation with dplyr +

    +
    • library(dplyr)
    • +
    • +?select to extract variables by name.
    • +
    • +?filter return rows with matching conditions.
    • +
    • +?group_by group data by one or more variables.
    • +
    • +?summarize summarize multiple values to a single +value.
    • +
    • +?mutate add new variables to a data.frame.
    • +
    • Combine operations using the ?"%>%" pipe +operator.
    • +
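
The verbs above chain together with the pipe, as in this sketch. It assumes the dplyr package is installed; the data frame `toy` and its columns are invented for the example.

```r
library(dplyr)

toy <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, -5),
  extra = c("w", "x", "y", "z")
)

res <- toy %>%
  filter(value > 0) %>%                 # keep rows matching a condition
  select(group, value) %>%              # keep variables by name
  mutate(doubled = value * 2) %>%       # add a new variable
  group_by(group) %>%                   # group by one or more variables
  summarize(mean_doubled = mean(doubled))  # collapse each group to one value

res
```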

    +Dataframe manipulation with tidyr +

    +
    • library(tidyr)
    • +
    • +?pivot_longer convert data from wide to +long format.
    • +
    • +?pivot_wider convert data from long to +wide format.
    • +
    • +?separate split a single value into multiple +values.
    • +
    • +?unite merge multiple values into a single value.
    • +
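
A round trip through the four functions might look like the sketch below, assuming the tidyr package is installed. The data frame `wide` and its column names are hypothetical.

```r
library(tidyr)

wide <- data.frame(id = 1:2, pop_2002 = c(10, 20), pop_2007 = c(11, 22))

# Wide to long: one row per id and measurement.
long <- pivot_longer(wide, cols = starts_with("pop_"),
                     names_to = "measure_year", values_to = "value")

# Split "pop_2002" into two columns, then merge them back.
split_cols <- separate(long, measure_year, into = c("measure", "year"), sep = "_")
merged <- unite(split_cols, "measure_year", measure, year, sep = "_")

# Long to wide: back to one column per measurement.
back <- pivot_wider(merged, names_from = measure_year, values_from = value)
back
```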

    +Producing reports with +knitr +

    +
    • Value of reproducible reports
    • +
    • Basics of Markdown
    • +
    • R code chunks
    • +
    • Chunk options
    • +
    • Inline R code
    • +
    • Other output formats
    • +

    +Best practices for writing good +code +

    +
    • Program defensively, i.e., assume that errors are going to arise, +and write code to detect them when they do.
    • +
    • Write tests before writing code in order to help determine exactly +what that code is supposed to do.
    • +
    • Know what code is supposed to do before trying to debug it.
    • +
    • Make it fail every time.
    • +
    • Make it fail fast.
    • +
    • Change one thing at a time, and for a reason.
    • +
    • Keep track of what you’ve done.
    • +
    • Be humble
    • +
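
Defensive programming and test-first habits can be sketched with base R alone. The function name `safe_mean` is an assumption made for this example.

```r
# Check assumptions up front so bad input fails fast, every time.
safe_mean <- function(x) {
  stopifnot(is.numeric(x), length(x) > 0)
  mean(x)
}

# A test written to pin down what the code is supposed to do.
stopifnot(safe_mean(c(1, 2, 3)) == 2)

# Invalid input raises an error instead of returning something misleading.
outcome <- tryCatch(safe_mean("a"), error = function(e) "rejected")
outcome
```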

    Glossary +

    +
    argument
    +
    +A value given to a function or program when it runs. The term is often +used interchangeably (and inconsistently) with parameter. +
    +
    assign
    +
    +To give a value a name by associating a variable with it. +
    +
    body
    +
    +(of a function): the statements that are executed when a function runs. +
    +
    comment
    +
    +A remark in a program that is intended to help human readers understand +what is going on, but is ignored by the computer. Comments in Python, R, +and the Unix shell start with a # character and run to the +end of the line; comments in SQL start with --, and other +languages have other conventions. +
    +
    comma-separated values
    +
    +(CSV) A common textual representation for tables in which the values in +each row are separated by commas. +
    +
    delimiter
    +
    +A character or characters used to separate individual values, such as +the commas between columns in a CSV file. +
    +
    documentation
    +
    +Human-language text written to explain what software does, how it works, +or how to use it. +
    +
    floating-point number
    +
    +A number containing a fractional part and an exponent. See also: integer. +
    +
    for loop
    +
    +A loop that is executed once for each value in some kind of set, list, +or range. See also: while loop. +
    +
    index
    +
    +A subscript that specifies the location of a single value in a +collection, such as a single pixel in an image. +
    +
    integer
    +
    +A whole number, such as -12343. See also: floating-point number. +
    +
    library
    +
    +In R, the directory(ies) where packages are +stored. +
    +
    package
    +
    +A collection of R functions, data and compiled code in a well-defined +format. Packages are stored in a library and +loaded using the library() function. +
    +
    parameter
    +
    +A variable named in the function’s declaration that is used to hold a +value passed into the call. The term is often used interchangeably (and +inconsistently) with argument. +
    +
    return statement
    +
    +A statement that causes a function to stop executing and return a value +to its caller immediately. +
    +
    sequence
    +
    +A collection of information that is presented in a specific order. +
    +
    shape
    +
    +An array’s dimensions, represented as a vector. For example, a 5×3 +array’s shape is (5,3). +
    +
    string
    +
    +Short for “character string”, a sequence of zero +or more characters. +
    +
    syntax error
    +
    +A programming error that occurs when statements are in an order or +contain characters not expected by the programming language. +
    +
    type
    +
    +The classification of something in a program (for example, the contents +of a variable) as a kind of number (e.g. floating-point, integer), string, or something else. In R the command typeof() +is used to query a variable’s type. +
    +
    while loop
    +
    +A loop that keeps executing as long as some condition is true. See also: +for loop. +
    +
    +
    + + +
    +
    + + + diff --git a/renv.lock b/renv.lock new file mode 100644 index 000000000..e7d643c03 --- /dev/null +++ b/renv.lock @@ -0,0 +1,1085 @@ +{ + "R": { + "Version": "4.3.1", + "Repositories": [ + { + "Name": "carpentries", + "URL": "https://carpentries.r-universe.dev" + }, + { + "Name": "carpentries_archive", + "URL": "https://carpentries.github.io/drat" + }, + { + "Name": "CRAN", + "URL": "https://cran.rstudio.com" + } + ] + }, + "Packages": { + "DiagrammeR": { + "Package": "DiagrammeR", + "Version": "1.0.10", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "RColorBrewer", + "downloader", + "dplyr", + "glue", + "htmltools", + "htmlwidgets", + "igraph", + "magrittr", + "purrr", + "readr", + "rlang", + "rstudioapi", + "scales", + "stringr", + "tibble", + "tidyr", + "viridis", + "visNetwork" + ], + "Hash": "f3de4a4878163a4629a528bbcc6e655d" + }, + "MASS": { + "Package": "MASS", + "Version": "7.3-60", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "graphics", + "methods", + "stats", + "utils" + ], + "Hash": "a56a6365b3fa73293ea8d084be0d9bb0" + }, + "Matrix": { + "Package": "Matrix", + "Version": "1.6-1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "graphics", + "grid", + "lattice", + "methods", + "stats", + "utils" + ], + "Hash": "cb6855ac711958ca734b75e631b2035d" + }, + "R6": { + "Package": "R6", + "Version": "2.5.1", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R" + ], + "Hash": "470851b6d5d0ac559e9d01bb352b4021" + }, + "RColorBrewer": { + "Package": "RColorBrewer", + "Version": "1.1-3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "45f0398006e83a5b10b72a90663d8d8c" + }, + "Rcpp": { + "Package": "Rcpp", + "Version": "1.0.11", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "methods", + "utils" + ], + "Hash": "ae6cbbe1492f4de79c45fce06f967ce8" 
+ }, + "base64enc": { + "Package": "base64enc", + "Version": "0.1-3", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R" + ], + "Hash": "543776ae6848fde2f48ff3816d0628bc" + }, + "bit": { + "Package": "bit", + "Version": "4.0.5", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "d242abec29412ce988848d0294b208fd" + }, + "bit64": { + "Package": "bit64", + "Version": "4.0.5", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "bit", + "methods", + "stats", + "utils" + ], + "Hash": "9fe98599ca456d6552421db0d6772d8f" + }, + "bslib": { + "Package": "bslib", + "Version": "0.5.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "base64enc", + "cachem", + "grDevices", + "htmltools", + "jquerylib", + "jsonlite", + "memoise", + "mime", + "rlang", + "sass" + ], + "Hash": "283015ddfbb9d7bf15ea9f0b5698f0d9" + }, + "cachem": { + "Package": "cachem", + "Version": "1.0.8", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "fastmap", + "rlang" + ], + "Hash": "c35768291560ce302c0a6589f92e837d" + }, + "cli": { + "Package": "cli", + "Version": "3.6.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "utils" + ], + "Hash": "89e6d8219950eac806ae0c489052048a" + }, + "clipr": { + "Package": "clipr", + "Version": "0.8.0", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "utils" + ], + "Hash": "3f038e5ac7f41d4ac41ce658c85e3042" + }, + "colorspace": { + "Package": "colorspace", + "Version": "2.1-0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "graphics", + "methods", + "stats" + ], + "Hash": "f20c47fd52fae58b4e377c37bb8c335b" + }, + "cpp11": { + "Package": "cpp11", + "Version": "0.4.6", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "707fae4bbf73697ec8d85f9d7076c061" + }, + "crayon": { + "Package": 
"crayon", + "Version": "1.5.2", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "grDevices", + "methods", + "utils" + ], + "Hash": "e8a1e41acf02548751f45c718d55aa6a" + }, + "digest": { + "Package": "digest", + "Version": "0.6.33", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "utils" + ], + "Hash": "b18a9cf3c003977b0cc49d5e76ebe48d" + }, + "downloader": { + "Package": "downloader", + "Version": "0.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "digest", + "utils" + ], + "Hash": "f4f2a915e0dedbdf016a83b63477349f" + }, + "dplyr": { + "Package": "dplyr", + "Version": "1.1.3", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "R6", + "cli", + "generics", + "glue", + "lifecycle", + "magrittr", + "methods", + "pillar", + "rlang", + "tibble", + "tidyselect", + "utils", + "vctrs" + ], + "Hash": "e85ffbebaad5f70e1a2e2ef4302b4949" + }, + "ellipsis": { + "Package": "ellipsis", + "Version": "0.3.2", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "rlang" + ], + "Hash": "bb0eec2fe32e88d9e2836c2f73ea2077" + }, + "evaluate": { + "Package": "evaluate", + "Version": "0.21", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "methods" + ], + "Hash": "d59f3b464e8da1aef82dc04b588b8dfb" + }, + "fansi": { + "Package": "fansi", + "Version": "1.0.4", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "grDevices", + "utils" + ], + "Hash": "1d9e7ad3c8312a192dea7d3db0274fde" + }, + "farver": { + "Package": "farver", + "Version": "2.1.1", + "Source": "Repository", + "Repository": "CRAN", + "Hash": "8106d78941f34855c440ddb946b8f7a5" + }, + "fastmap": { + "Package": "fastmap", + "Version": "1.1.1", + "Source": "Repository", + "Repository": "CRAN", + "Hash": "f7736a18de97dea803bde0a2daaafb27" + }, + "fontawesome": { + "Package": "fontawesome", + "Version": "0.5.2", + "Source": "Repository", + 
"Repository": "CRAN", + "Requirements": [ + "R", + "htmltools", + "rlang" + ], + "Hash": "c2efdd5f0bcd1ea861c2d4e2a883a67d" + }, + "fs": { + "Package": "fs", + "Version": "1.6.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "methods" + ], + "Hash": "47b5f30c720c23999b913a1a635cf0bb" + }, + "generics": { + "Package": "generics", + "Version": "0.1.3", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "methods" + ], + "Hash": "15e9634c0fcd294799e9b2e929ed1b86" + }, + "ggplot2": { + "Package": "ggplot2", + "Version": "3.4.3", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "MASS", + "R", + "cli", + "glue", + "grDevices", + "grid", + "gtable", + "isoband", + "lifecycle", + "mgcv", + "rlang", + "scales", + "stats", + "tibble", + "vctrs", + "withr" + ], + "Hash": "85846544c596e71f8f46483ab165da33" + }, + "glue": { + "Package": "glue", + "Version": "1.6.2", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "methods" + ], + "Hash": "4f2596dfb05dac67b9dc558e5c6fba2e" + }, + "gridExtra": { + "Package": "gridExtra", + "Version": "2.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "grDevices", + "graphics", + "grid", + "gtable", + "utils" + ], + "Hash": "7d7f283939f563670a697165b2cf5560" + }, + "gtable": { + "Package": "gtable", + "Version": "0.3.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "glue", + "grid", + "lifecycle", + "rlang" + ], + "Hash": "b29cf3031f49b04ab9c852c912547eef" + }, + "highr": { + "Package": "highr", + "Version": "0.10", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "xfun" + ], + "Hash": "06230136b2d2b9ba5805e1963fa6e890" + }, + "hms": { + "Package": "hms", + "Version": "1.1.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "lifecycle", + "methods", + "pkgconfig", + "rlang", + "vctrs" + ], + "Hash": 
"b59377caa7ed00fa41808342002138f9" + }, + "htmltools": { + "Package": "htmltools", + "Version": "0.5.6", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "base64enc", + "digest", + "ellipsis", + "fastmap", + "grDevices", + "rlang", + "utils" + ], + "Hash": "a2326a66919a3311f7fbb1e3bf568283" + }, + "htmlwidgets": { + "Package": "htmlwidgets", + "Version": "1.6.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "grDevices", + "htmltools", + "jsonlite", + "knitr", + "rmarkdown", + "yaml" + ], + "Hash": "a865aa85bcb2697f47505bfd70422471" + }, + "igraph": { + "Package": "igraph", + "Version": "1.5.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "Matrix", + "R", + "cli", + "cpp11", + "grDevices", + "graphics", + "lifecycle", + "magrittr", + "methods", + "pkgconfig", + "rlang", + "stats", + "utils" + ], + "Hash": "80401cb5ec513e8ddc56764d03f63669" + }, + "isoband": { + "Package": "isoband", + "Version": "0.2.7", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "grid", + "utils" + ], + "Hash": "0080607b4a1a7b28979aecef976d8bc2" + }, + "jquerylib": { + "Package": "jquerylib", + "Version": "0.1.4", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "htmltools" + ], + "Hash": "5aab57a3bd297eee1c1d862735972182" + }, + "jsonlite": { + "Package": "jsonlite", + "Version": "1.8.7", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "methods" + ], + "Hash": "266a20443ca13c65688b2116d5220f76" + }, + "knitr": { + "Package": "knitr", + "Version": "1.43", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "evaluate", + "highr", + "methods", + "tools", + "xfun", + "yaml" + ], + "Hash": "9775eb076713f627c07ce41d8199d8f6" + }, + "labeling": { + "Package": "labeling", + "Version": "0.4.3", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "graphics", + "stats" + ], + "Hash": 
"b64ec208ac5bc1852b285f665d6368b3" + }, + "lattice": { + "Package": "lattice", + "Version": "0.21-8", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "graphics", + "grid", + "stats", + "utils" + ], + "Hash": "0b8a6d63c8770f02a8b5635f3c431e6b" + }, + "lifecycle": { + "Package": "lifecycle", + "Version": "1.0.3", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "cli", + "glue", + "rlang" + ], + "Hash": "001cecbeac1cff9301bdc3775ee46a86" + }, + "magrittr": { + "Package": "magrittr", + "Version": "2.0.3", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R" + ], + "Hash": "7ce2733a9826b3aeb1775d56fd305472" + }, + "memoise": { + "Package": "memoise", + "Version": "2.0.1", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "cachem", + "rlang" + ], + "Hash": "e2817ccf4a065c5d9d7f2cfbe7c1d78c" + }, + "mgcv": { + "Package": "mgcv", + "Version": "1.9-0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "Matrix", + "R", + "graphics", + "methods", + "nlme", + "splines", + "stats", + "utils" + ], + "Hash": "086028ca0460d0c368028d3bda58f31b" + }, + "mime": { + "Package": "mime", + "Version": "0.12", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "tools" + ], + "Hash": "18e9c28c1d3ca1560ce30658b22ce104" + }, + "munsell": { + "Package": "munsell", + "Version": "0.5.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "colorspace", + "methods" + ], + "Hash": "6dfe8bf774944bd5595785e3229d8771" + }, + "nlme": { + "Package": "nlme", + "Version": "3.1-163", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "graphics", + "lattice", + "stats", + "utils" + ], + "Hash": "8d1938040a05566f4f7a14af4feadd6b" + }, + "pillar": { + "Package": "pillar", + "Version": "1.9.0", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "cli", + "fansi", + "glue", + 
"lifecycle", + "rlang", + "utf8", + "utils", + "vctrs" + ], + "Hash": "15da5a8412f317beeee6175fbc76f4bb" + }, + "pkgconfig": { + "Package": "pkgconfig", + "Version": "2.0.3", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "utils" + ], + "Hash": "01f28d4278f15c76cddbea05899c5d6f" + }, + "plyr": { + "Package": "plyr", + "Version": "1.8.8", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "Rcpp" + ], + "Hash": "d744387aef9047b0b48be2933d78e862" + }, + "prettyunits": { + "Package": "prettyunits", + "Version": "1.1.1", + "Source": "Repository", + "Repository": "CRAN", + "Hash": "95ef9167b75dde9d2ccc3c7528393e7e" + }, + "progress": { + "Package": "progress", + "Version": "1.2.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R6", + "crayon", + "hms", + "prettyunits" + ], + "Hash": "14dc9f7a3c91ebb14ec5bb9208a07061" + }, + "purrr": { + "Package": "purrr", + "Version": "1.0.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "lifecycle", + "magrittr", + "rlang", + "vctrs" + ], + "Hash": "1cba04a4e9414bdefc9dcaa99649a8dc" + }, + "rappdirs": { + "Package": "rappdirs", + "Version": "0.3.3", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R" + ], + "Hash": "5e3c5dc0b071b21fa128676560dbe94d" + }, + "readr": { + "Package": "readr", + "Version": "2.1.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "R6", + "cli", + "clipr", + "cpp11", + "crayon", + "hms", + "lifecycle", + "methods", + "rlang", + "tibble", + "tzdb", + "utils", + "vroom" + ], + "Hash": "b5047343b3825f37ad9d3b5d89aa1078" + }, + "renv": { + "Package": "renv", + "Version": "1.0.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "utils" + ], + "Hash": "4b22ac016fe54028b88d0c68badbd061" + }, + "rlang": { + "Package": "rlang", + "Version": "1.1.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + 
"R", + "utils" + ], + "Hash": "a85c767b55f0bf9b7ad16c6d7baee5bb" + }, + "rmarkdown": { + "Package": "rmarkdown", + "Version": "2.24", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "bslib", + "evaluate", + "fontawesome", + "htmltools", + "jquerylib", + "jsonlite", + "knitr", + "methods", + "stringr", + "tinytex", + "tools", + "utils", + "xfun", + "yaml" + ], + "Hash": "3854c37590717c08c32ec8542a2e0a35" + }, + "rstudioapi": { + "Package": "rstudioapi", + "Version": "0.15.0", + "Source": "Repository", + "Repository": "CRAN", + "Hash": "5564500e25cffad9e22244ced1379887" + }, + "sass": { + "Package": "sass", + "Version": "0.4.7", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R6", + "fs", + "htmltools", + "rappdirs", + "rlang" + ], + "Hash": "6bd4d33b50ff927191ec9acbf52fd056" + }, + "scales": { + "Package": "scales", + "Version": "1.2.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "R6", + "RColorBrewer", + "farver", + "labeling", + "lifecycle", + "munsell", + "rlang", + "viridisLite" + ], + "Hash": "906cb23d2f1c5680b8ce439b44c6fa63" + }, + "stringi": { + "Package": "stringi", + "Version": "1.7.12", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "stats", + "tools", + "utils" + ], + "Hash": "ca8bd84263c77310739d2cf64d84d7c9" + }, + "stringr": { + "Package": "stringr", + "Version": "1.5.0", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "cli", + "glue", + "lifecycle", + "magrittr", + "rlang", + "stringi", + "vctrs" + ], + "Hash": "671a4d384ae9d32fc47a14e98bfa3dc8" + }, + "tibble": { + "Package": "tibble", + "Version": "3.2.1", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "fansi", + "lifecycle", + "magrittr", + "methods", + "pillar", + "pkgconfig", + "rlang", + "utils", + "vctrs" + ], + "Hash": "a84e2cc86d07289b3b6f5069df7a004c" + }, + "tidyr": { + "Package": "tidyr", + "Version": 
"1.3.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "cpp11", + "dplyr", + "glue", + "lifecycle", + "magrittr", + "purrr", + "rlang", + "stringr", + "tibble", + "tidyselect", + "utils", + "vctrs" + ], + "Hash": "e47debdc7ce599b070c8e78e8ac0cfcf" + }, + "tidyselect": { + "Package": "tidyselect", + "Version": "1.2.0", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "cli", + "glue", + "lifecycle", + "rlang", + "vctrs", + "withr" + ], + "Hash": "79540e5fcd9e0435af547d885f184fd5" + }, + "tinytex": { + "Package": "tinytex", + "Version": "0.46", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "xfun" + ], + "Hash": "0c41a73214d982f539c56a7773c7afa5" + }, + "tzdb": { + "Package": "tzdb", + "Version": "0.4.0", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "cpp11" + ], + "Hash": "f561504ec2897f4d46f0c7657e488ae1" + }, + "utf8": { + "Package": "utf8", + "Version": "1.2.3", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R" + ], + "Hash": "1fe17157424bb09c48a8b3b550c753bc" + }, + "vctrs": { + "Package": "vctrs", + "Version": "0.6.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "glue", + "lifecycle", + "rlang" + ], + "Hash": "d0ef2856b83dc33ea6e255caf6229ee2" + }, + "viridis": { + "Package": "viridis", + "Version": "0.6.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "ggplot2", + "gridExtra", + "viridisLite" + ], + "Hash": "80cd127bc8c9d3d9f0904ead9a9102f1" + }, + "viridisLite": { + "Package": "viridisLite", + "Version": "0.4.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "c826c7c4241b6fc89ff55aaea3fa7491" + }, + "visNetwork": { + "Package": "visNetwork", + "Version": "2.1.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "htmltools", + "htmlwidgets", + 
"jsonlite", + "magrittr", + "methods", + "stats", + "utils" + ], + "Hash": "3e48b097e8d9a91ecced2ed4817a678d" + }, + "vroom": { + "Package": "vroom", + "Version": "1.6.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "bit64", + "cli", + "cpp11", + "crayon", + "glue", + "hms", + "lifecycle", + "methods", + "progress", + "rlang", + "stats", + "tibble", + "tidyselect", + "tzdb", + "vctrs", + "withr" + ], + "Hash": "8318e64ffb3a70e652494017ec455561" + }, + "withr": { + "Package": "withr", + "Version": "2.5.0", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "grDevices", + "graphics", + "stats" + ], + "Hash": "c0e49a9760983e81e55cdd9be92e7182" + }, + "xfun": { + "Package": "xfun", + "Version": "0.40", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "stats", + "tools" + ], + "Hash": "be07d23211245fc7d4209f54c4e4ffc8" + }, + "yaml": { + "Package": "yaml", + "Version": "2.3.7", + "Source": "Repository", + "Repository": "CRAN", + "Hash": "0d0056cc5383fbc240ccd0cb584bf436" + } + } +} diff --git a/results/lifeExp.png b/results/lifeExp.png new file mode 100644 index 000000000..1be23f640 Binary files /dev/null and b/results/lifeExp.png differ diff --git a/safari-pinned-tab.svg b/safari-pinned-tab.svg new file mode 100644 index 000000000..8a74e60c8 --- /dev/null +++ b/safari-pinned-tab.svg @@ -0,0 +1,68 @@ + + + + +Created by potrace 1.14, written by Peter Selinger 2001-2017 + + + + + + + + diff --git a/site.webmanifest b/site.webmanifest new file mode 100644 index 000000000..f2302ffdd --- /dev/null +++ b/site.webmanifest @@ -0,0 +1,19 @@ +{ + "name": "The Carpentries", + "short_name": "The Carpentries", + "icons": [ + { + "src": "/android-chrome-192x192.png", + "sizes": "192x192", + "type": "image/png" + }, + { + "src": "/android-chrome-512x512.png", + "sizes": "512x512", + "type": "image/png" + } + ], + "theme_color": "#ffffff", + "background_color": "#ffffff", + "display": "standalone" +} 
diff --git a/sitemap.xml b/sitemap.xml new file mode 100644 index 000000000..1bbe1821e --- /dev/null +++ b/sitemap.xml @@ -0,0 +1,141 @@ + + + + https://swcarpentry.github.io/r-novice-gapminder/01-rstudio-intro.html + + + https://swcarpentry.github.io/r-novice-gapminder/02-project-intro.html + + + https://swcarpentry.github.io/r-novice-gapminder/03-seeking-help.html + + + https://swcarpentry.github.io/r-novice-gapminder/04-data-structures-part1.html + + + https://swcarpentry.github.io/r-novice-gapminder/05-data-structures-part2.html + + + https://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting.html + + + https://swcarpentry.github.io/r-novice-gapminder/07-control-flow.html + + + https://swcarpentry.github.io/r-novice-gapminder/08-plot-ggplot2.html + + + https://swcarpentry.github.io/r-novice-gapminder/09-vectorization.html + + + https://swcarpentry.github.io/r-novice-gapminder/10-functions.html + + + https://swcarpentry.github.io/r-novice-gapminder/11-writing-data.html + + + https://swcarpentry.github.io/r-novice-gapminder/12-plyr.html + + + https://swcarpentry.github.io/r-novice-gapminder/13-dplyr.html + + + https://swcarpentry.github.io/r-novice-gapminder/14-tidyr.html + + + https://swcarpentry.github.io/r-novice-gapminder/15-knitr-markdown.html + + + https://swcarpentry.github.io/r-novice-gapminder/16-wrap-up.html + + + https://swcarpentry.github.io/r-novice-gapminder/404.html + + + https://swcarpentry.github.io/r-novice-gapminder/CODE_OF_CONDUCT.html + + + https://swcarpentry.github.io/r-novice-gapminder/LICENSE.html + + + https://swcarpentry.github.io/r-novice-gapminder/discuss.html + + + https://swcarpentry.github.io/r-novice-gapminder/index.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/01-rstudio-intro.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/02-project-intro.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/03-seeking-help.html + + + 
https://swcarpentry.github.io/r-novice-gapminder/instructor/04-data-structures-part1.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/05-data-structures-part2.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/06-data-subsetting.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/07-control-flow.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/08-plot-ggplot2.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/09-vectorization.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/10-functions.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/11-writing-data.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/12-plyr.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/13-dplyr.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/14-tidyr.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/15-knitr-markdown.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/16-wrap-up.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/404.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/CODE_OF_CONDUCT.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/LICENSE.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/discuss.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/index.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/profiles.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/reference.html + + + https://swcarpentry.github.io/r-novice-gapminder/profiles.html + + + https://swcarpentry.github.io/r-novice-gapminder/reference.html + +