diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 000000000..e69de29bb diff --git a/01-rstudio-intro.html b/01-rstudio-intro.html new file mode 100644 index 000000000..3729bdb12 --- /dev/null +++ b/01-rstudio-intro.html @@ -0,0 +1,1469 @@ + +R for Reproducible Scientific Analysis: Introduction to R and RStudio +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Introduction to R and RStudio


Last updated on 2023-10-26

+ + + +
+ +
+ + + +




  • How to find your way around RStudio?
  • +
  • How to interact with R?
  • +
  • How to manage your environment?
  • +
  • How to install packages?
  • +


  • Describe the purpose and use of each pane in the RStudio IDE
  • +
  • Locate buttons and options in the RStudio IDE
  • +
  • Define a variable
  • +
  • Assign data to a variable
  • +
  • Manage a workspace in an interactive R session
  • +
  • Use mathematical and comparison operators
  • +
  • Call functions
  • +
  • Manage packages
  • +

Motivation +


Science is a multi-step process: once you’ve designed an experiment +and collected data, the real fun begins! This lesson will teach you how +to start this process using R and RStudio. We will begin with raw data, +perform exploratory analyses, and learn how to plot results graphically. +This example starts with a dataset from gapminder.org containing population +information for many countries through time. Can you read the data into +R? Can you plot the population for Senegal? Can you calculate the +average income for countries on the continent of Asia? By the end of +these lessons you will be able to do things like plot the populations +for all of these countries in under a minute!


Before Starting The Workshop +


Please ensure you have the latest version of R and RStudio installed +on your machine. This is important, as some packages used in the +workshop may not install correctly (or at all) if R is not up to +date.


Introduction to RStudio +


Welcome to the R portion of the Software Carpentry workshop.


Throughout this lesson, we’re going to teach you some of the +fundamentals of the R language as well as some best practices for +organizing code for scientific projects that will make your life +easier.


We’ll be using RStudio: a free, open-source R Integrated Development +Environment (IDE). It provides a built-in editor, works on all platforms +(including on servers) and provides many advantages such as integration +with version control and project management.


Basic layout


When you first open RStudio, you will be greeted by three panels:

  • The interactive R console/Terminal (entire left)
  • Environment/History/Connections (tabbed in upper right)
  • Files/Plots/Packages/Help/Viewer (tabbed in lower right)
RStudio layout

Once you open files, such as R scripts, an editor panel will also +open in the top left.

RStudio layout with .R file open
+ +

R scripts +


Any commands that you write in the R console can be saved to a file to be re-run later. Files containing R code to be run in this way are called R scripts. R scripts have .R at the end of their names to let you know what they are.


Workflow within RStudio +


There are two main ways one can work within RStudio:

  1. Test and play within the interactive R console, then copy code into a .R file to run later.
     • This works well when doing small tests and initially starting off.
     • It quickly becomes laborious.
  2. Start writing in a .R file and use RStudio’s shortcut keys for the Run command to push the current line, selected lines or modified lines to the interactive R console.
     • This is a great way to start; all your code is saved for later.
     • You will be able to run the file you create from within RStudio or using R’s source() function.
+ +

Tip: Running segments of your code +


RStudio offers you great flexibility in running code from within the +editor window. There are buttons, menu choices, and keyboard shortcuts. +To run the current line, you can

  1. click on the Run button above the editor panel, or
  2. select “Run Lines” from the “Code” menu, or
  3. hit Ctrl+Return in Windows or Linux or ⌘+Return on OS X. (This shortcut can also be seen by hovering the mouse over the button.)

To run a block of code, select it and then Run. If you have modified a line of code within a block of code you have just run, there is no need to reselect the section and Run; you can use the next button along, Re-run the previous region. This will run the previous code block including the modifications you have made.

Introduction to R +


Much of your time in R will be spent in the R interactive console. +This is where you will run all of your code, and can be a useful +environment to try out ideas before adding them to an R script file. +This console in RStudio is the same as the one you would get if you +typed in R in your command-line environment.


The first thing you will see in the R interactive session is a bunch +of information, followed by a “>” and a blinking cursor. In many ways +this is similar to the shell environment you learned about during the +shell lessons: it operates on the same idea of a “Read, evaluate, print +loop”: you type in commands, R tries to execute them, and then returns a +result.


Using R as a calculator +


The simplest thing you could do with R is to do arithmetic:


R +

+1 + 100


[1] 101

And R will print out the answer, with a preceding “[1]”. [1] is the +index of the first element of the line being printed in the console. For +more information on indexing vectors, see Episode +6: Subsetting Data.


If you type in an incomplete command, R will wait for you to complete it. If you are familiar with the Unix shell bash, you may recognize this behavior.


R +

> 1 +



Any time you hit return and the R session shows a “+” instead of a +“>”, it means it’s waiting for you to complete the command. If you +want to cancel a command you can hit Esc and RStudio will +give you back the “>” prompt.

+ +

Tip: Canceling commands +


If you’re using R from the command line instead of from within +RStudio, you need to use Ctrl+C instead of +Esc to cancel the command. This applies to Mac users as +well!


Canceling a command isn’t only useful for killing incomplete +commands: you can also use it to tell R to stop running code (for +example if it’s taking much longer than you expect), or to get rid of +the code you’re currently writing.


When using R as a calculator, the order of operations is the same as +you would have learned back in school.


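From highest to lowest precedence (multiplication and division share one level, as do addition and subtraction, and within each of those levels evaluation runs left to right):

  • Parentheses: (, )
  • Exponents: ^ or **
  • Multiply: *
  • Divide: /
  • Add: +
  • Subtract: -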

R +

+3 + 5 * 2


[1] 13

Use parentheses to group operations in order to force the order of +evaluation if it differs from the default, or to make clear what you +intend.


R +

+(3 + 5) * 2


[1] 16

This can get unwieldy when not needed, but clarifies your intentions. +Remember that others may later read your code.


R +

+(3 + (5 * (2 ^ 2))) # hard to read
+3 + 5 * 2 ^ 2       # clear, if you remember the rules
+3 + 5 * (2 ^ 2)     # if you forget some rules, this might help

The text after each line of code is called a “comment”. Anything that +follows after the hash (or octothorpe) symbol # is ignored +by R when it executes code.


Very small or very large numbers are printed in scientific notation:


R +

+2/10000


[1] 2e-04

Here the e is shorthand for “multiplied by 10^XX”, so 2e-4 is shorthand for 2 * 10^(-4).


You can write numbers in scientific notation too:


R +

+5e3  # Note the lack of minus here


[1] 5000

Mathematical functions +


R has many built-in mathematical functions. To call a function, we type its name, followed by opening and closing parentheses. Functions take arguments as inputs; anything we type inside the parentheses of a function is considered an argument. Depending on the function, the number of arguments can vary from none to multiple. For example:


R +

+getwd() #returns an absolute filepath

doesn’t require an argument, whereas for the next set of mathematical +functions we will need to supply the function a value in order to +compute the result.


R +

+sin(1)  # trigonometry functions


[1] 0.841471

R +

+log(1)  # natural logarithm


[1] 0

R +

+log10(10) # base-10 logarithm


[1] 1

R +

+exp(0.5) # e^(1/2)


[1] 1.648721
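
As a small illustration of functions that take more than one argument, the sketch below calls two base R functions, round() and log(), and supplies the second argument by name. The numbers are arbitrary values chosen for illustration.

```r
# round() takes the number to round and, optionally, how many digits to keep
rounded <- round(3.14159, digits = 2)
print(rounded)

# log() uses the natural logarithm by default; base selects another base
log_base_2 <- log(8, base = 2)
print(log_base_2)
```

Naming an argument (digits = 2, base = 2) makes the call easier to read than relying on argument position alone.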

Don’t worry about trying to remember every function in R. You can +look them up on Google, or if you can remember the start of the +function’s name, use the tab completion in RStudio.


This is one advantage that RStudio has over R on its own: it has auto-completion abilities that allow you to more easily look up functions, their arguments, and the values that they take.


Typing a ? before the name of a command will open the help page for that command. When using RStudio, this will open the ‘Help’ pane; if using R in the terminal, the help page will be shown as plain text in the terminal or opened in your browser, depending on how R is configured. The help page will include a detailed description of the command and how it works. Scrolling to the bottom of the help page will usually show a collection of code examples which illustrate command usage. We’ll go through an example later.


Comparing things +


We can also do comparisons in R:


R +

+1 == 1  # equality (note two equals signs, read as "is equal to")


[1] TRUE

R +

+1 != 2  # inequality (read as "is not equal to")


[1] TRUE

R +

+1 < 2  # less than


[1] TRUE

R +

+1 <= 1  # less than or equal to


[1] TRUE

R +

+1 > 0  # greater than


[1] TRUE

R +

+1 >= -9 # greater than or equal to


[1] TRUE
+ +

Tip: Comparing Numbers +


A word of warning about comparing numbers: you should never use +== to compare two numbers unless they are integers (a data +type which can specifically represent only whole numbers).


Computers may only represent decimal numbers with a certain degree of +precision, so two numbers which look the same when printed out by R, may +actually have different underlying representations and therefore be +different by a small margin of error (called Machine numeric +tolerance).


Instead you should use the all.equal function.


Further reading: http://floating-point-gui.de/
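
The warning above can be seen with a short sketch. The values 0.1, 0.2 and 0.3 are a commonly used illustration and are not taken from the lesson data. Because all.equal() returns a description of the difference rather than FALSE when values differ, it is wrapped in isTRUE() here.

```r
# 0.1 + 0.2 and 0.3 look the same when printed but are stored slightly differently
a <- 0.1 + 0.2
b <- 0.3

exact_match <- a == b                    # strict comparison of the stored values
close_match <- isTRUE(all.equal(a, b))   # comparison within a small tolerance

print(exact_match)
print(close_match)
```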


Variables and assignment +


We can store values in variables using the assignment operator +<-, like this:


R +

+x <- 1/40

Notice that assignment does not print a value. Instead, we stored it +for later in something called a variable. +x now contains the value +0.025:


R +

+x


[1] 0.025

More precisely, the stored value is a decimal approximation +of this fraction called a floating point +number.


Look for the Environment tab in the top right panel of +RStudio, and you will see that x and its value have +appeared. Our variable x can be used in place of a number +in any calculation that expects a number:


R +

+log(x)


[1] -3.688879

Notice also that variables can be reassigned:


R +

+x <- 100

x used to contain the value 0.025 and now it has the +value 100.


Assignment values can contain the variable being assigned to:


R +

+x <- x + 1 #notice how RStudio updates its description of x on the top right tab
+y <- x * 2

The right hand side of the assignment can be any valid R expression. +The right hand side is fully evaluated before the assignment +occurs.


Variable names can contain letters, numbers, underscores and periods but no spaces. They must start with a letter or a period followed by a letter (they cannot start with a number nor an underscore). Variables beginning with a period are hidden variables. Different people use different conventions for long variable names. These include:

  • periods.between.words
  • underscores_between_words
  • camelCaseToSeparateWords

What you use is up to you, but be consistent.
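
One way to check the naming rules above is base R's make.names(), which rewrites any string into a syntactically valid name. This is a sketch only, and the candidate names are hypothetical examples made up for illustration.

```r
# Hypothetical candidate names, chosen only for illustration
candidates <- c("total_mass", "total.mass", "totalMass", "2nd_total", "total mass")

# make.names() rewrites a string into a syntactically valid name;
# a candidate that comes back unchanged was already valid
is_valid <- make.names(candidates) == candidates
print(data.frame(candidates, is_valid))
```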


It is also possible to use the = operator for +assignment:


R +

+x = 1/40

But this is much less common among R users. The most important thing +is to be consistent with the operator you use. There +are occasionally places where it is less confusing to use +<- than =, and it is the most common symbol +used in the community. So the recommendation is to use +<-.

+ +

Challenge 1 +


Which of the following are valid R variable names?


R +

+ +

The following can be used as R variables:


R +


The following creates a hidden variable:


R +


The following will not be able to be used to create a variable


R +


Vectorization +


One final thing to be aware of is that R is vectorized, +meaning that variables and functions can have vectors as values. In +contrast to physics and mathematics, a vector in R describes a set of +values in a certain order of the same data type. For example


R +

+1:5


[1] 1 2 3 4 5

R +

+2^(1:5)


[1]  2  4  8 16 32

R +

+x <- 1:5
+2^x


[1]  2  4  8 16 32

This is incredibly powerful; we will discuss this further in an +upcoming lesson.
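
A further sketch of vectorized operations, using a made-up vector of heights: arithmetic and comparisons are applied to every element without writing a loop.

```r
heights_cm <- c(150, 165, 180)           # hypothetical measurements

heights_m <- heights_cm / 100            # the division is applied to every element
is_tall <- heights_cm > 160              # comparisons are element-wise too
doubled <- heights_cm + heights_cm       # element-by-element addition

print(heights_m)
print(is_tall)
print(doubled)
```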


Managing your environment +


There are a few useful commands you can use to interact with the R +session.


ls will list all of the variables and functions stored +in the global environment (your working R session):


R +

+ls()


[1] "x" "y"
+ +

Tip: hidden objects +


Like in the shell, ls will hide any variables or +functions starting with a “.” by default. To list all objects, type +ls(all.names=TRUE) instead
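
The difference between the two listings can be sketched in a scratch environment, so that the global environment is left untouched. The object names here are placeholders.

```r
# Work in a scratch environment rather than the global one
scratch <- new.env()
assign("visible_value", 1, envir = scratch)
assign(".hidden_value", 2, envir = scratch)

default_listing <- ls(scratch)                  # skips names starting with "."
full_listing <- ls(scratch, all.names = TRUE)   # includes them

print(default_listing)
print(full_listing)
```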


Note here that we didn’t give any arguments to ls, but +we still needed to give the parentheses to tell R to call the +function.


If we type ls by itself, R prints a bunch of code +instead of a listing of objects.


R +

+ls


function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE, 
+    pattern, sorted = TRUE) 
+    if (!missing(name)) {
+        pos <- tryCatch(name, error = function(e) e)
+        if (inherits(pos, "error")) {
+            name <- substitute(name)
+            if (!is.character(name)) 
+                name <- deparse(name)
+            warning(gettextf("%s converted to character string", 
+                sQuote(name)), domain = NA)
+            pos <- name
+        }
+    }
+    all.names <- .Internal(ls(envir, all.names, sorted))
+    if (!missing(pattern)) {
+        if ((ll <- length(grep("[", pattern, fixed = TRUE))) && 
+            ll != length(grep("]", pattern, fixed = TRUE))) {
+            if (pattern == "[") {
+                pattern <- "\\["
+                warning("replaced regular expression pattern '[' by  '\\\\['")
+            }
+            else if (length(grep("[^\\\\]\\[<-", pattern))) {
+                pattern <- sub("\\[<-", "\\\\\\[<-", pattern)
+                warning("replaced '[<-' by '\\\\[<-' in regular expression pattern")
+            }
+        }
+        grep(pattern, all.names, value = TRUE)
+    }
+    else all.names
+<bytecode: 0x557b0600c360>
+<environment: namespace:base>

What’s going on here?


Like everything in R, ls is the name of an object, and +entering the name of an object by itself prints the contents of the +object. The object x that we created earlier contains 1, 2, +3, 4, 5:


R +

+x


[1] 1 2 3 4 5

The object ls contains the R code that makes the +ls function work! We’ll talk more about how functions work +and start writing our own later.


You can use rm to delete objects you no longer need:


R +

+rm(x)

If you have lots of things in your environment and want to delete all +of them, you can pass the results of ls to the +rm function:


R +

+rm(list = ls())

In this case we’ve combined the two. Like the order of operations, +anything inside the innermost parentheses is evaluated first, and so +on.


In this case we’ve specified that the results of ls should be used for the list argument in rm. When assigning values to arguments by name, you must use the = operator.


If instead we use <-, there will be unintended side +effects, or you may get an error message:


R +

+rm(list <- ls())


Error in rm(list <- ls()): ... must contain names or character strings
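
The kind of unintended side effect mentioned above can be sketched with mean(), which accepts any numeric vector. Inside a call, = only names an argument, whereas <- performs a real assignment in the calling environment before the value is passed on. The function demo_assignment() is a hypothetical helper written for this illustration.

```r
demo_assignment <- function() {
  mean(x = 1:10)    # x is only the argument name inside mean()
  after_equals <- exists("x", envir = environment(), inherits = FALSE)

  mean(x <- 1:10)   # creates a variable x here, then passes its value to mean()
  after_arrow <- exists("x", envir = environment(), inherits = FALSE)

  c(after_equals = after_equals, after_arrow = after_arrow)
}

result <- demo_assignment()
print(result)
```

Only the second call leaves a variable named x behind.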
+ +

Tip: Warnings vs. Errors +


Pay attention when R does something unexpected! Errors, like above, +are thrown when R cannot proceed with a calculation. Warnings on the +other hand usually mean that the function has run, but it probably +hasn’t worked as expected.


In both cases, the message that R prints out usually gives you clues on how to fix the problem.
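
A minimal sketch of the difference, using base R's tryCatch() to capture the messages instead of printing them. The error text "cannot proceed" is made up for this example; the exact wording of the warning depends on your R installation's language settings.

```r
# A warning: the conversion still runs, but the result is NA
converted <- suppressWarnings(as.numeric("abc"))

# Capture the warning and error messages as text
warning_text <- tryCatch(as.numeric("abc"), warning = function(w) conditionMessage(w))
error_text <- tryCatch(stop("cannot proceed"), error = function(e) conditionMessage(e))

print(converted)
print(warning_text)
print(error_text)
```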


R Packages +


It is possible to add functions to R by writing a package, or by obtaining a package written by someone else. As of this writing, there are over 10,000 packages available on CRAN (the Comprehensive R Archive Network). R and RStudio have functionality for managing packages:

  • You can see what packages are installed by typing installed.packages()
  • You can install packages by typing install.packages("packagename"), where packagename is the package name, in quotes.
  • You can update installed packages by typing update.packages()
  • You can remove a package with remove.packages("packagename")
  • You can make a package available for use with library(packagename)

Packages can also be viewed, loaded, and detached in the Packages tab +of the lower right panel in RStudio. Clicking on this tab will display +all of the installed packages with a checkbox next to them. If the box +next to a package name is checked, the package is loaded and if it is +empty, the package is not loaded. Click an empty box to load that +package and click a checked box to detach that package.


Packages can be installed and updated from the Packages tab with the Install and Update buttons at the top of the tab.
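
When a script depends on several packages, it can help to check which ones are missing before installing anything. The following is a sketch, not a required workflow; it uses base R's requireNamespace(), which returns TRUE when a package can be loaded (and loads its namespace as a side effect). The install step is left commented out because it needs an internet connection.

```r
wanted <- c("ggplot2", "plyr", "gapminder")

# TRUE for each package that is already installed and loadable
is_installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
to_install <- wanted[!is_installed]
print(to_install)

# Uncomment to install whatever is missing
# if (length(to_install) > 0) install.packages(to_install)
```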

+ +

Challenge 2 +


What will be the value of each variable after each statement in the +following program?


R +

+mass <- 47.5
+age <- 122
+mass <- mass * 2.3
+age <- age - 20
+ +

R +

+mass <- 47.5

This will give a value of 47.5 for the variable mass


R +

+age <- 122

This will give a value of 122 for the variable age


R +

+mass <- mass * 2.3

This will multiply the existing value of 47.5 by 2.3 to give a new +value of 109.25 to the variable mass.


R +

+age <- age - 20

This will subtract 20 from the existing value of 122 to give a new +value of 102 to the variable age.

+ +

Challenge 3 +


Run the code from the previous challenge, and write a command to +compare mass to age. Is mass larger than age?

+ +

One way of answering this question in R is to use the +> to set up the following:


R +

+mass > age


[1] TRUE

This should yield a logical (Boolean) value of TRUE since 109.25 is greater than 102.

+ +

Challenge 4 +


Clean up your working environment by deleting the mass and age +variables.

+ +

We can use the rm command to accomplish this task


R +

+rm(age, mass)
+ +

Challenge 5 +


Install the following packages: ggplot2, +plyr, gapminder

+ +

We can use the install.packages() command to install the +required packages.


R +

+install.packages("ggplot2")
+install.packages("plyr")
+install.packages("gapminder")

An alternate solution, to install multiple packages with a single +install.packages() command is:


R +

+install.packages(c("ggplot2", "plyr", "gapminder"))
+ +

Keypoints +

  • Use RStudio to write and run R programs.
  • +
  • R has the usual arithmetic operators and mathematical +functions.
  • +
  • Use <- to assign values to variables.
  • +
  • Use ls() to list the variables in a program.
  • +
  • Use rm() to delete objects in a program.
  • +
  • Use install.packages() to install packages +(libraries).
  • +
+ + diff --git a/02-project-intro.html b/02-project-intro.html new file mode 100644 index 000000000..3878b4fdb --- /dev/null +++ b/02-project-intro.html @@ -0,0 +1,821 @@ + +R for Reproducible Scientific Analysis: Project Management With RStudio +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Project Management With RStudio


Last updated on 2023-10-26

+ + + +
+ +
+ + + +




  • How can I manage my projects in R?
  • +


  • Create self-contained projects in RStudio
  • +

Introduction +


The scientific process is naturally incremental, and many projects +start life as random notes, some code, then a manuscript, and eventually +everything is a bit mixed together.

+ +

Most people tend to organize their projects like this:

Screenshot of file manager demonstrating bad project organisation

There are many reasons why we should ALWAYS avoid this:

  1. It is really hard to tell which version of your data is the original and which is the modified;
  2. It gets really messy because it mixes files with various extensions together;
  3. It probably takes you a lot of time to actually find things, and relate the correct figures to the exact code that has been used to generate them.

A good project layout will ultimately make your life easier:

  • It will help ensure the integrity of your data;
  • +
  • It makes it simpler to share your code with someone else (a +lab-mate, collaborator, or supervisor);
  • +
  • It allows you to easily upload your code with your manuscript +submission;
  • +
  • It makes it easier to pick the project back up after a break.
  • +

A possible solution +


Fortunately, there are tools and packages which can help you manage +your work effectively.


One of the most powerful and useful aspects of RStudio is its project +management functionality. We’ll be using this today to create a +self-contained, reproducible project.

+ +

Challenge 1: Creating a self-contained +project +


We’re going to create a new project in RStudio:

  1. Click the “File” menu button, then “New Project”.
  2. Click “New Directory”.
  3. Click “New Project”.
  4. Type in the name of the directory to store your project, e.g. “my_project”.
  5. If available, select the checkbox for “Create a git repository.”
  6. Click the “Create Project” button.

The simplest way to open an RStudio project once it has been created +is to click through your file system to get to the directory where it +was saved and double click on the .Rproj file. This will +open RStudio and start your R session in the same directory as the +.Rproj file. All your data, plots and scripts will now be +relative to the project directory. RStudio projects have the added +benefit of allowing you to open multiple projects at the same time each +open to its own project directory. This allows you to keep multiple +projects open without them interfering with each other.

+ +

Challenge 2: Opening an RStudio project +through the file system +

  1. Exit RStudio.
  2. Navigate to the directory where you created a project in Challenge 1.
  3. Double click on the .Rproj file in that directory.

Best practices for project organization +


Although there is no “best” way to lay out a project, there are some +general principles to adhere to that will make project management +easier:


Treat data as read only


This is probably the most important goal of setting up a project. Data is typically time consuming and/or expensive to collect. Working with it interactively (e.g., in Excel) where it can be modified means you are never sure of where the data came from, or how it has been modified since collection. It is therefore a good idea to treat your data as “read-only”.


Data Cleaning


In many cases your data will be “dirty”: it will need significant preprocessing to get into a format R (or any other programming language) will find useful. This task is sometimes called “data munging”. Storing the cleaning scripts in a separate folder, and creating a second “read-only” data folder to hold the “cleaned” data sets, can prevent confusion between the two sets.


Treat generated output as disposable


Anything generated by your scripts should be treated as disposable: +it should all be able to be regenerated from your scripts.


There are lots of different ways to manage this output. Having an output folder with different sub-directories for each separate analysis makes it easier later, since many analyses are exploratory and don’t end up being used in the final project, and some of the analyses get shared between projects.

+ +

Tip: Good Enough Practices for Scientific +Computing +


Good Enough Practices for Scientific Computing gives the following recommendations for project organization:

  1. Put each project in its own directory, which is named after the project.
  2. Put text documents associated with the project in the doc directory.
  3. Put raw data and metadata in the data directory, and files generated during cleanup and analysis in a results directory.
  4. Put source for the project’s scripts and programs in the src directory, and programs brought in from elsewhere or compiled locally in the bin directory.
  5. Name all files to reflect their content or function.
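
The directory layout recommended above could be created from R itself. This is a minimal sketch that builds the folders inside a temporary directory; the project name "my_project" is a placeholder, and in practice you would create the folders inside your real project directory.

```r
# Build the layout inside a temporary directory; "my_project" is a placeholder
project_root <- file.path(tempdir(), "my_project")
subdirs <- c("doc", "data", "results", "src", "bin")

for (d in subdirs) {
  # recursive = TRUE also creates project_root if it does not exist yet
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

print(list.files(project_root))
```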

Separate function definition and application


One of the more effective ways to work with R is to start by writing +the code you want to run directly in a .R script, and then running the +selected lines (either using the keyboard shortcuts in RStudio or +clicking the “Run” button) in the interactive R console.


When your project is in its early stages, the initial .R script file usually contains many lines of directly executed code. As it matures, reusable chunks get pulled into their own functions. It’s a good idea to keep these in two separate folders: one to store useful functions that you’ll reuse across analyses and projects, and one to store the analysis scripts.
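
As a sketch of this separation, the example below writes a function definition to its own file and then loads it with source() before using it. The folder name, file name and the helper fahr_to_celsius() are all hypothetical, and a temporary directory stands in for a real project.

```r
# Placeholder layout inside a temporary directory
project_root <- file.path(tempdir(), "functions_demo")
dir.create(file.path(project_root, "functions"), recursive = TRUE, showWarnings = FALSE)

# functions/conversions.R holds only a reusable definition (hypothetical helper)
functions_file <- file.path(project_root, "functions", "conversions.R")
writeLines("fahr_to_celsius <- function(temp_f) (temp_f - 32) * 5 / 9", functions_file)

# The analysis code loads the definitions, then applies them
source(functions_file)
boiling_c <- fahr_to_celsius(212)
print(boiling_c)
```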


Save the data in the data directory


Now that we have a good directory structure, we will save the data file in the data/ directory.

+ +

Challenge 3 +


Download the gapminder data from here.

  1. Download the file (right mouse click on the link above -> “Save link as” / “Save file as”, or click on the link and after the page loads, press Ctrl+S or choose File -> “Save page as”)
  2. Make sure it’s saved under the name gapminder_data.csv
  3. Save the file in the data/ folder within your project.

We will load and inspect these data later.

+ +

Challenge 4 +


It is useful to get some general idea about the dataset, directly +from the command line, before loading it into R. Understanding the +dataset better will come in handy when making decisions on how to load +it in R. Use the command-line shell to answer the following +questions:

  1. What is the size of the file?
  2. How many rows of data does it contain?
  3. What kinds of values are stored in this file?
+ +

By running these commands in the shell:


SH +

ls -lh data/gapminder_data.csv


-rw-r--r-- 1 runner docker 80K Oct 26 09:54 data/gapminder_data.csv

The file size is 80K.


SH +

wc -l data/gapminder_data.csv


1705 data/gapminder_data.csv

There are 1705 lines. The data looks like:


SH +

head data/gapminder_data.csv


+ +
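
The same questions can also be answered from within R. This sketch uses a small stand-in file with made-up contents so that it runs anywhere; with the real data you would point csv_path at data/gapminder_data.csv instead.

```r
# Stand-in file; the column names and values are invented for illustration
csv_path <- file.path(tempdir(), "example_data.csv")
writeLines(c("name,value", "a,1", "b,2", "c,3"), csv_path)

size_bytes <- file.size(csv_path)                  # similar to ls -l
line_count <- length(readLines(csv_path))          # similar to wc -l
first_lines <- head(readLines(csv_path), n = 2)    # similar to head

print(size_bytes)
print(line_count)
print(first_lines)
```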

Tip: command line in RStudio +


The Terminal tab in the console pane provides a convenient place within RStudio to interact directly with the command line.


Working directory


Knowing R’s current working directory is important because when you +need to access other files (for example, to import a data file), R will +look for them relative to the current working directory.


Each time you create a new RStudio Project, it will create a new +directory for that project. When you open an existing +.Rproj file, it will open that project and set R’s working +directory to the folder that file is in.

+ +

Challenge 5 +


You can check the current working directory with the +getwd() command, or by using the menus in RStudio.

  1. In the console, type getwd() (“wd” is short for “working directory”) and hit Enter.
  2. In the Files pane, double click on the data folder to open it (or navigate to any other folder you wish). To get the Files pane back to the current working directory, click “More” and then select “Go To Working Directory”.

You can change the working directory with setwd(), or by +using RStudio menus.

  1. In the console, type setwd("data") and hit Enter. Type getwd() and hit Enter to see the new working directory.
  2. In the menus at the top of the RStudio window, click the “Session” menu button, and then select “Set Working Directory” and then “Choose Directory”. Next, in the file browser window that opens, navigate back to the project directory, and click “Open”. Note that a setwd command will automatically appear in the console.
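
If you change the working directory from code, it is easy to lose track of where you started. A minimal sketch of changing directory and then restoring it, using a temporary directory as the destination:

```r
original_dir <- getwd()             # remember where we started

previous_dir <- setwd(tempdir())    # setwd() returns the previous directory
moved_to <- getwd()

setwd(original_dir)                 # go back so later relative paths still work
restored <- getwd()

print(moved_to)
print(restored)
```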
+ +

Tip: File does not exist errors +


When you’re attempting to reference a file in your R code and you’re +getting errors saying the file doesn’t exist, it’s a good idea to check +your working directory. You need to either provide an absolute path to +the file, or you need to make sure the file is saved in the working +directory (or a subfolder of the working directory) and provide a +relative path.
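
Before reading a file, you can check whether R can see it. The sketch below uses a stand-in project folder in a temporary directory (all names are placeholders) to show a relative path being resolved against the working directory, and normalizePath() turning it into an absolute path.

```r
# Stand-in project folder; names are placeholders
project_root <- file.path(tempdir(), "path_demo")
dir.create(file.path(project_root, "data"), recursive = TRUE, showWarnings = FALSE)
writeLines("x,y", file.path(project_root, "data", "example.csv"))

relative_path <- file.path("data", "example.csv")

old_dir <- setwd(project_root)
found_relative <- file.exists(relative_path)    # resolved against the working directory
absolute_path <- normalizePath(relative_path)   # full path to the same file
setwd(old_dir)

found_absolute <- file.exists(absolute_path)    # works from any working directory
print(c(found_relative, found_absolute))
```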


Version Control


It is important to use version control with projects. Go here +for a good lesson which describes using Git with RStudio.

+ +

Keypoints +

  • Use RStudio to create and manage projects with consistent +layout.
  • +
  • Treat raw data as read-only.
  • +
  • Treat generated output as disposable.
  • +
  • Separate function definition and application.
  • +
+ + diff --git a/03-seeking-help.html b/03-seeking-help.html new file mode 100644 index 000000000..3e5fb236e --- /dev/null +++ b/03-seeking-help.html @@ -0,0 +1,860 @@ + +R for Reproducible Scientific Analysis: Seeking Help +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Seeking Help


Last updated on 2023-10-26

+ + + +
+ +
+ + + +




  • How can I get help in R?
  • +


  • To be able to read R help files for functions and special +operators.
  • +
  • To be able to use CRAN task views to identify packages to solve a +problem.
  • +
  • To be able to seek help from your peers.
  • +

Reading Help Files +


R, and every package, provides help files for functions. The general syntax to look up help for a function, “function_name”, that is in a package loaded into your namespace (your interactive R session) is:


R +

+?function_name
+help(function_name)

For example, take a look at the help file for write.table(); we will be using a similar function in an upcoming episode.


R +

+?write.table

This will load up a help page in RStudio (or as plain text in R +itself).


Each help page is broken down into sections:

  • Description: An extended description of what the function does.
  • Usage: The arguments of the function and their default values (which can be changed).
  • Arguments: An explanation of the data each argument is expecting.
  • Details: Any important details to be aware of.
  • Value: The data the function returns.
  • See Also: Any related functions you might find useful.
  • Examples: Some examples for how to use the function.

Different functions might have different sections, but these are the +main ones you should be aware of.


Notice how related functions might call for the same help file:


R +

+?write.csv

This is because these functions have very similar applicability and +often share the same arguments as inputs to the function, so package +authors often choose to document them together in a single help +file.

+ +

Tip: Running Examples +


From within the function help page, you can highlight code in the Examples and hit Ctrl+Return to run it in the RStudio console. This gives you a quick way to get a feel for how a function works.

+ +

Tip: Reading Help Files +


One of the most daunting aspects of R is the large number of +functions available. It would be prohibitive, if not impossible to +remember the correct usage for every function you use. Luckily, using +the help files means you don’t have to remember that!


Special Operators +


To seek help on special operators, use quotes or backticks:


R +

+?"<-"
+?`<-`

Getting Help with Packages +


Many packages come with “vignettes”: tutorials and extended example +documentation. Without any arguments, vignette() will list +all vignettes for all installed packages; +vignette(package="package-name") will list all available +vignettes for package-name, and +vignette("vignette-name") will open the specified +vignette.


If a package doesn’t have any vignettes, you can usually find help by +typing help("package-name").


RStudio also has a set of excellent cheatsheets for +many packages.


When You Remember Part of the Function Name +


If you’re not sure what package a function is in or how it’s +specifically spelled, you can do a fuzzy search:


R +


A fuzzy search is when you search for an approximate string match. +For example, you may remember that the function to set your working +directory includes “set” in its name. You can do a fuzzy search to help +you identify the function:


R +


When You Have No Idea Where to Begin +


If you don’t know what function or package you need to use, CRAN Task Views is a specially maintained list of packages grouped into fields. This can be a good starting point.


When Your Code Doesn’t Work: Seeking Help from Your Peers +


If you’re having trouble using a function, 9 times out of 10 the answer you seek has already been posted on Stack Overflow. You can search using the [r] tag. Please make sure to see their page on how to ask a good question.


If you can’t find the answer, there are a few useful functions to +help you ask your peers:


R +


dput() will dump the data you’re working with into a format that can be copied and pasted by others into their own R session.


R +



R version 4.3.1 (2023-06-16)
+Platform: x86_64-pc-linux-gnu (64-bit)
+Running under: Ubuntu 22.04.3 LTS
+Matrix products: default
+BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
+LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
+ [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
+ [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
+time zone: UTC
+tzcode source: system (glibc)
+attached base packages:
+[1] stats     graphics  grDevices utils     datasets  methods   base     
+loaded via a namespace (and not attached):
+[1] compiler_4.3.1    tools_4.3.1       rstudioapi_0.15.0 yaml_2.3.7       
+[5] knitr_1.43        xfun_0.40         renv_1.0.3        evaluate_0.21    

sessionInfo() will print out your current version of R, as well as any packages you have loaded. This can be useful for others to help reproduce and debug your issue.
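As a minimal sketch of how dput() supports this kind of sharing, the text it prints can be evaluated in another session to rebuild the same object. The data frame name example_df and its contents are made up for illustration.

```r
# Hypothetical data frame standing in for "the data you're working with"
example_df <- data.frame(id = c("a", "b"), value = c(1.5, 2.25))

# dput() prints R code that recreates the object; capture that code as text
dput_text <- paste(capture.output(dput(example_df)), collapse = "\n")
cat(dput_text, "\n")

# A colleague who pastes that text into their own session gets the object back
restored <- eval(parse(text = dput_text))
all.equal(example_df, restored)
```

In practice you would paste the printed text straight into your question rather than capturing it in a variable; the capture step is only there so the round trip can be checked in one script.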

+ +

Challenge 1 +


Look at the help page for the c function. What kind of +vector do you expect will be created if you evaluate the following:


R +

+c(1, 2, 3)
+c('d', 'e', 'f')
+c(1, 2, 'f')
+ +

The c() function creates a vector, in which all elements +are of the same type. In the first case, the elements are numeric, in +the second, they are characters, and in the third they are also +characters: the numeric values are “coerced” to be characters.

+ +

Challenge 2 +


Look at the help for the paste function. You will need +to use it later. What’s the difference between the sep and +collapse arguments?

+ +

To look at the help for the paste() function, use:


R +


The difference between sep and collapse is +a little tricky. The paste function accepts any number of +arguments, each of which can be a vector of any length. The +sep argument specifies the string used between concatenated +terms — by default, a space. The result is a vector as long as the +longest argument supplied to paste. In contrast, +collapse specifies that after concatenation the elements +are collapsed together using the given separator, the result +being a single string.


It is important to call the arguments explicitly by typing out the argument name, e.g. sep = ",", so the function understands to use the “,” as a separator and not as a term to concatenate. For example:


R +

+paste(c("a","b"), "c")


[1] "a c" "b c"

R +

+paste(c("a","b"), "c", ",")


[1] "a c ," "b c ,"

R +

+paste(c("a","b"), "c", sep = ",")


[1] "a,c" "b,c"

R +

+paste(c("a","b"), "c", collapse = "|")


[1] "a c|b c"

R +

+paste(c("a","b"), "c", sep = ",", collapse = "|")


[1] "a,c|b,c"

(For more information, scroll to the bottom of the +?paste help page and look at the examples, or try +example('paste').)

+ +

Challenge 3 +


Use help to find a function (and its associated parameters) that you +could use to load data from a tabular file in which columns are +delimited with “\t” (tab) and the decimal point is a “.” (period). This +check for decimal separator is important, especially if you are working +with international colleagues, because different countries have +different conventions for the decimal point (i.e. comma vs period). +Hint: use ??"read table" to look up functions related to +reading in tabular data.

+ +

The standard R function for reading tab-delimited files with a period +decimal separator is read.delim(). You can also do this with +read.table(file, sep="\t") (the period is the +default decimal separator for read.table()), +although you may have to change the comment.char argument +as well if your data file contains hash (#) characters.
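The two approaches can be compared on a small file, as a sketch. The file contents and the column names sample and conc are invented for this example, and header = TRUE is passed to read.table() here because, unlike read.delim(), it does not assume a header row.

```r
# Write a small tab-delimited file to a temporary location (made-up data)
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("sample\tconc", "a\t1.5", "b\t2.25"), tsv_path)

# read.delim() assumes a header row, tab separators and "." as decimal point
from_delim <- read.delim(tsv_path)

# read.table() needs the header and the separator spelled out
from_table <- read.table(tsv_path, header = TRUE, sep = "\t", dec = ".")

str(from_delim)
all.equal(from_delim, from_table)
```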


Other Resources +

+ +

Keypoints +

  • Use help() to get online help in R.
  • +
+ + +
+ +
+ + diff --git a/04-data-structures-part1.html b/04-data-structures-part1.html new file mode 100644 index 000000000..680e3e144 --- /dev/null +++ b/04-data-structures-part1.html @@ -0,0 +1,2396 @@ + +R for Reproducible Scientific Analysis: Data Structures +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Data Structures


Last updated on 2023-10-26

+ + + +
+ +
+ + + +




  • How can I read data in R?
  • +
  • What are the basic data types in R?
  • +
  • How do I represent categorical information in R?
  • +


  • To be able to identify the 5 main data types.
  • +
  • To begin exploring data frames, and understand how they are related +to vectors and lists.
  • +
  • To be able to ask questions from R about the type, class, and +structure of an object.
  • +
  • To understand the information of the attributes “names”, “class”, +and “dim”.
  • +

One of R’s most powerful features is its ability to deal with tabular +data - such as you may already have in a spreadsheet or a CSV file. +Let’s start by making a toy dataset in your data/ +directory, called feline-data.csv:


R +

+cats <- data.frame(coat = c("calico", "black", "tabby"),
+                    weight = c(2.1, 5.0, 3.2),
+                    likes_string = c(1, 0, 1))

We can now save cats as a CSV file. It is good practice +to call the argument names explicitly so the function knows what default +values you are changing. Here we are setting +row.names = FALSE. Recall you can use +?write.csv to pull up the help file to check out the +argument names and their default values.


R +

+write.csv(x = cats, file = "data/feline-data.csv", row.names = FALSE)

The contents of the new file, feline-data.csv:


R +

+ +

Tip: Editing Text files in R +


Alternatively, you can create data/feline-data.csv using +a text editor (Nano), or within RStudio with the File -> New +File -> Text File menu item.


We can load this into R via the following:


R +

+cats <- read.csv(file = "data/feline-data.csv")


    coat weight likes_string
+1 calico    2.1            1
+2  black    5.0            0
+3  tabby    3.2            1

The read.table function is used for reading in tabular data stored in a text file where the columns of data are separated by punctuation characters, as in CSV files (csv = comma-separated values). Tabs and commas are the most common punctuation characters used to separate or delimit data points in such files. For convenience, R provides two other versions of read.table: read.csv for files where the data are separated with commas, and read.delim for files where the data are separated with tabs. Of these three functions, read.csv is the most commonly used. If needed, the default delimiter can be overridden for both read.csv and read.delim.
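Overriding the delimiter might look like the following sketch, which reads semicolon-separated values. The text and the name semicolon_cats are made up, and the text argument is used only so that no file is needed.

```r
# Made-up semicolon-separated content; text= lets us skip creating a file
semicolon_text <- "coat;weight\ncalico;2.1\nblack;5.0"

# Override read.csv()'s default comma separator with a semicolon
semicolon_cats <- read.csv(text = semicolon_text, sep = ";")

semicolon_cats
semicolon_cats$weight
```

Without sep = ";", each line would be read as a single column, because no commas are present to split on.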

+ +

Check your data for factors +


The way R handles text data by default has changed in recent versions. Older versions automatically converted text data into a format called “factors”, but there is a simpler format called “character”. We will hear about factors later, and what to use them for. For now, remember that in most cases they are not needed and only complicate your life, which is why newer R versions read in text as “character”. Check now whether your version of R has automatically created factors, and convert them to “character” format:

  1. Check the data types of your input by typing str(cats).
  2. In the output, look at the type codes after the colons: if you see only “num”, “int” and “chr”, you can continue with the lesson and skip this box. If you find “Factor”, continue to step 3.
  3. Prevent R from automatically creating “factor” data. That can be done by the following code: options(stringsAsFactors = FALSE). Then, re-read the cats table for the change to take effect.
  4. You must set this option every time you restart R. So that you do not forget, include it in your analysis script before you read in any data, for example in one of the first lines.
  5. From R version 4.0.0 onward, text data is no longer converted to factors by default. So you can install this or a newer version to avoid this problem. If you are working on an institute or company computer, ask your administrator to do it.

We can begin exploring our dataset right away, pulling out columns by +specifying them using the $ operator:


R +

+cats$weight


[1] 2.1 5.0 3.2

R +

+cats$coat


[1] "calico" "black"  "tabby" 

We can do other operations on the columns:


R +

+## Say we discovered that the scale weighs two kg light:
+cats$weight + 2


[1] 4.1 7.0 5.2

R +

+paste("My cat is", cats$coat)


[1] "My cat is calico" "My cat is black"  "My cat is tabby" 

But what about


R +

+cats$weight + cats$coat


Error in cats$weight + cats$coat: non-numeric argument to binary operator

Understanding what happened here is key to successfully analyzing +data in R.


Data Types


If you guessed that the last command will return an error because +2.1 plus "black" is nonsense, you’re right - +and you already have some intuition for an important concept in +programming called data types. We can ask what type of data +something is:


R +

+typeof(cats$weight)


[1] "double"

There are 5 main types: double, integer, +complex, logical and character. +For historic reasons, double is also called +numeric.


R +



[1] "double"

R +

+typeof(1L) # The L suffix forces the number to be an integer, since by default R stores numbers as doubles


[1] "integer"

R +



[1] "complex"

R +



[1] "logical"

R +



[1] "character"

No matter how complicated our analyses become, all data in R is +interpreted as one of these basic data types. This strictness has some +really important consequences.
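The double/numeric naming mentioned above can be checked directly. This is a small sketch using only base R functions; the literal values are arbitrary.

```r
# A plain number is stored as a double, but its class is reported as "numeric"
typeof(1)
class(1)

# Integers are a separate type, yet they also count as numeric
typeof(1L)
is.numeric(1L)
is.double(1L)
```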


A user has added details of another cat. This information is in the +file data/feline-data_v2.csv.


R +


R +

+tabby,2.3 or 2.4,1

Load the new cats data like before, and check what type of data we +find in the weight column:


R +

+cats <- read.csv(file="data/feline-data_v2.csv")
+typeof(cats$weight)


[1] "character"

Oh no, our weights aren’t the double type anymore! If we try to do +the same math we did on them before, we run into trouble:


R +

+cats$weight + 2


Error in cats$weight + 2: non-numeric argument to binary operator

What happened? The cats data we are working with is something called a data frame. Data frames are one of the most common and versatile types of data structures we will work with in R. A given column in a data frame cannot be composed of different data types. In this case, R cannot read every value in the weight column as a double, so the data type of the entire column changes to one that is suitable for everything in the column.


When R reads a csv file, it reads it in as a data frame. +Thus, when we loaded the cats csv file, it is stored as a +data frame. We can recognize data frames by the first row that is +written by the str() function:


R +

+str(cats)


'data.frame':	4 obs. of  3 variables:
+ $ coat        : chr  "calico" "black" "tabby" "tabby"
+ $ weight      : chr  "2.1" "5" "3.2" "2.3 or 2.4"
+ $ likes_string: int  1 0 1 1

Data frames are composed of rows and columns, where each column has the same number of rows. Different columns in a data frame can be made up of different data types (this is what makes them so versatile), but everything in a given column needs to be the same type (e.g., double, character, or logical).
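One way to track down the value that forced a column to become character is to attempt the conversion and see which entries fail. The sketch below assumes the weights have already been pulled out as text; the variable names are illustrative.

```r
# The weight column as it was read from the second file, i.e. as text
weights_as_text <- c("2.1", "5", "3.2", "2.3 or 2.4")

# Entries that are not valid numbers become NA (the warning is silenced here)
converted <- suppressWarnings(as.numeric(weights_as_text))

# Positions and original contents of the problematic entries
which(is.na(converted))
weights_as_text[is.na(converted)]
```

Here only the fourth entry fails to convert, which matches the row that was added to the file.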


Let’s explore more about different data structures and how they +behave. For now, let’s remove that extra line from our cats data and +reload it, while we investigate this behavior further:




And back in RStudio:


R +

+cats <- read.csv(file="data/feline-data.csv")

Vectors and Type Coercion


To better understand this behavior, let’s meet another of the data +structures: the vector.


R +

+my_vector <- vector(length = 3)



A vector in R is essentially an ordered list of things, with the special condition that everything in the vector must be the same basic data type. If you don’t choose the data type, it’ll default to logical; or, you can declare an empty vector of whatever type you like.


R +

+another_vector <- vector(mode='character', length=3)


[1] "" "" ""

You can check if something is a vector:


R +

+str(another_vector)


 chr [1:3] "" "" ""

The somewhat cryptic output from this command indicates the basic +data type found in this vector - in this case chr, +character; an indication of the number of things in the vector - +actually, the indexes of the vector, in this case [1:3]; +and a few examples of what’s actually in the vector - in this case empty +character strings. If we similarly do


R +

+str(cats$weight)


 num [1:3] 2.1 5 3.2

we see that cats$weight is a vector, too - the +columns of data we load into R data.frames are all vectors, and +that’s the root of why R forces everything in a column to be the same +basic data type.

+ +

Discussion 1 +


Why is R so opinionated about what we put in our columns of data? How +does this help us?

+ +

By keeping everything in a column the same, we allow ourselves to +make simple assumptions about our data; if you can interpret one entry +in the column as a number, then you can interpret all of them +as numbers, so we don’t have to check every time. This consistency is +what people mean when they talk about clean data; in the long +run, strict consistency goes a long way to making our lives easier in +R.


Coercion by combining vectors


You can also make vectors with explicit contents with the combine +function:


R +

+combine_vector <- c(2,6,3)


[1] 2 6 3

Given what we’ve learned so far, what do you think the following will +produce?


R +

+quiz_vector <- c(2,6,'3')

This is something called type coercion, and it is the source +of many surprises and the reason why we need to be aware of the basic +data types and how R will interpret them. When R encounters a mix of +types (here double and character) to be combined into a single vector, +it will force them all to be the same type. Consider:


R +

+coercion_vector <- c('a', TRUE)


[1] "a"    "TRUE"

R +

+another_coercion_vector <- c(0, TRUE)


[1] 0 1

The type hierarchy


The coercion rules go: logical -> +integer -> double (“numeric”) +-> complex -> character, where -> can +be read as are transformed into. For example, combining +logical and character transforms the result to +character:


R +

+c('a', TRUE)


[1] "a"    "TRUE"

A quick way to recognize character vectors is by the +quotes that enclose them when they are printed.
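Each step of the hierarchy can be observed by combining two neighbouring types and asking for the type of the result. This is a sketch; the step_ variable names are made up.

```r
# Combine each type with the next one up in the coercion hierarchy
step_1 <- c(TRUE, 1L)      # logical + integer
step_2 <- c(1L, 2.5)       # integer + double
step_3 <- c(2.5, 1+0i)     # double + complex
step_4 <- c(1+0i, "a")     # complex + character

# The result always takes the type further along the chain
sapply(list(step_1, step_2, step_3, step_4), typeof)
```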


You can try to force coercion against this flow using the +as. functions:


R +

+character_vector_example <- c('0','2','4')


[1] "0" "2" "4"

R +

+character_coerced_to_double <- as.double(character_vector_example)


[1] 0 2 4

R +

+double_coerced_to_logical <- as.logical(character_coerced_to_double)



As you can see, some surprising things can happen when R forces one +basic data type into another! Nitty-gritty of type coercion aside, the +point is: if your data doesn’t look like what you thought it was going +to look like, type coercion may well be to blame; make sure everything +is the same type in your vectors and your columns of data.frames, or you +will get nasty surprises!
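A few of these surprises are sketched below with made-up values: any non-zero number becomes TRUE, text that R does not recognise as a logical becomes NA, and converting to integer drops the fractional part rather than rounding.

```r
# Zero becomes FALSE; every other number becomes TRUE
from_numbers <- as.logical(c(0, 2, 4))

# "TRUE" and "T" are recognised, "yes" is not and turns into NA
from_text <- as.logical(c("TRUE", "T", "yes"))

# as.integer() truncates towards zero instead of rounding
truncated <- as.integer(c(1.9, -1.9))

from_numbers
from_text
truncated
```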


But coercion can also be very useful! For example, in our +cats data likes_string is numeric, but we know +that the 1s and 0s actually represent TRUE and +FALSE (a common way of representing them). We should use +the logical datatype here, which has two states: +TRUE or FALSE, which is exactly what our data +represents. We can ‘coerce’ this column to be logical by +using the as.logical function:


R +



[1] 1 0 1

R +

+cats$likes_string <- as.logical(cats$likes_string)


+ +

Challenge 1 +


An important part of every data analysis is cleaning the input data. If you know that the input data is all of the same format (e.g. numbers), your analysis is much easier! Clean the cat data set from the section about type coercion.


Copy the code template


Create a new script in RStudio and copy and paste the following code. +Then move on to the tasks below, which help you to fill in the gaps +(______).

# Read data
+cats <- read.csv("data/feline-data_v2.csv")
+# 1. Print the data
+# 2. Show an overview of the table with all data types
+# 3. The "weight" column has the incorrect data type __________.
+#    The correct data type is: ____________.
+# 4. Correct the 4th weight data point with the mean of the two given values
+cats$weight[4] <- 2.35
+#    print the data again to see the effect
+# 5. Convert the weight to the right data type
+cats$weight <- ______________(cats$weight)
+#    Calculate the mean to test yourself
+# If you see the correct mean value (and not NA), you did the exercise
+# correctly!

Instructions for the tasks

+ +

Execute the first statement (read.csv(...)). Then print +the data to the console

+ +

Show the content of any variable by typing its name.


Solution to Challenge 1.1


Two correct solutions:

+ +

2. Overview of the data types +


The data type of your data is as important as the data itself. Use a +function we saw earlier to print out the data types of all columns of +the cats table.

+ +

In the chapter “Data types” we saw two functions that can show data +types. One printed just a single word, the data type name. The other +printed a short form of the data type, and the first few values. We need +the second here.

+ +

Challenge 1 (continued) +


Solution to Challenge 1.2


3. Which data type do we need?


The shown data type is not the right one for this data (weight of a +cat). Which data type do we need?

  • Why did the read.csv() function not choose the correct +data type?
  • +
  • Fill in the gap in the comment with the correct data type for cat +weight!
  • +
+ +

Scroll up to the section about the type +hierarchy to review the available data types

+ +
  • Weight is expressed on a continuous scale (real numbers). The R data +type for this is “double” (also known as “numeric”).
  • +
  • The fourth row has the value “2.3 or 2.4”. That is not one number but two, joined by an English word. Therefore, the “character” data type is chosen. The whole column is now text, because all values in the same column have to be the same data type.
  • +
+ +

4. Correct the problematic value +


The code to assign a new weight value to the problematic fourth row +is given. Think first and then execute it: What will be the data type +after assigning a number like in this example? You can check the data +type after executing to see if you were right.

+ +

Revisit the hierarchy of data types when two different data types are +combined.

+ +

Challenge 1 (continued) +


Solution to challenge 1.4


The data type of the column “weight” is “character”. The assigned +data type is “double”. Combining two data types yields the data type +that is higher in the following hierarchy:

logical < integer < double < complex < character

Therefore, the column is still of type character! We need to manually convert it to “double”.


5. Convert the column “weight” to the correct data type


Cat weights are numbers. But the column does not have this data type yet. Coerce the column to floating point numbers.

+ +

The functions to convert data types start with as.. You +can look for the function further up in the manuscript or use the +RStudio auto-complete function: Type “as.” and then press +the TAB key.

+ +

Challenge 1 (continued) +


Solution to Challenge 1.5


There are two functions that are synonymous for historic reasons:

cats$weight <- as.double(cats$weight)
+cats$weight <- as.numeric(cats$weight)

Some basic vector functions


The combine function, c(), will also append things to an +existing vector:


R +

+ab_vector <- c('a', 'b')


[1] "a" "b"

R +

+combine_example <- c(ab_vector, 'SWC')


[1] "a"   "b"   "SWC"

You can also make series of numbers:


R +

+mySeries <- 1:10


 [1]  1  2  3  4  5  6  7  8  9 10

R +



 [1]  1  2  3  4  5  6  7  8  9 10

R +

+seq(1,10, by=0.1)


 [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4
+[16]  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9
+[31]  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4
+[46]  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
+[61]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4
+[76]  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9
+[91] 10.0

We can ask a few questions about vectors:


R +

+sequence_example <- 20:25
+head(sequence_example, n=2)


[1] 20 21

R +

+tail(sequence_example, n=4)


[1] 22 23 24 25

R +

+length(sequence_example)


[1] 6

R +

+typeof(sequence_example)


[1] "integer"

We can get individual elements of a vector by using the bracket +notation:


R +

+first_element <- sequence_example[1]


[1] 20

To change a single element, use the bracket on the other side of the +arrow:


R +

+sequence_example[1] <- 30


[1] 30 21 22 23 24 25
+ +

Challenge 2 +


Start by making a vector with the numbers 1 through 26. Then, +multiply the vector by 2.

+ +

R +

+x <- 1:26
+x <- x * 2



Another data structure you’ll want in your bag of tricks is the +list. A list is simpler in some ways than the other types, +because you can put anything you want in it. Remember everything in +the vector must be of the same basic data type, but a list can have +different data types:


R +

+list_example <- list(1, "a", TRUE, 1+4i)


+[1] 1
+[1] "a"
+[1] TRUE
+[1] 1+4i

When printing the object structure with str(), we see +the data types of all elements:


R +

+str(list_example)


List of 4
+ $ : num 1
+ $ : chr "a"
+ $ : logi TRUE
+ $ : cplx 1+4i

What is the use of lists? They can organize data of different +types. For example, you can organize different tables that +belong together, similar to spreadsheets in Excel. But there are many +other uses, too.


We will see another example that will maybe surprise you in the next +chapter.


To retrieve one of the elements of a list, use the double +bracket:


R +

+list_example[[2]]


[1] "a"
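The difference between single and double brackets on a list can be seen in a short sketch. The list is re-created so the block runs on its own, and the names one_slot and one_value are illustrative.

```r
list_example <- list(1, "a", TRUE, 1+4i)

# Single brackets keep the list wrapper: the result is a list of length 1
one_slot <- list_example[2]

# Double brackets unwrap the element itself
one_value <- list_example[[2]]

typeof(one_slot)
typeof(one_value)
```

Single brackets are useful for taking a sub-list, for example list_example[2:3], while double brackets are for working with one element's contents.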

The elements of lists can also have names, which are given by prepending them to the values, separated by an equals sign:


R +

+another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE )


+[1] "Numbers"
+ [1]  1  2  3  4  5  6  7  8  9 10
+[1] TRUE

This results in a named list. This gives our object a new capability: we can access single elements in an additional way!


R +

+another_list$title


[1] "Numbers"

Names +


With names, we can give meaning to elements. It is the first time that we have not only the data, but also information that explains it. It is metadata that can be stuck to the object like a label. In R, this is called an attribute. Some attributes enable us to do more with our object, for example, as here, accessing an element by a self-defined name.


Accessing vectors and lists by name


We have already seen how to generate a named list. The way to +generate a named vector is very similar. You have seen this function +before:


R +

+pizza_price <- c( pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50 )

The way to retrieve elements is different, though:


R +

+pizza_price["pizzasubito"]


pizzasubito 
+       5.64 

The approach used for the list does not work:


R +

+pizza_price$pizzafresh


Error in pizza_price$pizzafresh: $ operator is invalid for atomic vectors

It will pay off to remember this error message, as you will meet it in your own analyses. It means that you have just tried accessing an element as if it were in a list, but it is actually in a vector.
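The options that do work on a named vector can be sketched as follows. The vector is re-created so the block is self-contained, and the result names are made up.

```r
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)

# Single brackets return a named vector of length one
with_name <- pizza_price["pizzafresh"]

# Double brackets return the bare value without its name
bare_value <- pizza_price[["pizzafresh"]]

# Converting to a list first makes the $ operator valid
via_list <- as.list(pizza_price)$pizzafresh

names(with_name)
bare_value
via_list
```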


Accessing and changing names


If you are only interested in the names, use the names() +function:


R +

+names(pizza_price)


[1] "pizzasubito" "pizzafresh"  "callapizza" 

We have seen how to access and change single elements of a vector. +The same is possible for names:


R +

+names(pizza_price)[3]


[1] "callapizza"

R +

+names(pizza_price)[3] <- "call-a-pizza"


 pizzasubito   pizzafresh call-a-pizza 
+        5.64         6.60         4.50 
+ +

Challenge 3 +

  • What is the data type of the names of pizza_price? You +can find out using the str() or typeof() +functions.
  • +
+ +

You get the names of an object by wrapping the object name inside +names(...). Similarly, you get the data type of the names +by again wrapping the whole code in typeof(...):


alternatively, use a new variable if this is easier for you to +read:

n <- names(pizza_price)
+typeof(n)
+ +

Challenge 4 +


Instead of just changing some of the names a vector/list already has, +you can also set all names of an object by writing code like (replace +ALL CAPS text):


Create a vector that gives the number for each letter in the +alphabet!

  1. Generate a vector called letter_no with the sequence of +numbers from 1 to 26!
  2. +
  3. R has a built-in object called LETTERS. It is a 26-character vector, from A to Z. Set the names of the number sequence to these 26 letters.
  4. +
  5. Test yourself by calling letter_no["B"], which should +give you the number 2!
  6. +
+ +
letter_no <- 1:26   # or seq(1,26)
+names(letter_no) <- LETTERS

Data frames +


We have seen data frames at the very beginning of this lesson; they represent a table of data. We didn’t go into much further detail with our example cat data frame:


R +



    coat weight likes_string
+1 calico    2.1         TRUE
+2  black    5.0        FALSE
+3  tabby    3.2         TRUE

We can now understand something a bit surprising in our data.frame; +what happens if we run:


R +

+typeof(cats)


[1] "list"

We see that data.frames look like lists ‘under the hood’. Think again +what we heard about what lists can be used for:


Lists organize data of different types


Columns of a data frame are vectors of different types, that are +organized by belonging to the same table.


A data.frame is really a list of vectors. It is a special list in +which all the vectors must have the same length.
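A short sketch of both halves of that statement: list functions work on a data frame, and building one from vectors of mismatched length fails. The names cats_df and uneven are illustrative, and tryCatch() is used only so the error does not stop the script.

```r
cats_df <- data.frame(coat = c("calico", "black", "tabby"),
                      weight = c(2.1, 5.0, 3.2))

# Treated as a list, a data frame has one element per column
is.list(cats_df)
length(cats_df)

# Columns of length 3 and 2 cannot be lined up, so data.frame() raises an error
uneven <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) conditionMessage(e)
)
uneven
```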


How is this “special”-ness written into the object, so that R does +not treat it like any other list, but as a table?


R +

+class(cats)


[1] "data.frame"

A class, just like names, is an attribute attached +to the object. It tells us what this object means for humans.


You might wonder: Why do we need another +what-type-of-object-is-this-function? We already have +typeof()? That function tells us how the object is +constructed in the computer. The class is +the meaning of the object for humans. Consequently, +what typeof() returns is fixed in R (mainly the +five data types), whereas the output of class() is +diverse and extendable by R packages.
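The distinction shows up clearly with objects whose class differs from their storage type. Dates and factors are used here as examples that are not taken from the lesson data.

```r
# A date is stored as a number of days, but its class says it is a Date
a_date <- as.Date("2023-10-26")
typeof(a_date)
class(a_date)

# A factor is stored as integer codes, with class "factor"
coat_factor <- factor(c("calico", "black"))
typeof(coat_factor)
class(coat_factor)
```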


In our cats example, we have a character, a double and a logical variable. As we have seen already, each column of a data.frame is a vector.


R +



[1] "calico" "black"  "tabby" 

R +



[1] "calico" "black"  "tabby" 

R +



[1] "character"

R +



 chr [1:3] "calico" "black" "tabby"

Each row is an observation of different variables, itself a +data.frame, and thus can be composed of elements of different types.


R +



    coat weight likes_string
+1 calico    2.1         TRUE

R +



[1] "list"

R +



'data.frame':	1 obs. of  3 variables:
+ $ coat        : chr "calico"
+ $ weight      : num 2.1
+ $ likes_string: logi TRUE
+ +

Challenge 5 +


There are several subtly different ways to call variables, +observations and elements from data.frames:

  • cats[1]
  • +
  • cats[[1]]
  • +
  • cats$coat
  • +
  • cats["coat"]
  • +
  • cats[1, 1]
  • +
  • cats[, 1]
  • +
  • cats[1, ]
  • +

Try out these examples and explain what is returned by each one.


Hint: Use the function typeof() to examine what +is returned in each case.

+ +

R +



+1 calico
+2  black
+3  tabby

We can think of a data frame as a list of vectors. The single bracket [1] returns the first slice of the list, as another list. In this case it is the first column of the data frame.


R +



[1] "calico" "black"  "tabby" 

The double bracket [[1]] returns the contents of the list item. In this case it is the contents of the first column, a vector of type character.


R +



[1] "calico" "black"  "tabby" 

This example uses the $ character to address items by +name. coat is the first column of the data frame, again a +vector of type character.


R +



+1 calico
+2  black
+3  tabby

Here we are using a single bracket ["coat"], replacing the index number with the column name. Like example 1, the returned object is a list.


R +

+cats[1, 1]


[1] "calico"

This example uses a single bracket, but this time we provide row and column coordinates. The returned object is the value in row 1, column 1. The object is a vector of type character.


R +

+cats[, 1]


[1] "calico" "black"  "tabby" 

Like the previous example, we use single brackets and provide row and column coordinates. The row coordinate is not specified, so R interprets this missing value as all the elements in this column and returns them as a vector.


R +

+cats[1, ]


    coat weight likes_string
+1 calico    2.1         TRUE

Again we use the single bracket with row and column coordinates. The column coordinate is not specified. The return value is a list containing all the values in the first row.

+ +

Tip: Renaming data frame columns +


Data frames have column names, which can be accessed with the +names() function.


R +

+names(cats)


[1] "coat"         "weight"       "likes_string"

If you want to rename the second column of cats, you can +assign a new name to the second element of names(cats).


R +

+names(cats)[2] <- "weight_kg"


    coat weight_kg likes_string
+1 calico       2.1         TRUE
+2  black       5.0        FALSE
+3  tabby       3.2         TRUE



Last but not least is the matrix. We can declare a matrix full of +zeros:


R +

+matrix_example <- matrix(0, ncol=6, nrow=3)


     [,1] [,2] [,3] [,4] [,5] [,6]
+[1,]    0    0    0    0    0    0
+[2,]    0    0    0    0    0    0
+[3,]    0    0    0    0    0    0

What makes it special is the dim() attribute:


R +

+dim(matrix_example)


[1] 3 6

And similar to other data structures, we can ask things about our +matrix:


R +



[1] "double"

R +



[1] "matrix" "array" 

R +



 num [1:3, 1:6] 0 0 0 0 0 0 0 0 0 0 ...

R +



[1] 3

R +



[1] 6
+ +

Challenge 6 +


What do you think will be the result of +length(matrix_example)? Try it. Were you right? Why / why +not?

+ +

What do you think will be the result of +length(matrix_example)?


R +

+matrix_example <- matrix(0, ncol=6, nrow=3)
+length(matrix_example)


[1] 18

Because a matrix is a vector with added dimension attributes, +length gives you the total number of elements in the +matrix.
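This can be demonstrated by starting from a plain vector and attaching the dimensions by hand. The name values is arbitrary.

```r
# Start with an ordinary integer vector of six elements
values <- 1:6

# Adding a dim attribute turns it into a 2 x 3 matrix
dim(values) <- c(2, 3)

is.matrix(values)
length(values)      # still the total number of elements
attributes(values)
```

The data are unchanged by the assignment; only the dim attribute tells R to lay them out in rows and columns.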

+ +

Challenge 7 +


Make another matrix, this time containing the numbers 1:50, with 5 +columns and 10 rows. Did the matrix function fill your +matrix by column, or by row, as its default behaviour? See if you can +figure out how to change this. (hint: read the documentation for +matrix!)

+ +


R +

+x <- matrix(1:50, ncol=5, nrow=10)
+x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row
+ +

Challenge 8 +


Create a list of length two containing a character vector for each of the sections in this part of the workshop:

  • Data types
  • Data structures

Populate each character vector with the names of the data types and data structures we’ve seen so far.

+ +

R +

+dataTypes <- c('double', 'complex', 'integer', 'character', 'logical')
+dataStructures <- c('data.frame', 'vector', 'list', 'matrix')
+answer <- list(dataTypes, dataStructures)

Note: it’s nice to make a list in big writing on the board or taped to the wall listing all of these types and structures. Leave it up for the rest of the workshop to remind people of the importance of these basics.

+ +

Challenge 9 +


Consider the R output of the matrix below:



     [,1] [,2]
+[1,]    4    1
+[2,]    9    5
+[3,]   10    7

What was the correct command used to write this matrix? Examine each command and try to figure out the correct one before typing them. Think about what matrices the other commands will produce.

  1. matrix(c(4, 1, 9, 5, 10, 7), nrow = 3)
  2. matrix(c(4, 9, 10, 1, 5, 7), ncol = 2, byrow = TRUE)
  3. matrix(c(4, 9, 10, 1, 5, 7), nrow = 2)
  4. matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)
+ +


R +

+matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)
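
To see why the other commands fail, one option is to build the target matrix independently and compare every candidate against it. This is only a sketch; `target`, `candidates`, and `matches` are names introduced here for illustration.

```r
# Build the target row by row so we do not presuppose the answer
target <- rbind(c(4, 1), c(9, 5), c(10, 7))

candidates <- list(
  matrix(c(4, 1, 9, 5, 10, 7), nrow = 3),
  matrix(c(4, 9, 10, 1, 5, 7), ncol = 2, byrow = TRUE),
  matrix(c(4, 9, 10, 1, 5, 7), nrow = 2),
  matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)
)

# A candidate matches only if it has the same shape and the same values
matches <- vapply(
  candidates,
  function(m) identical(dim(m), dim(target)) && all(m == target),
  logical(1)
)
which(matches)
```

Only the last candidate matches: the first fills by column, the second puts the values in the wrong order for row-wise filling, and the third has the wrong shape.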
+ +

Keypoints +

  • Use read.csv to read tabular data in R.
  • The basic data types in R are double, integer, complex, logical, and character.
  • Data structures such as data frames or matrices are built on top of lists and vectors, with some added attributes.
+ + +
+ +
+ + diff --git a/05-data-structures-part2.html b/05-data-structures-part2.html new file mode 100644 index 000000000..c85c29794 --- /dev/null +++ b/05-data-structures-part2.html @@ -0,0 +1,1209 @@ + +R for Reproducible Scientific Analysis: Exploring Data Frames +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Exploring Data Frames


Last updated on 2023-10-26

+ + + +
+ +
+ + + +




  • How can I manipulate a data frame?


  • Add and remove rows or columns.
  • Append two data frames.
  • Display basic properties of data frames including size and class of the columns, names, and first few rows.

At this point, you’ve seen it all: in the last lesson, we toured all the basic data types and data structures in R. Everything you do will be a manipulation of those tools. But most of the time, the star of the show is the data frame, the table that we created by loading information from a csv file. In this lesson, we’ll learn a few more things about working with data frames.


Adding columns and rows in data frames +


We already learned that the columns of a data frame are vectors, so that our data are consistent in type throughout the columns. As such, if we want to add a new column, we can start by making a new vector:


R +

+age <- c(2, 3, 5)
+cats


    coat weight likes_string
+1 calico    2.1            1
+2  black    5.0            0
+3  tabby    3.2            1

We can then add this as a column via:


R +

+cbind(cats, age)


    coat weight likes_string age
+1 calico    2.1            1   2
+2  black    5.0            0   3
+3  tabby    3.2            1   5

Note that if we tried to add a vector of ages with a different number of entries than the number of rows in the data frame, it would fail:


R +

+age <- c(2, 3, 5, 12)
+cbind(cats, age)


Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 4

R +

+age <- c(2, 3)
+cbind(cats, age)


Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 2

Why didn’t this work? Of course, R wants to see one element in our new column for every row in the table:


R +

+nrow(cats)

[1] 3

R +

+length(age)

[1] 2

So for it to work we need to have nrow(cats) == length(age). Let’s overwrite the content of cats with our new data frame.


R +

+age <- c(2, 3, 5)
+cats <- cbind(cats, age)
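
The length check can also be made explicit before binding. The following is a minimal sketch, assuming a re-created `cats` data frame; the error message wording is our own:

```r
# Hypothetical re-creation of the lesson's `cats` data frame
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))
age <- c(2, 3, 5)

# Only bind the new column when there is one value per row
if (length(age) == nrow(cats)) {
  cats <- cbind(cats, age)
} else {
  stop("age has ", length(age), " entries but cats has ", nrow(cats), " rows")
}
```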

Now how about adding rows? We already know that the rows of a data frame are lists:


R +

+newRow <- list("tortoiseshell", 3.3, TRUE, 9)
+cats <- rbind(cats, newRow)

Let’s confirm that our new row was added correctly.


R +

+cats

           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
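
A row has to be a list because it mixes types (text and numbers). An alternative, sketched here with a re-created `cats` data frame and a hypothetical `new_row`, is to supply the new row as a one-row data frame so that every value is labelled with its column name:

```r
# Hypothetical re-creation of the lesson's `cats` data frame with ages
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1),
                   age = c(2, 3, 5))

# A one-row data frame names each value, which guards against
# putting values in the wrong column order
new_row <- data.frame(coat = "tortoiseshell", weight = 3.3,
                      likes_string = 1, age = 9)
cats <- rbind(cats, new_row)
```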

Removing rows +


We now know how to add rows and columns to our data frame in R. Now let’s learn to remove rows.


R +

+cats

           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9

We can ask for a data frame minus the last row:


R +

+cats[-4, ]


    coat weight likes_string age
+1 calico    2.1            1   2
+2  black    5.0            0   3
+3  tabby    3.2            1   5

Notice the comma with nothing after it to indicate that we want to drop the entire fourth row.


Note: we could also remove several rows at once by putting the row numbers inside of a vector, for example: cats[c(-3,-4), ]


Removing columns +


We can also remove columns from our data frame. What if we want to remove the column “age”? We can remove it in two ways: by column position or by column name.


R +

+cats[, -4]

           coat weight likes_string
+1        calico    2.1            1
+2         black    5.0            0
+3         tabby    3.2            1
+4 tortoiseshell    3.3            1

Notice the comma with nothing before it, indicating we want to keep all of the rows.


Alternatively, we can drop the column by using the column name and the %in% operator. The %in% operator goes through each element of its left argument, in this case the names of cats, and asks, “Does this element occur in the second argument?”


R +

+drop <- names(cats) %in% c("age")
+cats[, !drop]


           coat weight likes_string
+1        calico    2.1            1
+2         black    5.0            0
+3         tabby    3.2            1
+4 tortoiseshell    3.3            1

We will cover subsetting with logical operators like %in% in more detail in the next episode. See the section “Subsetting through other logical operations”.
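
As a sketch of both approaches side by side (using a re-created `cats` data frame, with `keep` as an illustrative name), dropping by name gives the same result as dropping by position while the column order holds, but it keeps working if the columns are later reordered:

```r
# Hypothetical re-creation of the lesson's `cats` data frame with ages
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1),
                   age = c(2, 3, 5))

by_position <- cats[, -4]            # relies on age being column 4

keep <- !(names(cats) %in% "age")    # TRUE for every column except age
by_name <- cats[, keep]

names(by_name)
```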


Appending to a data frame +


The key to remember when adding data to a data frame is that columns are vectors and rows are lists. We can also glue two data frames together with rbind:


R +

+cats <- rbind(cats, cats)


           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+5        calico    2.1            1   2
+6         black    5.0            0   3
+7         tabby    3.2            1   5
+8 tortoiseshell    3.3            1   9

If the row names are no longer sequential after binding (this can happen, for example, when the data frames being combined carry their own row names), we can remove them, and R will automatically renumber the rows sequentially:


R +

+rownames(cats) <- NULL


           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+5        calico    2.1            1   2
+6         black    5.0            0   3
+7         tabby    3.2            1   5
+8 tortoiseshell    3.3            1   9
+ +

Challenge 1 +


You can create a new data frame right from within R with the following syntax:


R +

+df <- data.frame(id = c("a", "b", "c"),
+                 x = 1:3,
+                 y = c(TRUE, TRUE, FALSE))

Make a data frame that holds the following information for yourself:

  • first name
  • last name
  • lucky number

Then use rbind to add an entry for the people sitting beside you. Finally, use cbind to add a column with each person’s answer to the question, “Is it time for coffee break?”

+ +

R +

+df <- data.frame(first = c("Grace"),
+                 last = c("Hopper"),
+                 lucky_number = c(0))
+df <- rbind(df, list("Marie", "Curie", 238) )
+df <- cbind(df, coffeetime = c(TRUE,TRUE))

Realistic example +


So far, you have seen the basics of manipulating data frames with our cat data; now let’s use those skills to digest a more realistic dataset. Let’s read in the gapminder dataset that we downloaded previously:


R +

+gapminder <- read.csv("data/gapminder_data.csv")
+ +

Miscellaneous Tips +

  • Another type of file you might encounter is the tab-separated values file (.tsv). To specify a tab as the separator, pass sep = "\t" to read.csv(), or use read.delim().

  • Files can also be downloaded directly from the Internet into a local folder of your choice using the download.file function. The read.csv function can then be used to read the downloaded file from the download location, for example,


R +

+download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
+gapminder <- read.csv("data/gapminder_data.csv")
  • Alternatively, you can read files directly into R from the Internet by replacing the file path with a web address in read.csv. Note that in this case no local copy of the csv file is saved onto your computer. For example,

R +

+gapminder <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv")
  • You can read directly from Excel spreadsheets without converting them to plain text first by using the readxl package.

  • The argument “stringsAsFactors” tells R whether to read strings as factors or as character strings. In R 4.0.0 and later, strings are read in as characters by default; in earlier versions of R, strings are read in as factors by default. For more information, see the call-out in the previous episode.
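
The tab-separated case mentioned in the first tip can be sketched as follows. The file here is a small temporary one written just for illustration; its contents and the variable names are assumptions, not part of the gapminder data.

```r
# Write a tiny tab-separated file to a temporary location
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("coat\tweight", "calico\t2.1", "black\t5.0"), tsv_path)

# Two equivalent ways to read it
via_delim <- read.delim(tsv_path)            # tab is the default separator
via_csv   <- read.csv(tsv_path, sep = "\t")  # override the comma separator

via_delim
```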

Let’s investigate gapminder a bit; the first thing we should always do is check out what the data looks like with str:


R +

+str(gapminder)

'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...

An additional method for examining the structure of gapminder is to use the summary function. This function can be used on various objects in R. For data frames, summary yields a numeric, tabular, or descriptive summary of each column. Numeric or integer columns are described by descriptive statistics (quartiles and mean), and character columns by their length, class, and mode.


R +

+summary(gapminder)

   country               year           pop             continent        
+ Length:1704        Min.   :1952   Min.   :6.001e+04   Length:1704       
+ Class :character   1st Qu.:1966   1st Qu.:2.794e+06   Class :character  
+ Mode  :character   Median :1980   Median :7.024e+06   Mode  :character  
+                    Mean   :1980   Mean   :2.960e+07                     
+                    3rd Qu.:1993   3rd Qu.:1.959e+07                     
+                    Max.   :2007   Max.   :1.319e+09                     
+    lifeExp        gdpPercap       
+ Min.   :23.60   Min.   :   241.2  
+ 1st Qu.:48.20   1st Qu.:  1202.1  
+ Median :60.71   Median :  3531.8  
+ Mean   :59.47   Mean   :  7215.3  
+ 3rd Qu.:70.85   3rd Qu.:  9325.5  
+ Max.   :82.60   Max.   :113523.1  

Along with the str and summary functions, we can examine individual columns of the data frame with our typeof function:


R +

+typeof(gapminder$year)

[1] "integer"

R +

+typeof(gapminder$country)

[1] "character"

R +

+str(gapminder$country)

 chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...

We can also interrogate the data frame for information about its dimensions; remembering that str(gapminder) said there were 1704 observations of 6 variables in gapminder, what do you think the following will produce, and why?


R +

+length(gapminder)

[1] 6

A fair guess would have been to say that the length of a data frame would be the number of rows it has (1704), but this is not the case; remember, a data frame is a list of vectors and factors:


R +

+typeof(gapminder)

[1] "list"

When length gave us 6, it’s because gapminder is built out of a list of 6 columns. To get the number of rows and columns in our dataset, try:


R +

+nrow(gapminder)

[1] 1704

R +

+ncol(gapminder)

[1] 6

Or, both at once:


R +

+dim(gapminder)

[1] 1704    6

We’ll also likely want to know what the titles of all the columns are, so we can ask for them later:


R +

+colnames(gapminder)

[1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"

At this stage, it’s important to ask ourselves if the structure R is reporting matches our intuition or expectations; do the basic data types reported for each column make sense? If not, we need to sort any problems out now before they turn into bad surprises down the road, using what we’ve learned about how R interprets data, and the importance of strict consistency in how we record our data.


Once we’re happy that the data types and structures seem reasonable, it’s time to start digging into our data proper. Check out the first few lines:


R +

+head(gapminder)

      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134
+ +

Challenge 2 +


It’s good practice to also check the last few lines of your data and some in the middle. How would you do this?


Searching for ones specifically in the middle isn’t too hard, but we could ask for a few lines at random. How would you code this?

+ +

Checking the last few lines is relatively simple, as R already has a function for this:


R +

+tail(gapminder, n = 15)

What about a few arbitrary rows just in case something is odd in the middle?


Tip: There are several ways to achieve this.


The solution here presents one form of using nested functions, i.e. the result of one function call passed as an argument to another function. This might sound like a new concept, but you are already using it! Remember that my_dataframe[rows, cols] will print your data frame to the screen with the rows and columns you asked for (although you might have asked for a range or named columns, for example). How would you get the last row if you don’t know how many rows your data frame has? R has a function for this. What about getting a (pseudorandom) sample? R also has a function for this.


R +

+gapminder[sample(nrow(gapminder), 5), ]
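
Because sample draws pseudorandom numbers, each run picks different rows. If we want the same rows every time, we can fix the random seed first. This sketch uses the built-in mtcars dataset so that it runs without the gapminder file; the seed value 42 is arbitrary.

```r
set.seed(42)                        # arbitrary seed, chosen for illustration
rows_a <- sample(nrow(mtcars), 5)   # five distinct row numbers

set.seed(42)                        # resetting the seed repeats the draw
rows_b <- sample(nrow(mtcars), 5)

random_rows <- mtcars[rows_a, ]     # the comma keeps all columns
identical(rows_a, rows_b)
```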

To make sure our analysis is reproducible, we should put the code into a script file so we can come back to it later.

+ +

Challenge 3 +


Go to File -> New File -> R Script, and write an R script to load in the gapminder dataset. Put it in the scripts/ directory and add it to version control.


Run the script using the source function, using the file path as its argument (or by pressing the “source” button in RStudio).

+ +

The source function can be used to run one script from within another. Suppose you need to load the same type of file over and over again, each time specifying the same arguments. Instead of writing the necessary arguments again and again, you can write them once and save them in a script. Then you can use source("Your_Script_containing_the_load_function") in a new script to reuse that code without writing everything again. Check out ?source to find out more.


R +

+download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
+gapminder <- read.csv(file = "data/gapminder_data.csv")

To run the script and load the data into the gapminder variable:


R +

+source(file = "scripts/load-gapminder.R")
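
The idea of reusing a loading function through source can be sketched without touching the project folder. Everything below is hypothetical: the helper `read_my_csv`, the temporary script, and the tiny csv file exist only for this illustration.

```r
# Write a helper script to a temporary file
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "read_my_csv <- function(path) {",
  "  read.csv(path, header = TRUE)",
  "}"
), script_path)

# Running the script defines read_my_csv in our session
source(script_path)

# Use the sourced function on a small example file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,x", "a,1", "b,2"), csv_path)
small <- read_my_csv(csv_path)
small
```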
+ +

Challenge 4 +


Read the output of str(gapminder) again; this time, use what you’ve learned about lists and vectors, as well as the output of functions like colnames and dim to explain what everything that str prints out for gapminder means. If there are any parts you can’t interpret, discuss with your neighbors!

+ +

The object gapminder is a data frame with columns

  • country and continent are character strings.
  • year is an integer vector.
  • pop, lifeExp, and gdpPercap are numeric vectors.
+ +

Keypoints +

  • Use cbind() to add a new column to a data frame.
  • Use rbind() to add a new row to a data frame.
  • Remove rows from a data frame.
  • Use str(), summary(), nrow(), ncol(), dim(), colnames(), rownames(), head(), and typeof() to understand the structure of a data frame.
  • Read in a csv file using read.csv().
  • Understand what length() of a data frame represents.
+ + +
+ +
+ + diff --git a/06-data-subsetting.html b/06-data-subsetting.html new file mode 100644 index 000000000..1136db11c --- /dev/null +++ b/06-data-subsetting.html @@ -0,0 +1,1991 @@ + +R for Reproducible Scientific Analysis: Subsetting Data +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Subsetting Data


Last updated on 2023-10-26

+ + + +
+ +
+ + + +




  • How can I work with subsets of data in R?


  • To be able to subset vectors, factors, matrices, lists, and data frames
  • To be able to extract individual and multiple elements: by index, by name, using comparison operations
  • To be able to skip and remove elements from various data structures.

R has many powerful subset operators. Mastering them will allow you to easily perform complex operations on any kind of dataset.


There are six different ways we can subset any kind of object, and three different subsetting operators for the different data structures.


Let’s start with the workhorse of R: a simple numeric vector.


R +

+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')


  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 
+ +

Atomic vectors +


In R, simple vectors containing character strings, numbers, or logical values are called atomic vectors because they can’t be further simplified.


So now that we’ve created a dummy vector to play with, how do we get at its contents?


Accessing elements using their indices +


To extract elements of a vector we can give their corresponding index, starting from one:


R +




R +




It may look different, but the square brackets operator is a function. For vectors (and matrices), it means “get me the nth element”.
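
A short sketch of both points, using the vector defined above (the names `first`, `fourth`, and `also_first` are only illustrative): indexing starts at one, and the bracket can be called like any other function by wrapping it in backticks.

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

first  <- x[1]   # the element at index 1, named "a"
fourth <- x[4]   # the element at index 4, named "d"

# The same extraction written as an ordinary function call
also_first <- `[`(x, 1)
identical(first, also_first)
```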


We can ask for multiple elements at once:


R +

+x[c(1, 3)]


  a   c 
+5.4 7.1 

Or slices of the vector:


R +

+x[1:4]

  a   b   c   d 
+5.4 6.2 7.1 4.8 

The : operator creates a sequence of numbers from the left element to the right.


R +

+1:4

[1] 1 2 3 4

R +

+c(1, 2, 3, 4)


[1] 1 2 3 4

We can ask for the same element multiple times:


R +

+x[c(1, 1, 3)]

  a   a   c 
+5.4 5.4 7.1 

If we ask for an index beyond the length of the vector, R will return a missing value:


R +

+x[6]

<NA> 
+  NA 

This is a vector of length one containing an NA, whose name is also NA.


If we ask for the 0th element, we get an empty vector:


R +

+x[0]

named numeric(0)
+ +

Vector numbering in R starts at 1 +


In many programming languages (C and Python, for example), the first +element of a vector has an index of 0. In R, the first element is 1.


Skipping and removing elements +


If we use a negative number as the index of a vector, R will return every element except for the one specified:


R +

+x[-2]

  a   c   d   e 
+5.4 7.1 4.8 7.5 

We can skip multiple elements:


R +

+x[c(-1, -5)]  # or x[-c(1,5)]


  b   c   d 
+6.2 7.1 4.8 
+ +

Tip: Order of operations +


A common trip up for novices occurs when trying to skip slices of a vector. It’s natural to try to negate a sequence like so:


R +

+x[-1:3]

This gives a somewhat cryptic error:



Error in x[-1:3]: only 0's may be mixed with negative subscripts

But remember the order of operations. : is really a function. It takes its first argument as -1, and its second as 3, so it generates the sequence of numbers c(-1, 0, 1, 2, 3).


The correct solution is to wrap that function call in parentheses, so that the - operator applies to the result:


R +

+x[-(1:3)]

  d   e 
+4.8 7.5 
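
The difference between the two index vectors can be inspected directly. In this sketch the failing subset is wrapped in tryCatch so the code keeps running; the variable names are our own.

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

wrong_index <- -1:3      # parsed as (-1):3, i.e. -1 0 1 2 3
right_index <- -(1:3)    # negation applied to 1 2 3, i.e. -1 -2 -3

kept <- x[right_index]   # drops the first three elements

# Mixing negative and positive subscripts is an error; capture it
result <- tryCatch(x[wrong_index], error = function(e) "error")
result
```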

To remove elements from a vector, we need to assign the result back into the variable:


R +

+x <- x[-4]


  a   b   c   e 
+5.4 6.2 7.1 7.5 
+ +

Challenge 1 +


Given the following code:


R +

+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')


  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 

Come up with at least 2 different commands that will produce the following output:



  b   c   d 
+6.2 7.1 4.8 

After you find 2 different commands, compare notes with your neighbour. Did you have different strategies?

+ +

R +



  b   c   d 
+6.2 7.1 4.8 

R +



  b   c   d 
+6.2 7.1 4.8 

R +



  b   c   d 
+6.2 7.1 4.8 
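
A few commands that would produce this output are sketched below. This is not an exhaustive list, and the variable names are only for illustration.

```r
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')

by_slice     <- x[2:4]                                # a range of indices
by_exclusion <- x[-c(1, 5)]                           # drop first and last
by_name      <- x[c("b", "c", "d")]                   # select by name
by_logical   <- x[c(FALSE, TRUE, TRUE, TRUE, FALSE)]  # logical mask

by_slice
```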

Subsetting by name +


We can extract elements by using their name, instead of extracting by index:


R +

+x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we can name a vector 'on the fly'
+x[c("a", "c")]


  a   c 
+5.4 7.1 

This is usually a much more reliable way to subset objects: the position of various elements can often change when chaining together subsetting operations, but the names will always remain the same!


Subsetting through other logical operations +


We can also use any logical vector to subset:


R +

+x[c(FALSE, FALSE, TRUE, FALSE, TRUE)]

  c   e 
+7.1 7.5 

Since comparison operators (e.g. >, <, ==) evaluate to logical vectors, we can also use them to succinctly subset vectors: the following statement gives the same result as the previous one.


R +

+x[x > 7]


  c   e 
+7.1 7.5 

Breaking it down, this statement first evaluates x>7, generating a logical vector c(FALSE, FALSE, TRUE, FALSE, TRUE), and then selects the elements of x corresponding to the TRUE values.


We can use == to mimic the previous method of indexing by name (remember you have to use == rather than = for comparisons):


R +

+x[names(x) == "a"]


  a 
+5.4 
+ +

Tip: Combining logical conditions +


We often want to combine multiple logical criteria. For example, we might want to find all the countries that are located in Asia or Europe and have life expectancies within a certain range. Several operations for combining logical vectors exist in R:

  • &, the “logical AND” operator: returns TRUE if both the left and right are TRUE.
  • |, the “logical OR” operator: returns TRUE if either the left or right (or both) are TRUE.

You may sometimes see && and || instead of & and |. These two-character operators are meant for single logical values: in older versions of R they looked only at the first element of each vector and ignored the rest, and since R 4.3.0 they raise an error when given a vector longer than one. In general you should not use the two-character operators in data analysis; save them for programming, i.e. deciding whether to execute a statement.

  • !, the “logical NOT” operator: converts TRUE to FALSE and FALSE to TRUE. It can negate a single logical condition (e.g. !TRUE becomes FALSE), or a whole vector of conditions (e.g. !c(TRUE, FALSE) becomes c(FALSE, TRUE)).

Additionally, you can compare the elements within a single vector using the all function (which returns TRUE if every element of the vector is TRUE) and the any function (which returns TRUE if one or more elements of the vector are TRUE).
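
These operators can be sketched on the example vector. The thresholds 5 and 7.2 are arbitrary values chosen for illustration.

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

in_range <- x > 5 & x < 7.2   # element-wise AND
outside  <- x < 5 | x > 7.2   # element-wise OR
negated  <- !in_range         # element-wise NOT

x[in_range]                   # keeps a, b and c

any_above_7 <- any(x > 7)     # is at least one element above 7?
all_above_4 <- all(x > 4)     # is every element above 4?
```

Here `outside` and `negated` are the same vector, because no element is exactly 5 or 7.2.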

+ +

Challenge 2 +


Given the following code:


R +

+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')


  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 

Write a subsetting command to return the values in x that are greater +than 4 and less than 7.

+ +

R +

+x_subset <- x[x<7 & x>4]


  a   b   d 
+5.4 6.2 4.8 
+ +

Tip: Non-unique names +


You should be aware that it is possible for multiple elements in a +vector to have the same name. (For a data frame, columns can have the +same name — although R tries to avoid this — but row names must be +unique.) Consider these examples:


R +

+x <- 1:3


[1] 1 2 3

R +

+names(x) <- c('a', 'a', 'a')


a a a 
+1 2 3 

R +

+x['a']  # only returns first value


a 
+1 

R +

+x[names(x) == 'a']  # returns all three values


a a a 
+1 2 3 
+ +

Tip: Getting help for operators +


Remember you can search for help on operators by wrapping them in quotes: help("%in%") or ?"%in%".


Skipping named elements +


Skipping or removing named elements is a little harder. If we try to skip one named element by negating the string, R complains (slightly obscurely) that it doesn’t know how to take the negative of a string:


R +

+x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we start again by naming a vector 'on the fly'
+x[-"a"]


Error in -"a": invalid argument to unary operator

However, we can use the != (not-equals) operator to construct a logical vector that will do what we want:


R +

+x[names(x) != "a"]


  b   c   d   e 
+6.2 7.1 4.8 7.5 

Skipping multiple named indices is a little bit harder still. Suppose we want to drop the "a" and "c" elements, so we try this:


R +

+x[names(x) != c("a", "c")]

Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length


  b   c   d   e 
+6.2 7.1 4.8 7.5 

R did something, but it gave us a warning that we ought to pay attention to, and it apparently gave us the wrong answer (the "c" element is still included in the vector)!


So what does != actually do in this case? That’s an excellent question.




Let’s take a look at the comparison component of this code:


R +

+names(x) != c("a", "c")


Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length


[1] FALSE  TRUE  TRUE  TRUE  TRUE

Why does R give TRUE as the third element of this vector, when names(x)[3] != "c" is obviously false? When you use !=, R tries to compare each element of the left argument with the corresponding element of its right argument. What happens when you compare vectors of different lengths?

[Figure: Inequality testing]

When one vector is shorter than the other, it gets recycled:

[Figure: Inequality testing: results of recycling]

In this case R repeats c("a", "c") as many times as necessary to match names(x), i.e. we get c("a","c","a","c","a"). Since the recycled "a" doesn’t match the third element of names(x), the value of != is TRUE. Because in this case the longer vector length (5) isn’t a multiple of the shorter vector length (2), R printed a warning message. If we had been unlucky and names(x) had contained six elements, R would silently have done the wrong thing (i.e., not what we intended it to do). This recycling rule can introduce subtle and hard-to-find bugs!


The way to get R to do what we really want (match each element of the left argument with all of the elements of the right argument) is to use the %in% operator. The %in% operator goes through each element of its left argument, in this case the names of x, and asks, “Does this element occur in the second argument?”. Here, since we want to exclude values, we also need a ! operator to change “in” to “not in”:


R +

+x[! names(x) %in% c("a","c") ]


  b   d   e 
+6.2 4.8 7.5 
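
The "unlucky" six-element case described above can be sketched as follows. The vector `y` is a hypothetical example introduced only to show recycling without a warning.

```r
# Six named elements: 6 is a multiple of 2, so no warning is given
y <- c(a = 1, b = 2, c = 3, d = 4, e = 5, f = 6)

# != compares against the recycled vector a, c, a, c, a, c
recycled <- names(y) != c("a", "c")
names(y[recycled])   # "c" silently survives

# %in% checks every name against the whole set
intended <- !(names(y) %in% c("a", "c"))
names(y[intended])
```

The first subset still contains "c" because the third name was compared with the recycled "a", not with "c".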
+ +

Challenge 3 +


Selecting elements of a vector that match any of a list of components is a very common data analysis task. For example, the gapminder data set contains country and continent variables, but no information between these two scales. Suppose we want to pull out information from southeast Asia: how do we set up an operation to produce a logical vector that is TRUE for all of the countries in southeast Asia and FALSE otherwise?


Suppose you have these data:


R +

+seAsia <- c("Myanmar","Thailand","Cambodia","Vietnam","Laos")
+## read in the gapminder data that we downloaded in episode 2
+gapminder <- read.csv("data/gapminder_data.csv", header=TRUE)
+## extract the `country` column from a data frame (we'll see this later);
+## convert from a factor to a character;
+## and get just the non-repeated elements
+countries <- unique(as.character(gapminder$country))

There’s a wrong way (using only ==), which will give you a warning; a clunky way (using the logical operators == and |); and an elegant way (using %in%). See whether you can come up with all three and explain how they (don’t) work.

+ +
  • The wrong way to do this problem is countries==seAsia. This gives a warning ("In countries == seAsia : longer object length is not a multiple of shorter object length") and the wrong answer (a vector of all FALSE values), because none of the recycled values of seAsia happen to line up correctly with matching values in countries.
  • The clunky (but technically correct) way to do this problem is

R +

+ (countries=="Myanmar" | countries=="Thailand" |
+ countries=="Cambodia" | countries == "Vietnam" | countries=="Laos")

(or countries==seAsia[1] | countries==seAsia[2] | ...). This gives the correct values, but hopefully you can see how awkward it is (what if we wanted to select countries from a much longer list?).

  • The best way to do this problem is countries %in% seAsia, which is both correct and easy to type (and read).

Handling special values +


At some point you will encounter functions in R that cannot handle missing, infinite, or undefined data.


There are a number of special functions you can use to filter out this data:

  • is.na returns a logical value for every position in a vector, matrix, or data.frame, with TRUE wherever the element is NA (or NaN).
  • Likewise, is.nan and is.infinite do the same for NaN and Inf.
  • is.finite marks all positions in a vector or matrix that do not contain NA, NaN or Inf.
  • na.omit filters out all missing values from a vector.
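
The functions listed above can be sketched on a small vector that contains every kind of special value. The vector `v` is made up for illustration.

```r
v <- c(1, NA, NaN, Inf, -Inf, 5)

is.na(v)        # TRUE for both NA and NaN
is.nan(v)       # TRUE only for NaN
is.infinite(v)  # TRUE for Inf and -Inf
is.finite(v)    # TRUE only for ordinary numbers

clean <- v[is.finite(v)]   # keep only the ordinary numbers
no_missing <- na.omit(v)   # drops NA and NaN but keeps Inf and -Inf
```

Note that na.omit does not remove infinite values, so is.finite is the stricter filter.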

Factor subsetting +


Now that we’ve explored the different ways to subset vectors, how do +we subset the other data structures?


Factor subsetting works the same way as vector subsetting.


R +

+f <- factor(c("a", "a", "b", "c", "c", "d"))
+f[f == "a"]


[1] a a
+Levels: a b c d

R +

+f[f %in% c("b", "c")]


[1] b c c
+Levels: a b c d

R +

+f[1:3]

[1] a a b
+Levels: a b c d

Skipping elements will not remove the level even if no more of that category exists in the factor:


R +

+f[-3]

[1] a a c c d
+Levels: a b c d
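
If we do want the unused level removed, base R provides droplevels. This is a sketch of one way to do it; the variable names are our own.

```r
f <- factor(c("a", "a", "b", "c", "c", "d"))

without_b <- f[-3]             # the only "b" is gone
levels(without_b)              # but "b" is still listed as a level

dropped <- droplevels(without_b)
levels(dropped)                # unused level removed
```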

Matrix subsetting +


Matrices are also subsetted using the [ function. In this case it takes two arguments: the first applying to the rows, the second to its columns:


R +

+m <- matrix(rnorm(6*4), ncol=4, nrow=6)
+m[3:4, c(3,1)]


            [,1]       [,2]
+[1,]  1.12493092 -0.8356286
+[2,] -0.04493361  1.5952808

You can leave the first or second arguments blank to retrieve all the rows or columns respectively:


R +

+m[, c(3,4)]


            [,1]        [,2]
+[1,] -0.62124058  0.82122120
+[2,] -2.21469989  0.59390132
+[3,]  1.12493092  0.91897737
+[4,] -0.04493361  0.78213630
+[5,] -0.01619026  0.07456498
+[6,]  0.94383621 -1.98935170

If we only access one row or column, R will automatically convert the result to a vector:


R +

+m[3,]

[1] -0.8356286  0.5757814  1.1249309  0.9189774

If you want to keep the output as a matrix, you need to specify a third argument, drop = FALSE:


R +

+m[3, , drop=FALSE]


           [,1]      [,2]     [,3]      [,4]
+[1,] -0.8356286 0.5757814 1.124931 0.9189774

Unlike vectors, if we try to access a row or column outside of the matrix, R will throw an error:


R +

+m[, c(3,6)]


Error in m[, c(3, 6)]: subscript out of bounds
+ +

Tip: Higher dimensional arrays +


When dealing with multi-dimensional arrays, each argument to [ corresponds to a dimension. For example, for a 3D array, the first three arguments correspond to the rows, columns, and depth dimension.
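
As a small sketch of this, here is a made-up 3D array (the names arr, one_value and first_layer are only for illustration):

```r
# 2 rows, 3 columns, and a depth of 4, filled column by column with 1..24
arr <- array(1:24, dim = c(2, 3, 4))

# One index per dimension picks out a single element
one_value <- arr[1, 2, 3]

# Leaving the first two arguments blank keeps all rows and columns
# of the first layer, which gives a 2 x 3 matrix
first_layer <- arr[, , 1]
dim(first_layer)
```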


Because matrices are vectors, we can also subset using only one argument:


R +

+m[5]


[1] 0.3295078

This usually isn’t useful, and is often confusing to read. However, it is useful to note that matrices are laid out in column-major format by default. That is, the elements of the vector are arranged column-wise:


R +

+matrix(1:6, nrow=2, ncol=3)


     [,1] [,2] [,3]
+[1,]    1    3    5
+[2,]    2    4    6

If you wish to populate the matrix by row, use byrow=TRUE:


R +

+matrix(1:6, nrow=2, ncol=3, byrow=TRUE)


     [,1] [,2] [,3]
+[1,]    1    2    3
+[2,]    4    5    6

Matrices can also be subsetted using their rownames and column names instead of their row and column indices.
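
A minimal sketch of subsetting by name, assuming a small matrix whose row and column names ("r1", "r2", "a", "b", "c") are made up for this example:

```r
named_m <- matrix(1:6, nrow = 2, ncol = 3,
                  dimnames = list(c("r1", "r2"), c("a", "b", "c")))

# A single element, addressed by row name and column name
one_cell <- named_m["r2", "b"]

# All rows, two named columns
two_cols <- named_m[, c("a", "c")]

# A single named row, kept as a 1 x 3 matrix
one_row <- named_m["r1", , drop = FALSE]
```

Names tend to be more robust than numeric indices, because they keep working if rows or columns are reordered.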

+ +

Challenge 4 +


Given the following code:


R +

+m <- matrix(1:18, nrow=3, ncol=6)
+print(m)


     [,1] [,2] [,3] [,4] [,5] [,6]
+[1,]    1    4    7   10   13   16
+[2,]    2    5    8   11   14   17
+[3,]    3    6    9   12   15   18

Which of the following commands will extract the values 11 and 14?

A. m[2,4,2,5]


B. m[2:5]


C. m[4:5,2]


D. m[2,c(4,5)]

+ +



List subsetting +


Now we’ll introduce some new subsetting operators. There are three functions used to subset lists. We’ve already seen these when learning about atomic vectors and matrices: [, [[, and $.


Using [ will always return a list. If you want to subset a list, but not extract an element, then you will likely use [.


R +

+xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))
+xlist[1]


$a
+[1] "Software Carpentry"

This returns a list with one element.


We can subset elements of a list exactly the same way as atomic vectors using [. Comparison operations, however, won’t work because they are not recursive: they will try to condition on the data structures in each element of the list, not on the individual elements within those data structures.


R +

+xlist[1:2]


$a
+[1] "Software Carpentry"
+
+$b
+ [1]  1  2  3  4  5  6  7  8  9 10

To extract individual elements of a list, you need to use the double-square bracket function: [[.


R +

+xlist[[1]]


[1] "Software Carpentry"

Notice that now the result is a vector, not a list.


You can’t extract more than one element at once:


R +

+xlist[[1:2]]


Error in xlist[[1:2]]: subscript out of bounds

Nor use it to skip elements:


R +

+xlist[[-1]]


Error in xlist[[-1]]: invalid negative subscript in get1index <real>

But you can use names to both subset and extract elements:


R +

+xlist[["a"]]


[1] "Software Carpentry"

The $ function is a shorthand way for extracting elements by name:


R +

+xlist$data


                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
+Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
+Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
+Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
+Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
+Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
+Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
+ +

Challenge 5 +


Given the following list:


R +

+xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))

Using your knowledge of both list and vector subsetting, extract the number 2 from xlist. Hint: the number 2 is contained within the “b” item in the list.

+ +

R +

+xlist$b[2]


[1] 2

R +

+xlist[[2]][2]


[1] 2

R +

+xlist[["b"]][2]


[1] 2
+ +

Challenge 6 +


Given a linear model:


R +

+mod <- aov(pop ~ lifeExp, data=gapminder)

Extract the residual degrees of freedom (hint: attributes() will help you)

+ +

R +

+attributes(mod) ## `df.residual` is one of the names of `mod`

R +

+mod$df.residual


Data frames +


Remember that data frames are lists under the hood, so similar rules apply. However, they are also two-dimensional objects:


[ with one argument will act the same way as for lists, where each list element corresponds to a column. The resulting object will be a data frame:


R +

+head(gapminder[3])


       pop
+1  8425333
+2  9240934
+3 10267083
+4 11537966
+5 13079460
+6 14880372

Similarly, [[ will act to extract a single column:


R +

+head(gapminder[["lifeExp"]])


[1] 28.801 30.332 31.997 34.020 36.088 38.438

And $ provides a convenient shorthand to extract columns by name:


R +

+head(gapminder$year)


[1] 1952 1957 1962 1967 1972 1977

With two arguments, [ behaves the same way as for matrices:


R +

+gapminder[1:3,]


      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007

If we subset a single row, the result will be a data frame (because the elements are mixed types):


R +

+gapminder[3,]


      country year      pop continent lifeExp gdpPercap
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007

But for a single column the result will be a vector (this can be changed with the third argument, drop = FALSE).
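
The effect of drop = FALSE on a data frame can be sketched with a tiny made-up data frame (df and its values are hypothetical stand-ins, not the gapminder data):

```r
df <- data.frame(country = c("A", "B", "C"),
                 year = c(1952, 1957, 1962),
                 stringsAsFactors = FALSE)

# Matrix-style subsetting of one column simplifies to a vector
years_vec <- df[, "year"]

# drop = FALSE keeps the result as a one-column data frame
years_df <- df[, "year", drop = FALSE]

# List-style subsetting with a single argument also keeps a data frame
years_df2 <- df["year"]
```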

+ +

Challenge 7 +


Fix each of the following common data frame subsetting errors:

  1. Extract observations collected for the year 1957

R +

gapminder[gapminder$year = 1957,]

  2. Extract all columns except 1 through to 4

R +

+gapminder[,-1:4]

  3. Extract the rows where the life expectancy is longer than 80 years

R +

+gapminder[gapminder$lifeExp > 80]

  4. Extract the first row, and the fourth and fifth columns (continent and lifeExp).

R +

+gapminder[1, 4, 5]

  5. Advanced: extract rows that contain information for the years 2002 and 2007

R +

+gapminder[gapminder$year == 2002 | 2007,]
+ +

Fix each of the following common data frame subsetting errors:

  1. Extract observations collected for the year 1957

R +

+# gapminder[gapminder$year = 1957,]
+gapminder[gapminder$year == 1957,]

  2. Extract all columns except 1 through to 4

R +

+# gapminder[,-1:4]
+gapminder[,-c(1:4)]

  3. Extract the rows where the life expectancy is longer than 80 years

R +

+# gapminder[gapminder$lifeExp > 80]
+gapminder[gapminder$lifeExp > 80,]

  4. Extract the first row, and the fourth and fifth columns (continent and lifeExp).

R +

+# gapminder[1, 4, 5]
+gapminder[1, c(4, 5)]

  5. Advanced: extract rows that contain information for the years 2002 and 2007

R +

+# gapminder[gapminder$year == 2002 | 2007,]
+gapminder[gapminder$year == 2002 | gapminder$year == 2007,]
+gapminder[gapminder$year %in% c(2002, 2007),]
+ +

Challenge 8 +

  1. Why does gapminder[1:20] return an error? How does it differ from gapminder[1:20, ]?

  2. Create a new data.frame called gapminder_small that only contains rows 1 through 9 and 19 through 23. You can do this in one or two steps.

+ +
  1. With a single argument, [ treats a data.frame as a list of columns, so gapminder[1:20] asks for columns 1 to 20, and gapminder only has 6 columns. To select rows, a data.frame needs to be subsetted on two dimensions: gapminder[1:20, ] subsets the data to give the first 20 rows and all columns.

  2. One possible solution:

R +

+gapminder_small <- gapminder[c(1:9, 19:23),]
+ +

Keypoints +

  • Indexing in R starts at 1, not 0.
  • Access individual values by location using [].
  • Access slices of data using [low:high].
  • Access arbitrary sets of data using [c(...)].
  • Use logical operations and logical vectors to access subsets of data.
+ + +
+ +
+ + diff --git a/07-control-flow.html b/07-control-flow.html new file mode 100644 index 000000000..590210c89 --- /dev/null +++ b/07-control-flow.html @@ -0,0 +1,1247 @@ + +R for Reproducible Scientific Analysis: Control Flow +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Control Flow


Last updated on 2023-10-26

+ + + +
+ +
+ + + +




  • How can I make data-dependent choices in R?
  • How can I repeat operations in R?


  • Write conditional statements with if...else statements and ifelse().
  • Write and understand for() loops.

Often when we’re coding we want to control the flow of our actions. This can be done by setting actions to occur only if a condition or a set of conditions is met. Alternatively, we can also set an action to occur a particular number of times.


There are several ways you can control flow in R. For conditional statements, the most commonly used approaches are the constructs:


R +

# if
+if (condition is true) {
+  perform action
+}
+
+# if ... else
+if (condition is true) {
+  perform action
+} else {  # that is, if the condition is false,
+  perform alternative action
+}

Say, for example, that we want R to print a message if a variable x has a particular value:


R +

+x <- 8
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+}
+x


[1] 8

The print statement does not appear in the console because x is not greater than or equal to 10. To print a different message for numbers less than 10, we can add an else statement.


R +

+x <- 8
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else {
+  print("x is less than 10")
+}


[1] "x is less than 10"

You can also test multiple conditions by using else if.


R +

+x <- 8
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else if (x > 5) {
+  print("x is greater than 5, but less than 10")
+} else {
+  print("x is less than or equal to 5")
+}


[1] "x is greater than 5, but less than 10"

Important: when R evaluates the condition inside if() statements, it is looking for a logical element, i.e., TRUE or FALSE. This can cause some headaches for beginners. For example:


R +

+x  <-  4 == 3
+if (x) {
+  "4 equals 3"
+} else {
+  "4 does not equal 3"
+}


[1] "4 does not equal 3"

As we can see, the “not equal” message was printed because the vector x is FALSE:


R +

+x <- 4 == 3
+x


[1] FALSE


+ +

Challenge 1 +


Use an if() statement to print a suitable message reporting whether there are any records from 2002 in the gapminder dataset. Now do the same for 2012.

+ +

We will first see a solution to Challenge 1 which does not use the any() function. We first subset gapminder to the rows where gapminder$year is equal to 2002:


R +

+gapminder[(gapminder$year == 2002),]

Then, we count the number of rows of the data.frame gapminder that correspond to the year 2002:


R +

+rows2002_number <- nrow(gapminder[(gapminder$year == 2002),])

The presence of any record for the year 2002 is equivalent to the condition that rows2002_number is one or more:


R +

+rows2002_number >= 1

Putting it all together, we obtain:


R +

+if(nrow(gapminder[(gapminder$year == 2002),]) >= 1){
+   print("Record(s) for the year 2002 found.")
+}

All this can be done more quickly with any(). The logical condition can be expressed as:


R +

+if(any(gapminder$year == 2002)){
+   print("Record(s) for the year 2002 found.")
+}

Did anyone get an error message like this?



Error in if (gapminder$year == 2012) {: the condition has length > 1

The if() function only accepts a condition of length 1. Since R 4.2.0, passing it a longer vector is an error; earlier versions of R issued a warning instead and evaluated only the first element of the vector. Either way, to use the if() function you need to make sure your input is singular (of length 1).

+ +

Tip: Built-in ifelse() function +


R accepts both if() and else if() statements structured as outlined above, but also statements using R’s built-in ifelse() function. This function accepts both singular and vector inputs and is structured as follows:


R +

# ifelse function
+ifelse(condition is true, perform action, perform alternative action)

where the first argument is the condition or a set of conditions to be met, the second argument is the value returned when the condition is TRUE, and the third argument is the value returned when the condition is FALSE.


R +

+y <- -3
+ifelse(y < 0, "y is a negative number", "y is either positive or zero")


[1] "y is a negative number"
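
Because ifelse() also accepts vector inputs, the same call can label several values at once. A minimal sketch, with a made-up vector y_values and result name y_labels:

```r
y_values <- c(-3, 0, 2.5, -1)

# The condition is checked element by element,
# and one label is returned per element
y_labels <- ifelse(y_values < 0, "negative", "positive or zero")
y_labels
```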
+ +

Tip: any() and all() +


The any() function will return TRUE if at least one TRUE value is found within a vector, otherwise it will return FALSE. This can be used in a similar way to the %in% operator. The function all(), as the name suggests, will only return TRUE if all values in the vector are TRUE.
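
A short sketch of how any() and all() reduce a logical vector to a single value that is safe to use inside if(). The vector years is a made-up stand-in for gapminder$year:

```r
years <- c(1952, 1957, 2002, 2007)  # hypothetical stand-in for gapminder$year

has_2002  <- any(years == 2002)   # at least one element matches
only_2002 <- all(years == 2002)   # every element would have to match
in_2002   <- 2002 %in% years      # same answer as any() in this case

# any() returns a single TRUE or FALSE, which is what if() needs
if (any(years == 2012)) {
  msg <- "Record(s) for the year 2012 found."
} else {
  msg <- "No records for the year 2012."
}
```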


Repeating operations +


If you want to iterate over a set of values, when the order of iteration is important, and perform the same operation on each, a for() loop will do the job. We saw for() loops in the shell lessons earlier. This is the most flexible of looping operations, but therefore also the hardest to use correctly. In general, the advice of many R users would be to learn about for() loops, but to avoid using for() loops unless the order of iteration is important: i.e. the calculation at each iteration depends on the results of previous iterations. If the order of iteration is not important, then you should learn about vectorized alternatives, such as the purrr package, as they pay off in computational efficiency.


The basic structure of a for() loop is:


R +

for (iterator in set of values) {
+  do a thing
+}

For example:


R +

+for (i in 1:10) {
+  print(i)
+}


[1] 1
+[1] 2
+[1] 3
+[1] 4
+[1] 5
+[1] 6
+[1] 7
+[1] 8
+[1] 9
+[1] 10

The 1:10 bit creates a vector on the fly; you can iterate over any other vector as well.
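
For instance, a loop can run over a character vector directly, or over its positions when the index is needed as well. This is only a sketch; the vector continents and the result n_chars are made up for illustration.

```r
continents <- c("Africa", "Asia", "Europe")  # hypothetical values

# Iterate over the elements themselves
for (continent in continents) {
  print(continent)
}

# Iterate over the positions with seq_along() when the index matters,
# here to store the number of characters of each name
n_chars <- integer(length(continents))
for (i in seq_along(continents)) {
  n_chars[i] <- nchar(continents[i])
}
n_chars
```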


We can use a for() loop nested within another for() loop to iterate over two things at once.


R +

+for (i in 1:5) {
+  for (j in c('a', 'b', 'c', 'd', 'e')) {
+    print(paste(i,j))
+  }
+}


[1] "1 a"
+[1] "1 b"
+[1] "1 c"
+[1] "1 d"
+[1] "1 e"
+[1] "2 a"
+[1] "2 b"
+[1] "2 c"
+[1] "2 d"
+[1] "2 e"
+[1] "3 a"
+[1] "3 b"
+[1] "3 c"
+[1] "3 d"
+[1] "3 e"
+[1] "4 a"
+[1] "4 b"
+[1] "4 c"
+[1] "4 d"
+[1] "4 e"
+[1] "5 a"
+[1] "5 b"
+[1] "5 c"
+[1] "5 d"
+[1] "5 e"

We notice in the output that when the first index (i) is set to 1, the second index (j) iterates through its full set of indices. Once the indices of j have been iterated through, then i is incremented. This process continues until the last index has been used for each for() loop.


Rather than printing the results, we could write the loop output to a new object.


R +

+output_vector <- c()
+for (i in 1:5) {
+  for (j in c('a', 'b', 'c', 'd', 'e')) {
+    temp_output <- paste(i, j)
+    output_vector <- c(output_vector, temp_output)
+  }
+}
+output_vector


 [1] "1 a" "1 b" "1 c" "1 d" "1 e" "2 a" "2 b" "2 c" "2 d" "2 e" "3 a" "3 b"
+[13] "3 c" "3 d" "3 e" "4 a" "4 b" "4 c" "4 d" "4 e" "5 a" "5 b" "5 c" "5 d"
+[25] "5 e"

This approach can be useful, but ‘growing your results’ (building the result object incrementally) is computationally inefficient, so avoid it when you are iterating through a lot of values.

+ +

Tip: don’t grow your results +


One of the biggest things that trips up novices and experienced R users alike is building a results object (vector, list, matrix, data frame) as your for loop progresses. Computers are very bad at handling this, so your calculations can very quickly slow to a crawl. It’s much better to define an empty results object of appropriate dimensions beforehand, rather than initializing an empty object without dimensions. So if you know the end result will be stored in a 5 by 5 matrix, create an empty matrix with 5 rows and 5 columns, then at each iteration store the results in the appropriate location.


A better way is to define your (empty) output object before filling in the values. For this example, it looks more involved, but is still more efficient.


R +

+output_matrix <- matrix(nrow = 5, ncol = 5)
+j_vector <- c('a', 'b', 'c', 'd', 'e')
+for (i in 1:5) {
+  for (j in 1:5) {
+    temp_j_value <- j_vector[j]
+    temp_output <- paste(i, temp_j_value)
+    output_matrix[i, j] <- temp_output
+  }
+}
+output_vector2 <- as.vector(output_matrix)
+output_vector2


 [1] "1 a" "2 a" "3 a" "4 a" "5 a" "1 b" "2 b" "3 b" "4 b" "5 b" "1 c" "2 c"
+[13] "3 c" "4 c" "5 c" "1 d" "2 d" "3 d" "4 d" "5 d" "1 e" "2 e" "3 e" "4 e"
+[25] "5 e"
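
The same idea applies to plain vectors. As a minimal sketch (the names n, squares and squares_vec are made up), the result is allocated once at its final length and then filled in place; where a vectorized form exists, it avoids the loop altogether.

```r
n <- 5

# Allocate the full-length result once, before the loop
squares <- numeric(n)
for (i in seq_len(n)) {
  squares[i] <- i^2   # fill position i in place
}

# Vectorized equivalent: no loop and no growing object
squares_vec <- seq_len(n)^2
```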
+ +

Tip: While loops +


Sometimes you will find yourself needing to repeat an operation as long as a certain condition is met. You can do this with a while() loop.


R +

while(this condition is true){
+  do a thing
+}

R will interpret a condition being met as “TRUE”.


As an example, here’s a while loop that generates random numbers from a uniform distribution (the runif() function) between 0 and 1 until it gets one that’s less than 0.1.


R +

+z <- 1
+while(z > 0.1){
+  z <- runif(1)
+  cat(z, "\n")
+}

while() loops will not always be appropriate. You have to be particularly careful that you don’t end up stuck in an infinite loop because your condition is always met and hence the while statement never terminates.
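
One common way to guard against an infinite loop is to add a counter with an upper limit to the condition. The sketch below is only one possible approach; the names n_iter and max_iter and the limit of 1000 are assumptions made for this example.

```r
set.seed(42)       # fix the random numbers so the run is repeatable
z <- 1
n_iter <- 0
max_iter <- 1000   # hypothetical safety cap on the number of iterations

# Stop when z is small enough OR when the cap is reached
while (z > 0.1 && n_iter < max_iter) {
  z <- runif(1)
  n_iter <- n_iter + 1
}
cat("Stopped after", n_iter, "iterations with z =", z, "\n")
```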

+ +

Challenge 2 +


Compare the objects output_vector and output_vector2. Are they the same? If not, why not? How would you change the last block of code to make output_vector2 the same as output_vector?

+ +

We can check whether the two vectors are identical using the all() function:


R +

+all(output_vector == output_vector2)


[1] FALSE

However, all the elements of output_vector can be found in output_vector2:


R +

+all(output_vector %in% output_vector2)


[1] TRUE

and vice versa:


R +

+all(output_vector2 %in% output_vector)


[1] TRUE

Therefore, the elements in output_vector and output_vector2 are just sorted in a different order. This is because as.vector() outputs the elements of an input matrix column by column. Taking a look at output_matrix, we can see that we want its elements by rows. The solution is to transpose output_matrix. We can do it either by calling the transpose function t() or by inputting the elements in the right order. The first solution requires changing the original


R +

+output_vector2 <- as.vector(output_matrix)

into

R +

+output_vector2 <- as.vector(t(output_matrix))

The second solution requires changing


R +

+output_matrix[i, j] <- temp_output

into

R +

+output_matrix[j, i] <- temp_output
+ +

Challenge 3 +


Write a script that loops through the gapminder data by continent and prints out whether the mean life expectancy is smaller or larger than 50 years.

+ +

Step 1: We want to make sure we can extract all the unique values of the continent vector


R +

+gapminder <- read.csv("data/gapminder_data.csv")
+unique(gapminder$continent)

Step 2: We also need to loop over each of these continents and calculate the average life expectancy for each subset of data. We can do that as follows:

  1. Loop over each of the unique values of ‘continent’
  2. For each value of continent, create a temporary variable storing that subset
  3. Return the calculated life expectancy to the user by printing the output:

R +

+for (iContinent in unique(gapminder$continent)) {
+  tmp <- gapminder[gapminder$continent == iContinent, ]
+  cat(iContinent, mean(tmp$lifeExp, na.rm = TRUE), "\n")
+  rm(tmp)
+}

Step 3: The exercise only wants the output printed if the average life expectancy is less than 50 or greater than 50. So we need to add an if() condition before printing, which evaluates whether the calculated average life expectancy is above or below a threshold, and prints an output conditional on the result. We need to amend (3) from above:


3a. If the calculated life expectancy is less than some threshold (50 years), return the continent and a statement that life expectancy is less than threshold, otherwise return the continent and a statement that life expectancy is greater than threshold:


R +

+thresholdValue <- 50
+for (iContinent in unique(gapminder$continent)) {
+   tmp <- mean(gapminder[gapminder$continent == iContinent, "lifeExp"])
+   if (tmp < thresholdValue){
+       cat("Average Life Expectancy in", iContinent, "is less than", thresholdValue, "\n")
+   } else {
+       cat("Average Life Expectancy in", iContinent, "is greater than", thresholdValue, "\n")
+   } # end if else condition
+   rm(tmp)
+} # end for loop
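
Since the order of iteration does not matter here, the same summary can also be computed without an explicit loop. The sketch below is not part of the original solution: it swaps in base R's tapply() and ifelse(), and it uses a made-up miniature data frame (mini) in place of gapminder so that it runs on its own.

```r
# Hypothetical miniature stand-in for the gapminder data frame
mini <- data.frame(
  continent = c("Africa", "Africa", "Europe", "Europe"),
  lifeExp   = c(45, 51, 70, 74)
)

thresholdValue <- 50

# Mean life expectancy per continent, computed in one call
mean_by_continent <- tapply(mini$lifeExp, mini$continent, mean)

# One label per continent, depending on the threshold
status <- ifelse(mean_by_continent < thresholdValue,
                 "less than threshold", "threshold or greater")
status
```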
+ +

Challenge 4 +


Modify the script from Challenge 3 to loop over each country. This time print out whether the life expectancy is smaller than 50, between 50 and 70, or greater than 70.

+ +

We modify our solution to Challenge 3 by now adding two thresholds, lowerThreshold and upperThreshold, and extending our if-else statements:


R +

+lowerThreshold <- 50
+upperThreshold <- 70
+for (iCountry in unique(gapminder$country)) {
+    tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"])
+    if(tmp < lowerThreshold) {
+        cat("Average Life Expectancy in", iCountry, "is less than", lowerThreshold, "\n")
+    } else if(tmp < upperThreshold) {  # reached only when tmp >= lowerThreshold
+        cat("Average Life Expectancy in", iCountry, "is between", lowerThreshold, "and", upperThreshold, "\n")
+    } else {
+        cat("Average Life Expectancy in", iCountry, "is greater than", upperThreshold, "\n")
+    }
+    rm(tmp)
+} # end for loop
+ +

Challenge 5 - Advanced +


Write a script that loops over each country in the gapminder dataset, tests whether the country starts with a ‘B’, and graphs life expectancy against time as a line graph if the mean life expectancy is under 50 years.

+ +

We will use the grep() command that was introduced in the Unix Shell lesson to find countries that start with “B.” Let’s understand how to do this first. Following from the Unix shell section we may be tempted to try the following:


R +

+grep("^B", unique(gapminder$country))

But when we evaluate this command it returns the indices of the country values that start with “B.” To get the values, we must add the value=TRUE option to the grep() command:


R +

+grep("^B", unique(gapminder$country), value = TRUE)
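
The difference between the two calls can be seen on a small made-up vector (countries below is a hypothetical sample, not the gapminder data). The sketch also shows base R's startsWith(), which returns a logical vector and is an alternative when no regular expression is needed.

```r
countries <- c("Belgium", "Brazil", "Chile", "Gabon")  # hypothetical sample

idx  <- grep("^B", countries)                 # positions of the matches
vals <- grep("^B", countries, value = TRUE)   # the matching strings themselves

# Indexing with the positions gives the same strings
same_vals <- countries[idx]

# Logical alternative: TRUE where the string starts with "B"
starts_b <- startsWith(countries, "B")
```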

We will now store these countries in a variable called candidateCountries, and then loop over each entry in the variable. Inside the loop, we evaluate the average life expectancy for each country, and if the average life expectancy is less than 50 we use base graphics to plot the evolution of life expectancy over time using with() and subset():


R +

+thresholdValue <- 50
+candidateCountries <- grep("^B", unique(gapminder$country), value = TRUE)
+for (iCountry in candidateCountries) {
+    tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"])
+    if (tmp < thresholdValue) {
+        cat("Average Life Expectancy in", iCountry, "is less than", thresholdValue, "plotting life expectancy graph... \n")
+        with(subset(gapminder, country == iCountry),
+                plot(year, lifeExp,
+                     type = "o",
+                     main = paste("Life Expectancy in", iCountry, "over time"),
+                     ylab = "Life Expectancy",
+                     xlab = "Year"
+                     ) # end plot
+             ) # end with
+    } # end if
+    rm(tmp)
+} # end for loop
+ +

Keypoints +

  • Use if and else to make choices.
  • Use for to repeat operations.
+ + +
+ +
+ + diff --git a/08-plot-ggplot2.html b/08-plot-ggplot2.html new file mode 100644 index 000000000..c9592d5af --- /dev/null +++ b/08-plot-ggplot2.html @@ -0,0 +1,1105 @@ + +R for Reproducible Scientific Analysis: Creating Publication-Quality Graphics with ggplot2 +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Creating Publication-Quality Graphics with ggplot2


Last updated on 2023-10-26

+ + + +
+ +
+ + + +




  • How can I create publication-quality graphics in R?


  • To be able to use ggplot2 to generate publication-quality graphics.
  • To apply geometry, aesthetic, and statistics layers to a ggplot plot.
  • To manipulate the aesthetics of a plot using different colors, shapes, and lines.
  • To improve data visualization through transforming scales and paneling by group.
  • To save a plot created with ggplot to disk.

Plotting our data is one of the best ways to quickly explore it and the various relationships between variables.


There are three main plotting systems in R: the base plotting system, the lattice package, and the ggplot2 package.


Today we’ll be learning about the ggplot2 package, because it is the most effective for creating publication-quality graphics.


ggplot2 is built on the grammar of graphics, the idea that any plot can be built from the same set of components: a data set, mapping aesthetics, and graphical layers:

  • Data sets are the data that you, the user, provide.

  • Mapping aesthetics are what connect the data to the graphics. They tell ggplot2 how to use your data to affect how the graph looks, such as changing what is plotted on the X or Y axis, or the size or color of different data points.

  • Layers are the actual graphical output from ggplot2. Layers determine what kinds of plot are shown (scatterplot, histogram, etc.), the coordinate system used (rectangular, polar, others), and other important aspects of the plot. The idea of layers of graphics may be familiar to you if you have used image editing programs like Photoshop, Illustrator, or Inkscape.

Let’s start off building an example using the gapminder data from earlier. The most basic function is ggplot, which lets R know that we’re creating a new plot. Any of the arguments we give the ggplot function are the global options for the plot: they apply to all layers on the plot.


R +

+ggplot(data = gapminder)
Blank plot, before adding any mapping aesthetics to ggplot().

Here we called ggplot and told it what data we want to show on our figure. This is not enough information for ggplot to actually draw anything. It only creates a blank slate for other elements to be added to.


Now we’re going to add in the mapping aesthetics using the aes function. aes tells ggplot how variables in the data map to aesthetic properties of the figure, such as which columns of the data should be used for the x and y locations.


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
Plotting area with axes for a scatter plot of life expectancy vs GDP, with no data points visible.

Here we told ggplot we want to plot the “gdpPercap” column of the gapminder data frame on the x-axis, and the “lifeExp” column on the y-axis. Notice that we didn’t need to explicitly pass aes these columns (e.g. x = gapminder[, "gdpPercap"]); this is because ggplot is smart enough to know to look in the data for that column!


The final part of making our plot is to tell ggplot how we want to visually represent the data. We do this by adding a new layer to the plot using one of the geom functions.


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point()
Scatter plot of life expectancy vs GDP per capita, now showing the data points.

Here we used geom_point, which tells ggplot we want to visually represent the relationship between x and y as a scatterplot of points.

+ +

Challenge 1 +


Modify the example so that the figure shows how life expectancy has changed over time:


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point()

Hint: the gapminder dataset has a column called “year”, which should appear on the x-axis.

+ +

Here is one possible solution:


R +

+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) + geom_point()
Binned scatterplot of life expectancy versus year showing how life expectancy has increased over time
+ +

Challenge 2 +


In the previous examples and challenge we’ve used the aes function to tell the scatterplot geom about the x and y locations of each point. Another aesthetic property we can modify is the point color. Modify the code from the previous challenge to color the points by the “continent” column. What trends do you see in the data? Are they what you expected?

+ +

The solution presented below adds color=continent to the call of the aes function. The general trend seems to indicate an increased life expectancy over the years. On continents with stronger economies we find a longer life expectancy.


R +

+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_point()
Binned scatterplot of life expectancy vs year with color-coded continents showing value of 'aes' function

Layers +


Using a scatterplot probably isn’t the best for visualizing change over time. Instead, let’s tell ggplot to visualize the data as a line plot:


R +

+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, color=continent)) +
+  geom_line()

Instead of adding a geom_point layer, we’ve added a geom_line layer.


However, the result doesn’t look quite as we might have expected: it seems to be jumping around a lot in each continent. Let’s try to separate the data by country, plotting one line for each country:


R +

+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
+  geom_line()

We’ve added the group aesthetic, which tells ggplot to draw a line for each country.


But what if we want to visualize both lines and points on the plot? We can add another layer to the plot:


R +

+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
+  geom_line() + geom_point()

It’s important to note that each layer is drawn on top of the previous layer. In this example, the points have been drawn on top of the lines. Here’s a demonstration:


R +

+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
+  geom_line(mapping = aes(color=continent)) + geom_point()

In this example, the aesthetic mapping of color has been moved from the global plot options in ggplot to the geom_line layer so it no longer applies to the points. Now we can clearly see that the points are drawn on top of the lines.

+ +

Tip: Setting an aesthetic to a value instead of a mapping +


So far, we’ve seen how to use an aesthetic (such as color) as a mapping to a variable in the data. For example, when we use geom_line(mapping = aes(color=continent)), ggplot will give a different color to each continent. But what if we want to change the color of all lines to blue? You may think that geom_line(mapping = aes(color="blue")) should work, but it doesn’t. Since we don’t want to create a mapping to a specific variable, we can move the color specification outside of the aes() function, like this: geom_line(color="blue").
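
The difference between mapping and setting can be sketched side by side. The data frame mini below is a made-up stand-in for gapminder (its countries "A" and "B" and continents "X" and "Y" are hypothetical), used only so the example runs on its own; it assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Hypothetical miniature data set standing in for gapminder
mini <- data.frame(
  year      = c(1952, 2007, 1952, 2007),
  lifeExp   = c(40, 55, 68, 80),
  country   = c("A", "A", "B", "B"),
  continent = c("X", "X", "Y", "Y")
)

# Mapping: the color depends on a column, so it goes inside aes()
p_mapped <- ggplot(data = mini, mapping = aes(x = year, y = lifeExp, group = country)) +
  geom_line(mapping = aes(color = continent))

# Setting: one fixed color for every line, so it goes outside aes()
p_fixed <- ggplot(data = mini, mapping = aes(x = year, y = lifeExp, group = country)) +
  geom_line(color = "blue")
```

Printing p_mapped shows one color per continent plus a legend, while p_fixed draws every line in blue and has no legend, because no variable is mapped to color.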

+ +

Challenge 3 +


Switch the order of the point and line layers from the previous example. What happened?

+ +

The lines now get drawn over the points!


R +

+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
+ geom_point() + geom_line(mapping = aes(color=continent))
Plot of life expectancy vs year with one line per country, coloured by continent, drawn on top of the data points.

Transformations and statistics +


ggplot2 also makes it easy to overlay statistical models over the data. To demonstrate we’ll go back to our first example:


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point()

Currently it’s hard to see the relationship between the points due to some strong outliers in GDP per capita. We can change the scale of units on the x axis using the scale functions. These control the mapping between the data values and visual values of an aesthetic. We can also modify the transparency of the points, using the alpha argument, which is especially helpful when you have a large amount of data which is very clustered.


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10()
Scatterplot of GDP vs life expectancy showing logarithmic x-axis data spread

The scale_x_log10 function applied a log10 transformation to the x-axis scale, so that each multiple of 10 is evenly spaced from left to right. For example, a GDP per capita of 1,000 is the same horizontal distance away from a value of 10,000 as the 10,000 value is from 100,000. This helps to visualize the spread of the data along the x-axis.
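The equal spacing can be checked numerically. The sketch below uses base R only and a made-up vector `gdp_values` (not taken from the gapminder data) to compare the gaps on the raw scale with the gaps on the log10 scale:

```r
# Hypothetical GDP per capita values, each ten times the previous one
gdp_values <- c(1000, 10000, 100000)

# Gaps on the original scale grow tenfold
raw_gaps <- diff(gdp_values)          # 9000 90000

# Gaps on the log10 scale are all equal
log_gaps <- diff(log10(gdp_values))   # 1 1
```

This is why points that are bunched up near zero on the raw axis spread out once the axis is logarithmic.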

+ +

Tip Reminder: Setting an aesthetic to a value +instead of a mapping +


Notice that we used geom_point(alpha = 0.5). As the +previous tip mentioned, using a setting outside of the +aes() function will cause this value to be used for all +points, which is what we want in this case. But just like any other +aesthetic setting, alpha can also be mapped to a variable in +the data. For example, we can give a different transparency to each +continent with +geom_point(mapping = aes(alpha = continent)).


We can fit a simple relationship to the data by adding another layer, +geom_smooth:


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm")


`geom_smooth()` using formula = 'y ~ x'
Scatter plot of life expectancy vs GDP per capita with a blue trend line summarising the relationship between variables, and gray shaded area indicating 95% confidence intervals for that trend line.

We can make the line thicker by setting the +size aesthetic in the geom_smooth +layer:


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm", size=1.5)


Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
+ℹ Please use `linewidth` instead.
+This warning is displayed once every 8 hours.
+Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
+generated.


`geom_smooth()` using formula = 'y ~ x'
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The blue trend line is slightly thicker than in the previous figure.

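There are two ways an aesthetic can be specified. Here we set the size aesthetic by passing it as an argument to geom_smooth. Previously in the lesson we’ve used the aes function to define a mapping between data variables and their visual representation. As the warning above indicates, since ggplot2 3.4.0 the preferred name for the thickness of lines is linewidth; size still works for lines but is deprecated.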

+ +

Challenge 4a +


Modify the color and size of the points on the point layer in the +previous example.


Hint: do not use the aes function.

+ +

Here is a possible solution: Notice that the color and size arguments are supplied outside of the aes() function. This means that they apply to all data points on the graph and are not related to a specific variable.


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+ geom_point(size=3, color="orange") + scale_x_log10() +
+ geom_smooth(method="lm", size=1.5)


`geom_smooth()` using formula = 'y ~ x'
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency.
+ +

Challenge 4b +


Modify your solution to Challenge 4a so that the points are now a +different shape and are colored by continent with new trendlines. Hint: +The color argument can be used inside the aesthetic.

+ +

Here is a possible solution: Notice that supplying the color argument inside the aes() function enables you to connect it to a certain variable. The shape argument, as you can see, modifies all data points the same way (it is outside the aes() call) while the color argument, which is placed inside the aes() call, modifies a point’s color based on its continent value.


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
+ geom_point(size=3, shape=17) + scale_x_log10() +
+ geom_smooth(method="lm", size=1.5)


`geom_smooth()` using formula = 'y ~ x'

Multi-panel figures +


Earlier we visualized the change in life expectancy over time across +all countries in one plot. Alternatively, we can split this out over +multiple panels by adding a layer of facet panels.

+ +

Tip +


We start by making a subset of data including only countries located +in the Americas. This includes 25 countries, which will begin to clutter +the figure. Note that we apply a “theme” definition to rotate the x-axis +labels to maintain readability. Nearly everything in ggplot2 is +customizable.


R +

+americas <- gapminder[gapminder$continent == "Americas",]
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))

The facet_wrap layer took a “formula” as its argument, +denoted by the tilde (~). This tells R to draw a panel for each unique +value in the country column of the gapminder dataset.


Modifying text +


To clean this figure up for a publication we need to change some of +the text elements. The x-axis is too cluttered, and the y axis should +read “Life expectancy”, rather than the column name in the data +frame.


We can do this by adding a couple of different layers. The +theme layer controls the axis text, and overall text +size. Labels for the axes, plot title and any legend can be set using +the labs function. Legend titles are set using the same +names we used in the aes specification. Thus below the +color legend title is set using color = "Continent", while +the title of a fill legend would be set using +fill = "MyTitle".


R +

+ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_line() + facet_wrap( ~ country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Exporting the plot +


The ggsave() function allows you to export a plot +created with ggplot. You can specify the dimension and resolution of +your plot by adjusting the appropriate arguments (width, +height and dpi) to create high quality +graphics for publication. In order to save the plot from above, we first +assign it to a variable lifeExp_plot, then tell +ggsave to save that plot in png format to a +directory called results. (Make sure you have a +results/ folder in your working directory.)


R +

+lifeExp_plot <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_line() + facet_wrap( ~ country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+ggsave(filename = "results/lifeExp.png", plot = lifeExp_plot, width = 12, height = 10, dpi = 300, units = "cm")

There are two nice things about ggsave. First, it +defaults to the last plot, so if you omit the plot argument +it will automatically save the last plot you created with +ggplot. Secondly, it tries to determine the format you want +to save your plot in from the file extension you provide for the +filename (for example .png or .pdf). If you +need to, you can specify the format explicitly in the +device argument.


This is a taste of what you can do with ggplot2. RStudio provides a +really useful cheat +sheet of the different layers available, and more extensive +documentation is available on the ggplot2 website. All +RStudio cheat sheets can be found here. Finally, +if you have no idea how to change something, a quick Google search will +usually send you to a relevant question and answer on Stack Overflow +with reusable code to modify!

+ +

Challenge 5 +


Generate boxplots to compare life expectancy between the different +continents during the available years.



  • Rename y axis as Life Expectancy.
  • Remove x axis labels.
+ +

Here is a possible solution: xlab() and ylab() set labels for the x and y axes, respectively. The axis title, text and ticks are attributes of the theme and must be modified within a theme() call.


R +

+ggplot(data = gapminder, mapping = aes(x = continent, y = lifeExp, fill = continent)) +
+ geom_boxplot() + facet_wrap(~year) +
+ ylab("Life Expectancy") +
+ theme(axis.title.x=element_blank(),
+       axis.text.x = element_blank(),
+       axis.ticks.x = element_blank())
+ +

Keypoints +

  • Use ggplot2 to create plots.
  • Think about graphics in layers: aesthetics, geometry, statistics, scale transformation, and grouping.
+ + +
+ + diff --git a/09-vectorization.html b/09-vectorization.html new file mode 100644 index 000000000..663ee4ba0 --- /dev/null +++ b/09-vectorization.html @@ -0,0 +1,1020 @@ + +R for Reproducible Scientific Analysis: Vectorization +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +



Last updated on 2023-10-26

+ + + +
+ +
+ + + +




  • How can I operate on all the elements of a vector at once?


  • To understand vectorized operations in R.

Most of R’s functions are vectorized, meaning that the function will operate on all elements of a vector without needing to loop through and act on each element one at a time. This makes code more concise, easier to read, and less error prone.


R +

+x <- 1:4
+x * 2


[1] 2 4 6 8

The multiplication happened to each element of the vector.


We can also add two vectors together:


R +

+y <- 6:9
+x + y


[1]  7  9 11 13

Each element of x was added to its corresponding element +of y:


R +

x:  1  2  3  4
+    +  +  +  +
+y:  6  7  8  9
+    7  9 11 13

Here is how we would add two vectors together using a for loop:


R +

+output_vector <- c()
+for (i in 1:4) {
+  output_vector[i] <- x[i] + y[i]
+}
+output_vector


[1]  7  9 11 13

Compare this to the output using vectorised operations.


R +

+sum_xy <- x + y
+sum_xy


[1]  7  9 11 13
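To confirm that the two approaches agree, here is a small self-contained sketch. The names `loop_sum` and `vec_sum` are illustrative, and the loop pre-allocates its result with `numeric()` rather than growing an empty vector, which is a common variation on the loop shown above:

```r
x <- 1:4
y <- 6:9

# Loop version: pre-allocate the result, then fill it element by element
loop_sum <- numeric(length(x))
for (i in seq_along(x)) {
  loop_sum[i] <- x[i] + y[i]
}

# Vectorised version: one expression, no explicit loop
vec_sum <- x + y

# Both approaches give the same values
same_values <- all(loop_sum == vec_sum)   # TRUE
```

The vectorised form is shorter and leaves no index arithmetic to get wrong.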
+ +

Challenge 1 +


Let’s try this on the pop column of the +gapminder dataset.


Make a new column in the gapminder data frame that +contains population in units of millions of people. Check the head or +tail of the data frame to make sure it worked.

+ +


R +

+gapminder$pop_millions <- gapminder$pop / 1e6
+head(gapminder)


      country year      pop continent lifeExp gdpPercap pop_millions
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453     8.425333
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530     9.240934
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007    10.267083
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971    11.537966
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811    13.079460
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134    14.880372
+ +

Challenge 2 +


On a single graph, plot population, in millions, against year, for +all countries. Do not worry about identifying which country is +which.


Repeat the exercise, graphing only for China, India, and Indonesia. +Again, do not worry about which is which.

+ +

Refresh your plotting skills by plotting population in millions +against year.


R +

+ggplot(gapminder, aes(x = year, y = pop_millions)) +
+ geom_point()
Scatter plot showing populations in the millions against the year for all countries, countries are not labeled.

R +

+countryset <- c("China","India","Indonesia")
+ggplot(gapminder[gapminder$country %in% countryset,],
+       aes(x = year, y = pop_millions)) +
+  geom_point()
Scatter plot showing populations in the millions against the year for China, India, and Indonesia, countries are not labeled.

Comparison operators, logical operators, and many functions are also +vectorized:


Comparison operators


R +

+x > 2


[1] FALSE FALSE  TRUE  TRUE

Logical operators


R +

+a <- x > 3  # or, for clarity, a <- (x > 3)
+a


[1] FALSE FALSE FALSE  TRUE
+ +

Tip: some useful functions for logical +vectors +


any() will return TRUE if any +element of a vector is TRUE.
all() will return TRUE if all +elements of a vector are TRUE.
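A short sketch of these two functions on the vector used throughout this episode. The variable names are illustrative, and the final line uses `sum()` on a logical vector, which is an addition not covered in the tip itself:

```r
x <- 1:4

# Is at least one element greater than 3?
any_big <- any(x > 3)    # TRUE, because 4 > 3

# Are all elements greater than 3?
all_big <- all(x > 3)    # FALSE, because 1, 2 and 3 are not

# Are all elements positive?
all_pos <- all(x > 0)    # TRUE

# TRUE counts as 1 when summed, so this counts the elements greater than 2
n_big <- sum(x > 2)      # 2
```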


Most functions also operate element-wise on vectors:




R +

+x <- 1:4
+log(x)


[1] 0.0000000 0.6931472 1.0986123 1.3862944

Vectorized operations work element-wise on matrices:


R +

+m <- matrix(1:12, nrow=3, ncol=4)
+m * -1


     [,1] [,2] [,3] [,4]
+[1,]   -1   -4   -7  -10
+[2,]   -2   -5   -8  -11
+[3,]   -3   -6   -9  -12
+ +

Tip: element-wise vs. matrix +multiplication +


Very important: the operator * gives you element-wise +multiplication! To do matrix multiplication, we need to use the +%*% operator:


R +

+m %*% matrix(1, nrow=4, ncol=1)


     [,1]
+[1,]   22
+[2,]   26
+[3,]   30

R +

+matrix(1:4, nrow=1) %*% matrix(1:4, ncol=1)


     [,1]
+[1,]   30

For more on matrix algebra, see the Quick-R +reference guide
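The difference is easiest to see on a small square matrix, where both operators are valid but give different answers. This is a sketch with a made-up 2 x 2 matrix `a`:

```r
# matrix() fills column by column, so the columns of `a` are (1, 2) and (3, 4)
a <- matrix(1:4, nrow = 2)

# Element-wise: each entry is multiplied by the entry in the same position
elementwise <- a * a      # columns (1, 4) and (9, 16)

# Matrix product: rows of the first matrix times columns of the second
matprod <- a %*% a        # columns (7, 10) and (15, 22)
```

For example, the top-left entry of `matprod` is 1*1 + 3*2 = 7, whereas the top-left entry of `elementwise` is simply 1*1 = 1.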

+ +

Challenge 3 +


Given the following matrix:


R +

+m <- matrix(1:12, nrow=3, ncol=4)


     [,1] [,2] [,3] [,4]
+[1,]    1    4    7   10
+[2,]    2    5    8   11
+[3,]    3    6    9   12

Write down what you think will happen when you run:

  1. m ^ -1
  2. m * c(1, 0, -1)
  3. m > c(0, 20)
  4. m * c(1, 0, -1, 2)

Did you get the output you expected? If not, ask a helper!

+ +

  1. m ^ -1


          [,1]      [,2]      [,3]       [,4]
+[1,] 1.0000000 0.2500000 0.1428571 0.10000000
+[2,] 0.5000000 0.2000000 0.1250000 0.09090909
+[3,] 0.3333333 0.1666667 0.1111111 0.08333333
  2. m * c(1, 0, -1)


     [,1] [,2] [,3] [,4]
+[1,]    1    4    7   10
+[2,]    0    0    0    0
+[3,]   -3   -6   -9  -12
  3. m > c(0, 20)


      [,1]  [,2]  [,3]  [,4]
+[1,]  TRUE FALSE  TRUE FALSE
+[2,] FALSE  TRUE FALSE  TRUE
+[3,]  TRUE FALSE  TRUE FALSE
+ +

Challenge 4 +


We’re interested in looking at the sum of the following sequence of +fractions:


R +

+ x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2)

This would be tedious to type out, and impossible for high values of +n. Use vectorisation to compute x when n=100. What is the sum when +n=10,000?

+ +

We’re interested in looking at the sum of the following sequence of +fractions:


R +

+ x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2)

This would be tedious to type out, and impossible for high values of +n. Can you use vectorisation to compute x, when n=100? How about when +n=10,000?


R +

+sum(1/(1:100)^2)


[1] 1.634984

R +

+sum(1/(1:1e04)^2)


[1] 1.644834

R +

+n <- 10000
+sum(1/(1:n)^2)


[1] 1.644834

We can also obtain the same results using a function:


R +

+inverse_sum_of_squares <- function(n) {
+  sum(1/(1:n)^2)
+}
+inverse_sum_of_squares(100)


[1] 1.634984

R +

+inverse_sum_of_squares(10000)


[1] 1.644834

R +

+n <- 10000
+inverse_sum_of_squares(n)


[1] 1.644834
+ +

Tip: Operations on vectors of unequal +length +


Operations can also be performed on vectors of unequal length, through a process known as recycling. This process automatically repeats the shorter vector until it matches the length of the longer vector. R will provide a warning if the length of the longer vector is not a multiple of the length of the shorter vector.


R +

+x <- c(1, 2, 3)
+y <- c(1, 2, 3, 4, 5, 6, 7)
+x + y


Warning in x + y: longer object length is not a multiple of shorter object
+length


[1] 2 4 6 5 7 9 8

Vector x was recycled to match the length of vector +y


R +

x:  1  2  3  1  2  3  1
+    +  +  +  +  +  +  +
+y:  1  2  3  4  5  6  7
+    2  4  6  5  7  9  8
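For contrast, here is a sketch of recycling when the lengths do divide evenly, in which case R gives no warning. The vectors `short` and `long` are made up for illustration:

```r
short <- c(10, 20)
long <- 1:6

# 6 is a multiple of 2, so `short` is repeated three times and no warning is given
recycled_sum <- short + long   # 11 22 13 24 15 26

# A single number is a vector of length 1, so it is recycled across every element
scaled <- long * 2             # 2 4 6 8 10 12
```

The second case is the same mechanism behind `x * 2` at the start of this episode.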
+ +

Keypoints +

  • Use vectorized operations instead of loops.
+ + + +
+ + +
+ + diff --git a/10-functions.html b/10-functions.html new file mode 100644 index 000000000..fb5c0fc70 --- /dev/null +++ b/10-functions.html @@ -0,0 +1,1221 @@ + +R for Reproducible Scientific Analysis: Functions Explained +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Functions Explained


Last updated on 2023-10-26

+ + + +
+ +
+ + + +




  • How can I write a new function in R?


  • Define a function that takes arguments.
  • Return a value from a function.
  • Check argument conditions with stopifnot() in functions.
  • Test a function.
  • Set default values for function arguments.
  • Explain why we should divide programs into small, single-purpose functions.

If we only had one data set to analyze, it would probably be faster +to load the file into a spreadsheet and use that to plot simple +statistics. However, the gapminder data is updated periodically, and we +may want to pull in that new information later and re-run our analysis +again. We may also obtain similar data from a different source in the +future.


In this lesson, we’ll learn how to write a function so that we can +repeat several operations with a single command.

+ +

What is a function? +


Functions gather a sequence of operations into a whole, preserving it +for ongoing use. Functions provide:

  • a name we can remember and invoke it by
  • relief from the need to remember the individual operations
  • a defined set of inputs and expected outputs
  • rich connections to the larger programming environment

As the basic building block of most programming languages, +user-defined functions constitute “programming” as much as any single +abstraction can. If you have written a function, you are a computer +programmer.


Defining a function +


Let’s open a new R script file in the functions/ +directory and call it functions-lesson.R.


The general structure of a function is:


R +

+my_function <- function(parameters) {
+  # perform action
+  # return value
+}

Let’s define a function fahr_to_kelvin() that converts +temperatures from Fahrenheit to Kelvin:


R +

+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}

We define fahr_to_kelvin() by assigning it to the output of function. The list of argument names is contained within parentheses. Next, the body of the function (the statements that are executed when it runs) is contained within curly braces ({}). The statements in the body are indented by two spaces. This makes the code easier to read but does not affect how the code operates.


It is useful to think of creating functions like writing a cookbook. First you define the “ingredients” that your function needs. In this case, we only need one ingredient to use our function: “temp”. After we list our ingredients, we then say what we will do with them. In this case, we take our ingredient and apply a set of mathematical operators to it.


When we call the function, the values we pass to it as arguments are +assigned to those variables so that we can use them inside the function. +Inside the function, we use a return statement to send a +result back to whoever asked for it.

+ +

Tip +


In R, the return statement is not required: R automatically returns the value of the last expression evaluated in the body of the function. But for clarity, we will explicitly define the return statement.
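As a sketch of what this looks like in practice, here is a variant of the conversion function with no return statement. The name `fahr_to_kelvin_implicit` is made up for this illustration so that it does not overwrite the lesson's own function:

```r
# Hypothetical variant of fahr_to_kelvin() with no return() statement
fahr_to_kelvin_implicit <- function(temp) {
  ((temp - 32) * (5 / 9)) + 273.15   # the value of the last expression is returned
}

freezing <- fahr_to_kelvin_implicit(32)    # 273.15
boiling <- fahr_to_kelvin_implicit(212)    # 373.15
```

One thing to watch for: if the last line of the body is an assignment, the value is still returned, but invisibly, so nothing is printed at the console when the function is called on its own.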


Let’s try running our function. Calling our own function is no +different from calling any other function:


R +

+# freezing point of water
+fahr_to_kelvin(32)


[1] 273.15

R +

+# boiling point of water
+fahr_to_kelvin(212)


[1] 373.15
+ +

Challenge 1 +


Write a function called kelvin_to_celsius() that takes a +temperature in Kelvin and returns that temperature in Celsius.


Hint: To convert from Kelvin to Celsius you subtract 273.15

+ +

Write a function called kelvin_to_celsius that takes a +temperature in Kelvin and returns that temperature in Celsius


R +

+kelvin_to_celsius <- function(temp) {
+ celsius <- temp - 273.15
+ return(celsius)
+}

Combining functions +


The real power of functions comes from mixing, matching and combining +them into ever-larger chunks to get the effect we want.


Let’s define two functions that will convert temperature from +Fahrenheit to Kelvin, and Kelvin to Celsius:


R +

+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+kelvin_to_celsius <- function(temp) {
+  celsius <- temp - 273.15
+  return(celsius)
+}
+ +

Challenge 2 +


Define the function to convert directly from Fahrenheit to Celsius, +by reusing the two functions above (or using your own functions if you +prefer).

+ +

Define the function to convert directly from Fahrenheit to Celsius, +by reusing these two functions above


R +

+fahr_to_celsius <- function(temp) {
+  temp_k <- fahr_to_kelvin(temp)
+  result <- kelvin_to_celsius(temp_k)
+  return(result)
+}

Interlude: Defensive Programming +


Now that we’ve begun to appreciate how writing functions provides an +efficient way to make R code re-usable and modular, we should note that +it is important to ensure that functions only work in their intended +use-cases. Checking function parameters is related to the concept of +defensive programming. Defensive programming encourages us to +frequently check conditions and throw an error if something is wrong. +These checks are referred to as assertion statements because we want to +assert some condition is TRUE before proceeding. They make +it easier to debug because they give us a better idea of where the +errors originate.


Checking conditions with stopifnot() +


Let’s start by re-examining fahr_to_kelvin(), our +function for converting temperatures from Fahrenheit to Kelvin. It was +defined like so:


R +

+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}

For this function to work as intended, the argument temp +must be a numeric value; otherwise, the mathematical +procedure for converting between the two temperature scales will not +work. To create an error, we can use the function stop(). +For example, since the argument temp must be a +numeric vector, we could check for this condition with an +if statement and throw an error if the condition was +violated. We could augment our function above like so:


R +

+fahr_to_kelvin <- function(temp) {
+  if (!is.numeric(temp)) {
+    stop("temp must be a numeric vector.")
+  }
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}

If we had multiple conditions or arguments to check, it would take many lines of code to check all of them. Luckily R provides the convenience function stopifnot(). We can list as many requirements as we like, each of which should evaluate to TRUE; stopifnot() throws an error if it finds one that is FALSE. Listing these conditions also serves a secondary purpose as extra documentation for the function.


Let’s try out defensive programming with stopifnot() by +adding assertions to check the input to our function +fahr_to_kelvin().


We want to assert the following: temp is a numeric +vector. We may do that like so:


R +

+fahr_to_kelvin <- function(temp) {
+  stopifnot(is.numeric(temp))
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}

It still works when given proper input.


R +

+# freezing point of water
+fahr_to_kelvin(temp = 32)


[1] 273.15

But fails instantly if given improper input.


R +

+# temp is a factor instead of numeric
+fahr_to_kelvin(temp = as.factor(32))


Error in fahr_to_kelvin(temp = as.factor(32)): is.numeric(temp) is not TRUE
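As noted earlier, stopifnot() accepts any number of conditions. The sketch below shows one possible set of checks; the function name `kelvin_to_celsius_checked` and the particular conditions (no missing values, nothing below absolute zero) are assumptions made for illustration, not part of the lesson's code:

```r
# Hypothetical stricter version of kelvin_to_celsius()
kelvin_to_celsius_checked <- function(temp) {
  stopifnot(
    is.numeric(temp),   # must be numbers
    !anyNA(temp),       # no missing values
    all(temp >= 0)      # nothing below absolute zero
  )
  celsius <- temp - 273.15
  return(celsius)
}

# Valid input passes every check
ok <- kelvin_to_celsius_checked(c(0, 273.15))   # -273.15 0

# Invalid input stops at the first failing condition;
# tryCatch() captures the error message instead of halting the script
msg <- tryCatch(
  kelvin_to_celsius_checked(-5),
  error = function(e) conditionMessage(e)
)
```

The captured message names the condition that failed, which is what makes these assertions useful when debugging.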
+ +

Challenge 3 +


Use defensive programming to ensure that our +fahr_to_celsius() function throws an error immediately if +the argument temp is specified inappropriately.

+ +

Extend our previous definition of the function by adding in an +explicit call to stopifnot(). Since +fahr_to_celsius() is a composition of two other functions, +checking inside here makes adding checks to the two component functions +redundant.


R +

+fahr_to_celsius <- function(temp) {
+  stopifnot(is.numeric(temp))
+  temp_k <- fahr_to_kelvin(temp)
+  result <- kelvin_to_celsius(temp_k)
+  return(result)
+}

More on combining functions +


Now, we’re going to define a function that calculates the Gross +Domestic Product of a nation from the data available in our dataset:


R +

+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat) {
+  gdp <- dat$pop * dat$gdpPercap
+  return(gdp)
+}

We define calcGDP() by assigning it to the output of function. The list of argument names is contained within parentheses. Next, the body of the function (the statements executed when you call the function) is contained within curly braces ({}).


We’ve indented the statements in the body by two spaces. This makes +the code easier to read but does not affect how it operates.


When we call the function, the values we pass to it are assigned to +the arguments, which become variables inside the body of the +function.


Inside the function, we use the return() function to +send back the result. This return() function is optional: R +will automatically return the results of whatever command is executed on +the last line of the function.


R +

+calcGDP(head(gapminder))


[1]  6567086330  7585448670  8758855797  9648014150  9678553274 11697659231

That’s not very informative. Let’s add some more arguments so we can +extract that per year and country.


R +

+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year=NULL, country=NULL) {
+  if(!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country,]
+  }
+  gdp <- dat$pop * dat$gdpPercap
+  new <- cbind(dat, gdp=gdp)
+  return(new)
+}

If you’ve been writing these functions down into a separate R script +(a good idea!), you can load in the functions into our R session by +using the source() function:


R +

+source("functions/functions-lesson.R")

Ok, so there’s a lot going on in this function now. In plain English, +the function now subsets the provided data by year if the year argument +isn’t empty, then subsets the result by country if the country argument +isn’t empty. Then it calculates the GDP for whatever subset emerges from +the previous two steps. The function then adds the GDP as a new column +to the subsetted data and returns this as the final result. You can see +that the output is much more informative than a vector of numbers.


Let’s take a look at what happens when we specify the year:


R +

+head(calcGDP(gapminder, year=2007))


       country year      pop continent lifeExp  gdpPercap          gdp
+12 Afghanistan 2007 31889923      Asia  43.828   974.5803  31079291949
+24     Albania 2007  3600523    Europe  76.423  5937.0295  21376411360
+36     Algeria 2007 33333216    Africa  72.301  6223.3675 207444851958
+48      Angola 2007 12420476    Africa  42.731  4797.2313  59583895818
+60   Argentina 2007 40301927  Americas  75.320 12779.3796 515033625357
+72   Australia 2007 20434176   Oceania  81.235 34435.3674 703658358894

Or for a specific country:


R +

+calcGDP(gapminder, country="Australia")


     country year      pop continent lifeExp gdpPercap          gdp
+61 Australia 1952  8691212   Oceania  69.120  10039.60  87256254102
+62 Australia 1957  9712569   Oceania  70.330  10949.65 106349227169
+63 Australia 1962 10794968   Oceania  70.930  12217.23 131884573002
+64 Australia 1967 11872264   Oceania  71.100  14526.12 172457986742
+65 Australia 1972 13177000   Oceania  71.930  16788.63 221223770658
+66 Australia 1977 14074100   Oceania  73.490  18334.20 258037329175
+67 Australia 1982 15184200   Oceania  74.740  19477.01 295742804309
+68 Australia 1987 16257249   Oceania  76.320  21888.89 355853119294
+69 Australia 1992 17481977   Oceania  77.560  23424.77 409511234952
+70 Australia 1997 18565243   Oceania  78.830  26997.94 501223252921
+71 Australia 2002 19546792   Oceania  80.370  30687.75 599847158654
+72 Australia 2007 20434176   Oceania  81.235  34435.37 703658358894

Or both:


R +

+calcGDP(gapminder, year=2007, country="Australia")


     country year      pop continent lifeExp gdpPercap          gdp
+72 Australia 2007 20434176   Oceania  81.235  34435.37 703658358894

Let’s walk through the body of the function:


R +

calcGDP <- function(dat, year=NULL, country=NULL) {

Here we’ve added two arguments, year, and +country. We’ve set default arguments for both as +NULL using the = operator in the function +definition. This means that those arguments will take on those values +unless the user specifies otherwise.
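Default arguments work the same way for any function. As a minimal sketch, here is a made-up function `raise()` (not part of the lesson) whose second argument has a default:

```r
# Hypothetical function: `power` defaults to 2 unless the caller says otherwise
raise <- function(x, power = 2) {
  return(x^power)
}

squared <- raise(3)             # 9: the default power = 2 is used
cubed <- raise(3, power = 3)    # 27: the caller overrides the default
```

Using NULL as the default, as calcGDP() does, is a common way to signal "this argument was not supplied".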


R +

+  if(!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country,]
+  }

Here, we check whether each additional argument is set to NULL, and whenever they’re not NULL overwrite the dataset stored in dat with a subset given by the non-NULL argument.


Building these conditionals into the function makes it more flexible +for later. Now, we can use it to calculate the GDP for:

  • The whole dataset;
  • A single year;
  • A single country;
  • A single combination of year and country.

By using %in% rather than ==, we can also give multiple years or countries to those arguments.
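The reason %in% matters here is that == compares element by element and recycles the shorter vector, which is rarely what we want when filtering. A small sketch with made-up vectors `years` and `wanted`:

```r
years <- c(1952, 1957, 1962, 1967)
wanted <- c(1967, 1952)

# %in% asks, for each element of `years`, whether it appears anywhere in `wanted`
in_result <- years %in% wanted   # TRUE FALSE FALSE TRUE

# == pairs elements up in order, recycling `wanted` as 1967, 1952, 1967, 1952,
# so in this case nothing matches
eq_result <- years == wanted     # FALSE FALSE FALSE FALSE
```

With ==, whether a row is kept depends on the order of the values in `wanted`, and no warning is given because the lengths divide evenly.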

+ +

Tip: Pass by value +


Functions in R almost always make copies of the data to operate on +inside of a function body. When we modify dat inside the +function we are modifying the copy of the gapminder dataset stored in +dat, not the original variable we gave as the first +argument.


This is called “pass-by-value” and it makes writing code much safer: +you can always be sure that whatever changes you make within the body of +the function, stay inside the body of the function.
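A minimal sketch of this behaviour, using a made-up data frame and a hypothetical function `add_doubled()`:

```r
# Hypothetical function that adds a column to the data frame it is given
add_doubled <- function(dat) {
  dat$doubled <- dat$value * 2   # changes the function's copy only
  return(dat)
}

original <- data.frame(value = 1:3)
modified <- add_doubled(original)

names(original)   # "value": the original is untouched
names(modified)   # "value" "doubled"
```

To keep the change, the result has to be assigned to a variable, as `modified` is here.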

+ +

Tip: Function scope +


Another important concept is scoping: any variables (or functions!) +you create or modify inside the body of a function only exist for the +lifetime of the function’s execution. When we call +calcGDP(), the variables dat, gdp +and new only exist inside the body of the function. Even if +we have variables of the same name in our interactive R session, they +are not modified in any way when executing a function.
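Scoping can be seen in a few lines. In this sketch the function `compute_gdp()` and the variable `local_only` are made up for illustration:

```r
gdp <- "defined in the interactive session"

# Hypothetical function that creates its own variables called gdp and local_only
compute_gdp <- function() {
  gdp <- 42            # local variable; hides the session's gdp inside the function
  local_only <- TRUE   # exists only while the function runs
  return(gdp)
}

result <- compute_gdp()   # 42

# The session's gdp is unchanged, and local_only does not exist out here
still_original <- identical(gdp, "defined in the interactive session")
leaked <- exists("local_only")
```

After running this, `still_original` is TRUE and `leaked` is FALSE, provided no variable called `local_only` was already defined in the session.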


R +

  gdp <- dat$pop * dat$gdpPercap
+  new <- cbind(dat, gdp=gdp)
+  return(new)

Finally, we calculated the GDP on our new subset, and created a new +data frame with that column added. This means when we call the function +later we can see the context for the returned GDP values, which is much +better than in our first attempt where we got a vector of numbers.

+ +

Challenge 4 +


Test out your GDP function by calculating the GDP for New Zealand in +1987. How does this differ from New Zealand’s GDP in 1952?

+ +

R +

+  calcGDP(gapminder, year = c(1952, 1987), country = "New Zealand")

GDP for New Zealand in 1987: 65050008703


GDP for New Zealand in 1952: 21058193787

+ +

Challenge 5 +


The paste() function can be used to combine text +together, e.g:


R +

+best_practice <- c("Write", "programs", "for", "people", "not", "computers")
+paste(best_practice, collapse=" ")


[1] "Write programs for people not computers"

Write a function called fence() that takes two vectors +as arguments, called text and wrapper, and +prints out the text wrapped with the wrapper:


R +

+fence(text=best_practice, wrapper="***")

Note: the paste() function has an argument called sep, which specifies the separator between text. The default is a space: " ". The default for paste0() is no space: "".

+ +

Write a function called fence() that takes two vectors +as arguments, called text and wrapper, and +prints out the text wrapped with the wrapper:


R +

+fence <- function(text, wrapper){
+  text <- c(wrapper, text, wrapper)
+  result <- paste(text, collapse = " ")
+  return(result)
+}
+best_practice <- c("Write", "programs", "for", "people", "not", "computers")
+fence(text=best_practice, wrapper="***")


[1] "*** Write programs for people not computers ***"
+ +

Tip +


R has some unique aspects that can be exploited when performing more +complicated operations. We will not be writing anything that requires +knowledge of these more advanced concepts. In the future when you are +comfortable writing functions in R, you can learn more by reading the R +Language Manual or this chapter from Advanced R Programming by Hadley +Wickham.

+ +

Tip: Testing and documenting +


It’s important to both test functions and document them: documentation helps you, and others, understand what the purpose of your function is and how to use it, and it’s important to make sure that your function actually does what you think.


When you first start out, your workflow will probably look a lot like +this:

  1. Write a function
  2. Comment parts of the function to document its behaviour
  3. Load in the source file
  4. Experiment with it in the console to make sure it behaves as you expect
  5. Make any necessary bug fixes
  6. Rinse and repeat.

Formal documentation for functions, written in separate +.Rd files, gets turned into the documentation you see in +help files. The roxygen2 +package allows R coders to write documentation alongside the function +code and then process it into the appropriate .Rd files. +You will want to switch to this more formal method of writing +documentation when you start writing more complicated R projects. In +fact, packages are, in essence, bundles of functions with this formal +documentation. Loading your own functions through +source("functions.R") is equivalent to loading someone +else’s functions (or your own one day!) through +library("package").
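As a rough sketch of what roxygen2-style documentation looks like, here is a small hypothetical function (`add_one()` is not part of the lesson) with `#'` comments above it. In a plain script these lines are ordinary comments; roxygen2 only turns them into `.Rd` files when the function lives in a package:

```r
#' Add one to a number
#'
#' @param x A numeric vector.
#' @return A numeric vector the same length as `x`.
#' @examples
#' add_one(1:3)
add_one <- function(x) {
  stopifnot(is.numeric(x))  # fail early on non-numeric input
  x + 1
}
```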


Formal automated tests can be written using the testthat package.
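A minimal sketch of such a test, assuming the testthat package is installed. The compact `fence()` below is a shortened variant of the Challenge 5 solution, and the test description is only an illustration:

```r
library(testthat)

# A compact version of the fence() function from Challenge 5
fence <- function(text, wrapper) {
  paste(c(wrapper, text, wrapper), collapse = " ")
}

# Each expect_equal() compares an actual result with the expected one
test_that("fence() wraps text on both sides", {
  expect_equal(fence("hello", "***"), "*** hello ***")
  expect_equal(fence(c("a", "b"), "|"), "| a b |")
})
```

If an expectation fails, testthat reports which one and how the values differ, which is easier to act on than checking output by eye in the console.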

+ +

Keypoints +

  • Use function to define a new function in R.
  • Use parameters to pass values into functions.
  • Use stopifnot() to flexibly check function arguments in R.
  • Load functions into programs using source().
+ + +
+ +
Back To Top +
+ + diff --git a/11-writing-data.html b/11-writing-data.html new file mode 100644 index 000000000..0aee86219 --- /dev/null +++ b/11-writing-data.html @@ -0,0 +1,687 @@ + +R for Reproducible Scientific Analysis: Writing Data +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Writing Data


Last updated on 2023-10-26

+ + + +
+ +
+ + + +




  • How can I save plots and data created in R?
  • +


  • To be able to write out plots and data from R.
  • +

Saving plots +


You have already seen how to save the most recent plot you create in +ggplot2, using the command ggsave. As a +refresher:
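A minimal sketch of that call, assuming ggplot2 is installed. The toy data frame, the plot and the file name are placeholders for illustration only:

```r
library(ggplot2)

# Toy data standing in for a real dataset
toy <- data.frame(year = c(1952, 1957, 1962), lifeExp = c(30, 35, 40))

p <- ggplot(data = toy, aes(x = year, y = lifeExp)) + geom_line()
print(p)  # draw the plot so it becomes the most recent one

# With no `plot` argument, ggsave() writes the most recent plot
ggsave("My_most_recent_plot.pdf", width = 6, height = 4)
```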


R +


You can save a plot from within RStudio using the ‘Export’ button in +the ‘Plot’ window. This will give you the option of saving as a .pdf or +as .png, .jpg or other image formats.


Sometimes you will want to save plots without creating them in the +‘Plot’ window first. Perhaps you want to make a pdf document with +multiple pages: each one a different plot, for example. Or perhaps +you’re looping through multiple subsets of a file, plotting data from +each subset, and you want to save each plot, but obviously can’t stop +the loop to click ‘Export’ for each one.


In this case you can use a more flexible approach. The function +pdf creates a new pdf device. You can control the size and +resolution using the arguments to this function.


R +

+pdf("Life_Exp_vs_time.pdf", width=12, height=4)
+ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=country)) +
+  geom_line() +
+  theme(legend.position = "none")
+# You then have to make sure to turn off the pdf device!
+dev.off()

Open up this document and have a look.

+ +

Challenge 1 +


Rewrite your ‘pdf’ command to print a second page in the pdf, showing +a facet plot (hint: use facet_grid) of the same data with +one panel per continent.

+ +

R +

+pdf("Life_Exp_vs_time.pdf", width = 12, height = 4)
+p <- ggplot(data = gapminder, aes(x = year, y = lifeExp, colour = country)) +
+  geom_line() +
+  theme(legend.position = "none")
+# First page: the original plot
+p
+# Second page: one panel per continent
+p + facet_grid(~continent)
+dev.off()

The commands jpeg, png etc. are used +similarly to produce documents in different formats.
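For example, a sketch of the same open-plot-close pattern with `png()`, using a base R plot, made-up values and a temporary file so it can be run anywhere:

```r
# Write a base R plot to a PNG in a temporary location
out_file <- file.path(tempdir(), "example_plot.png")

png(out_file, width = 800, height = 400)  # size in pixels for png()
plot(c(1952, 1957, 1962), c(30, 35, 40), type = "l",
     xlab = "year", ylab = "lifeExp")
dev.off()  # close the device so the file is actually written

file.exists(out_file)
```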


Writing data +


At some point, you’ll also want to write out data from R.


We can use the write.table function for this, which is +very similar to read.table from before.


Let’s create a data-cleaning script. For this analysis, we only want to focus on the gapminder data for Australia:


R +

+aust_subset <- gapminder[gapminder$country == "Australia",]
+
+write.table(aust_subset,
+  file="cleaned-data/gapminder-aus.csv",
+  sep=","
+)

Let’s switch back to the shell to take a look at the data to make +sure it looks OK:



head cleaned-data/gapminder-aus.csv



Hmm, that’s not quite what we wanted. Where did all these quotation +marks come from? Also the row numbers are meaningless.


Let’s look at the help file to work out how to change this +behaviour.


R +

+?write.table


By default R will wrap character vectors with quotation marks when +writing out to file. It will also write out the row and column +names.


Let’s fix this:


R +

+write.table(
+  gapminder[gapminder$country == "Australia",],
+  file="cleaned-data/gapminder-aus.csv",
+  sep=",", quote=FALSE, row.names=FALSE
+)

Now let’s look at the data again using our shell skills:



head cleaned-data/gapminder-aus.csv



That looks better!
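To see the effect of `quote` and `row.names` without leaving R, the sketch below writes a tiny made-up data frame to two temporary files and reads the raw lines back:

```r
toy <- data.frame(country = c("Australia", "Australia"),
                  year = c(1952, 1957))

default_file <- tempfile(fileext = ".csv")
clean_file   <- tempfile(fileext = ".csv")

# Defaults: strings are quoted and row names are written
write.table(toy, file = default_file, sep = ",")

# Quotes and row names switched off
write.table(toy, file = clean_file, sep = ",",
            quote = FALSE, row.names = FALSE)

default_lines <- readLines(default_file)
clean_lines   <- readLines(clean_file)

default_lines[1]  # header with quoted column names
clean_lines[1]    # "country,year"
clean_lines[2]    # "Australia,1952"
```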

+ +

Challenge 2 +


Write a data-cleaning script file that subsets the gapminder data to +include only data points collected since 1990.


Use this script to write out the new subset to a file in the +cleaned-data/ directory.

+ +

R +

+write.table(
+  gapminder[gapminder$year > 1990, ],
+  file = "cleaned-data/gapminder-after1990.csv",
+  sep = ",", quote = FALSE, row.names = FALSE
+)
+ +

Keypoints +

  • Save plots from RStudio using the ‘Export’ button.
  • Use write.table to save tabular data.
+ + +
+ +
Back To Top +
+ + diff --git a/12-plyr.html b/12-plyr.html new file mode 100644 index 000000000..7b00811df --- /dev/null +++ b/12-plyr.html @@ -0,0 +1,1011 @@ + +R for Reproducible Scientific Analysis: Splitting and Combining Data Frames with plyr +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Splitting and Combining Data Frames with plyr


Last updated on 2023-10-26

+ + + +
+ +
+ + + +




  • How can I do different calculations on different sets of data?
  • +


  • To be able to use the split-apply-combine strategy for data +analysis.
  • +

Previously we looked at how you can use functions to simplify your +code. We defined the calcGDP function, which takes the +gapminder dataset, and multiplies the population and GDP per capita +column. We also defined additional arguments so we could filter by +year and country:


R +

+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year=NULL, country=NULL) {
+  if(!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country,]
+  }
+  gdp <- dat$pop * dat$gdpPercap
+  new <- cbind(dat, gdp=gdp)
+  return(new)
+}

A common task you’ll encounter when working with data is running calculations on different groups within the data. In the above, we were calculating the GDP by multiplying two columns together. But what if we wanted to calculate the mean GDP per continent?


We could run calcGDP and then take the mean of each +continent:


R +

+withGDP <- calcGDP(gapminder)
+mean(withGDP[withGDP$continent == "Africa", "gdp"])


[1] 20904782844

R +

+mean(withGDP[withGDP$continent == "Americas", "gdp"])


[1] 379262350210

R +

+mean(withGDP[withGDP$continent == "Asia", "gdp"])


[1] 227233738153

But this isn’t very nice. Yes, by using a function, you have +reduced a substantial amount of repetition. That is +nice. But there is still repetition. Repeating yourself will cost you +time, both now and later, and potentially introduce some nasty bugs.


We could write a new function that is flexible like +calcGDP, but this also takes a substantial amount of effort +and testing to get right.


The abstract problem we’re encountering here is known as “split-apply-combine”:

Split apply combine

We want to split our data into groups, in this case +continents, apply some calculations on that group, then +optionally combine the results together afterwards.
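The three steps can be sketched in base R on a tiny made-up data frame (the values are not from gapminder); `split()` does the splitting and `sapply()` both applies the function and combines the results:

```r
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  gdp = c(10, 20, 100, 300)
)

# Split: one vector of gdp values per continent
pieces <- split(toy$gdp, toy$continent)

# Apply: take the mean of each piece
# Combine: sapply() gathers the results into a named vector
means <- sapply(pieces, mean)
means
```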


The plyr package +


For those of you who have used R before, you might be familiar with +the apply family of functions. While R’s built in functions +do work, we’re going to introduce you to another method for solving the +“split-apply-combine” problem. The plyr package provides a set of +functions that we find more user friendly for solving this problem.


We installed this package in an earlier challenge. Let us load it +now:


R +

+library("plyr")


Plyr has functions for operating on lists, data.frames and arrays (matrices, or n-dimensional vectors). Each function performs:

  1. A splitting operation.
  2. Application of a function to each split in turn.
  3. Recombination of the output data into a single data object.

The functions are named based on the data structure they expect as +input, and the data structure you want returned as output: [a]rray, +[l]ist, or [d]ata.frame. The first letter corresponds to the input data +structure, the second letter to the output data structure, and then the +rest of the function is named “ply”.


This gives us 9 core functions **ply. There are an additional three +functions which will only perform the split and apply steps, and not any +combine step. They’re named by their input data type and represent null +output by a _ (see table)


Note here that plyr’s use of “array” is different to R’s: an array in plyr can include a vector or matrix.

Full apply suite
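A small sketch of the naming scheme, assuming plyr is installed. The input list and the squaring function are made up; only the first two letters of each function name change:

```r
library(plyr)

squares_in <- list(a = 1, b = 2, c = 3)

# llply: list in, list out
as_list <- llply(squares_in, function(v) v^2)

# laply: list in, array (here a plain vector) out
as_vector <- laply(squares_in, function(v) v^2)

# ldply: list in, data.frame out
as_df <- ldply(squares_in, function(v) data.frame(square = v^2))
```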

Each of the xxply functions (daply, ddply, llply, laply, …) has the same structure, with 4 key features:


R +

+xxply(.data, .variables, .fun)
  • The first letter of the function name gives the input type and the second gives the output type.
  • .data - gives the data object to be processed
  • .variables - identifies the splitting variables
  • .fun - gives the function to be called on each piece

Now we can quickly calculate the mean GDP per continent:


R +

+ddply(
+ .data = calcGDP(gapminder),
+ .variables = "continent",
+ .fun = function(x) mean(x$gdp)
+)


  continent           V1
+1    Africa  20904782844
+2  Americas 379262350210
+3      Asia 227233738153
+4    Europe 269442085301
+5   Oceania 188187105354

Let us walk through the previous code:

  • The ddply function feeds in a data.frame (function starts with d) and returns another data.frame (2nd letter is a d).
  • The first argument we gave was the data.frame we wanted to operate on: in this case the gapminder data. We called calcGDP on it first so that it would have the additional gdp column added to it.
  • The second argument indicated our split criteria: in this case the “continent” column. Note that we gave the name of the column, not the values of the column like we had done previously with subsetting. Plyr takes care of these implementation details for you.
  • The third argument is the function we want to apply to each grouping of the data. We had to define our own short function here: each subset of the data gets stored in x, the first argument of our function. This is an anonymous function: we haven’t defined it elsewhere, and it has no name. It only exists in the scope of our call to ddply.
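To make the role of the third argument concrete, the sketch below passes a named function (the hypothetical `mean_gdp()`) and then the equivalent anonymous function, on a tiny made-up data frame. It assumes plyr is installed:

```r
library(plyr)

toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  gdp = c(10, 20, 100, 300)
)

# A named function that receives one continent's rows at a time
mean_gdp <- function(x) mean(x$gdp)

by_name <- ddply(.data = toy, .variables = "continent", .fun = mean_gdp)

# The same call with an anonymous function gives the same result
by_anon <- ddply(.data = toy, .variables = "continent",
                 .fun = function(x) mean(x$gdp))
```

A named function is worth the extra line when the same calculation is reused in several calls; for one-off summaries the anonymous form keeps everything in one place.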
+ +

Challenge 1 +


Calculate the average life expectancy per continent. Which has the +longest? Which has the shortest?

+ +

R +

+ddply(
+ .data = gapminder,
+ .variables = "continent",
+ .fun = function(x) mean(x$lifeExp)
+)

Oceania has the longest and Africa the shortest.


What if we want a different type of output data structure?:


R +

+dlply(
+ .data = calcGDP(gapminder),
+ .variables = "continent",
+ .fun = function(x) mean(x$gdp)
+)


$Africa
+[1] 20904782844
+
+$Americas
+[1] 379262350210
+
+$Asia
+[1] 227233738153
+
+$Europe
+[1] 269442085301
+
+$Oceania
+[1] 188187105354
+
+attr(,"split_type")
+[1] "data.frame"
+attr(,"split_labels")
+  continent
+1    Africa
+2  Americas
+3      Asia
+4    Europe
+5   Oceania

We called the same function again, but changed the second letter to +an l, so the output was returned as a list.


We can specify multiple columns to group by:


R +

+ddply(
+ .data = calcGDP(gapminder),
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$gdp)
+)


   continent year           V1
+1     Africa 1952   5992294608
+2     Africa 1957   7359188796
+3     Africa 1962   8784876958
+4     Africa 1967  11443994101
+5     Africa 1972  15072241974
+6     Africa 1977  18694898732
+7     Africa 1982  22040401045
+8     Africa 1987  24107264108
+9     Africa 1992  26256977719
+10    Africa 1997  30023173824
+11    Africa 2002  35303511424
+12    Africa 2007  45778570846
+13  Americas 1952 117738997171
+14  Americas 1957 140817061264
+15  Americas 1962 169153069442
+16  Americas 1967 217867530844
+17  Americas 1972 268159178814
+18  Americas 1977 324085389022
+19  Americas 1982 363314008350
+20  Americas 1987 439447790357
+21  Americas 1992 489899820623
+22  Americas 1997 582693307146
+23  Americas 2002 661248623419
+24  Americas 2007 776723426068
+25      Asia 1952  34095762661
+26      Asia 1957  47267432088
+27      Asia 1962  60136869012
+28      Asia 1967  84648519224
+29      Asia 1972 124385747313
+30      Asia 1977 159802590186
+31      Asia 1982 194429049919
+32      Asia 1987 241784763369
+33      Asia 1992 307100497486
+34      Asia 1997 387597655323
+35      Asia 2002 458042336179
+36      Asia 2007 627513635079
+37    Europe 1952  84971341466
+38    Europe 1957 109989505140
+39    Europe 1962 138984693095
+40    Europe 1967 173366641137
+41    Europe 1972 218691462733
+42    Europe 1977 255367522034
+43    Europe 1982 279484077072
+44    Europe 1987 316507473546
+45    Europe 1992 342703247405
+46    Europe 1997 383606933833
+47    Europe 2002 436448815097
+48    Europe 2007 493183311052
+49   Oceania 1952  54157223944
+50   Oceania 1957  66826828013
+51   Oceania 1962  82336453245
+52   Oceania 1967 105958863585
+53   Oceania 1972 134112109227
+54   Oceania 1977 154707711162
+55   Oceania 1982 176177151380
+56   Oceania 1987 209451563998
+57   Oceania 1992 236319179826
+58   Oceania 1997 289304255183
+59   Oceania 2002 345236880176
+60   Oceania 2007 403657044512

R +

+daply(
+ .data = calcGDP(gapminder),
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$gdp)
+)


+continent          1952         1957         1962         1967         1972
+  Africa     5992294608   7359188796   8784876958  11443994101  15072241974
+  Americas 117738997171 140817061264 169153069442 217867530844 268159178814
+  Asia      34095762661  47267432088  60136869012  84648519224 124385747313
+  Europe    84971341466 109989505140 138984693095 173366641137 218691462733
+  Oceania   54157223944  66826828013  82336453245 105958863585 134112109227
+          year
+continent          1977         1982         1987         1992         1997
+  Africa    18694898732  22040401045  24107264108  26256977719  30023173824
+  Americas 324085389022 363314008350 439447790357 489899820623 582693307146
+  Asia     159802590186 194429049919 241784763369 307100497486 387597655323
+  Europe   255367522034 279484077072 316507473546 342703247405 383606933833
+  Oceania  154707711162 176177151380 209451563998 236319179826 289304255183
+          year
+continent          2002         2007
+  Africa    35303511424  45778570846
+  Americas 661248623419 776723426068
+  Asia     458042336179 627513635079
+  Europe   436448815097 493183311052
+  Oceania  345236880176 403657044512

You can use these functions in place of for loops (and +it is usually faster to do so). To replace a for loop, put the code that +was in the body of the for loop inside an anonymous +function.


R +

+d_ply(
+  .data=gapminder,
+  .variables = "continent",
+  .fun = function(x) {
+    meanGDPperCap <- mean(x$gdpPercap)
+    print(paste(
+      "The mean GDP per capita for", unique(x$continent),
+      "is", format(meanGDPperCap, big.mark=",")
+   ))
+  }
+)


[1] "The mean GDP per capita for Africa is 2,193.755"
+[1] "The mean GDP per capita for Americas is 7,136.11"
+[1] "The mean GDP per capita for Asia is 7,902.15"
+[1] "The mean GDP per capita for Europe is 14,469.48"
+[1] "The mean GDP per capita for Oceania is 18,621.61"
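For comparison, here is a sketch of the same kind of report written as an explicit for loop in base R, on a tiny made-up data frame. The loop has to do the subsetting by hand, which is the part the plyr call handles for us:

```r
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  gdpPercap = c(1000, 3000, 5000, 7000)
)

messages <- character(0)
for (cont in unique(toy$continent)) {
  # Subset by hand: keep only the rows for this continent
  x <- toy[toy$continent == cont, ]
  meanGDPperCap <- mean(x$gdpPercap)
  msg <- paste("The mean GDP per capita for", cont,
               "is", format(meanGDPperCap, big.mark = ","))
  print(msg)
  messages <- c(messages, msg)
}
```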
+ +

Tip: printing numbers +


The format function can be used to make numeric values +“pretty” for printing out in messages.

+ +

Challenge 2 +


Calculate the average life expectancy per continent and year. Which +had the longest and shortest in 2007? Which had the greatest change in +between 1952 and 2007?

+ +

R +

+solution <- ddply(
+ .data = gapminder,
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$lifeExp)
+)
+solution_2007 <- solution[solution$year == 2007, ]

Oceania had the longest average life expectancy in 2007 and Africa the shortest.


R +

+solution_1952_2007 <- cbind(solution[solution$year == 1952, ], solution_2007)
+# Columns 3 and 6 hold the mean life expectancy for 1952 and 2007
+difference_1952_2007 <- data.frame(continent = solution_1952_2007$continent,
+                                   year_1952 = solution_1952_2007[[3]],
+                                   year_2007 = solution_1952_2007[[6]],
+                                   difference = solution_1952_2007[[6]] - solution_1952_2007[[3]])

Asia had the greatest difference, and Oceania the least.

+ +

Alternate Challenge +


Without running them, which of the following will calculate the +average life expectancy per continent:

  1. +

R +

+ddply(
+  .data = gapminder,
+  .variables = gapminder$continent,
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)
  2. +

R +

+ddply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = mean(dataGroup$lifeExp)
+)
  3. +

R +

+ddply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)
  4. +

R +

+adply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)
+ +

Answer 3 will calculate the average life expectancy per +continent.

+ +

Keypoints +

  • Use the plyr package to split data, apply functions to subsets, and combine the results.
+ + +
+ +
Back To Top +
+ + diff --git a/13-dplyr.html b/13-dplyr.html new file mode 100644 index 000000000..cdef25e5a --- /dev/null +++ b/13-dplyr.html @@ -0,0 +1,1239 @@ + +R for Reproducible Scientific Analysis: Data Frame Manipulation with dplyr +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Data Frame Manipulation with dplyr


Last updated on 2023-10-26

+ + + +
+ +
+ + + +




  • How can I manipulate data frames without repeating myself?
  • +


  • To be able to use the six main data frame manipulation ‘verbs’ with +pipes in dplyr.
  • +
  • To understand how group_by() and +summarize() can be combined to summarize datasets.
  • +
  • Be able to analyze a subset of data using logical filtering.
  • +

Manipulation of data frames means many things to many researchers: we +often select certain observations (rows) or variables (columns), we +often group the data by a certain variable(s), or we even calculate +summary statistics. We can do these operations using the normal base R +operations:


R +

+mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])


[1] 2193.755

R +

+mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])


[1] 7136.11

R +

+mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])


[1] 7902.15

But this isn’t very nice because there is a fair bit of +repetition. Repeating yourself will cost you time, both now and later, +and potentially introduce some nasty bugs.


The dplyr package +


Luckily, the dplyr +package provides a number of very useful functions for manipulating data +frames in a way that will reduce the above repetition, reduce the +probability of making errors, and probably even save you some typing. As +an added bonus, you might even find the dplyr grammar +easier to read.

+ +

Tip: Tidyverse +


The dplyr package belongs to a broader family of opinionated R packages designed for data science called the “Tidyverse”. These packages are specifically designed to work harmoniously together. Some of these packages will be covered in this course, but you can find more complete information here: https://www.tidyverse.org/.


Here we’re going to cover 5 of the most commonly used functions as +well as using pipes (%>%) to combine them.

  1. select()
  2. filter()
  3. group_by()
  4. summarize()
  5. mutate()

If you have not installed this package earlier, please do so:


R +

+install.packages('dplyr')


Now let’s load the package:


R +

+library("dplyr")


Using select() +


If, for example, we wanted to move forward with only a few of the +variables in our data frame we could use the select() +function. This will keep only the variables you select.


R +

+year_country_gdp <- select(gapminder, year, country, gdpPercap)

Diagram illustrating use of select function to select two columns of a data frame

If we want to remove a single column from the gapminder data, for example the continent column, we can prefix its name with a minus sign:


R +

+smaller_gapminder_data <- select(gapminder, -continent)

If we open up year_country_gdp we’ll see that it only +contains the year, country and gdpPercap. Above we used ‘normal’ +grammar, but the strengths of dplyr lie in combining +several functions using pipes. Since the pipes grammar is unlike +anything we’ve seen in R before, let’s repeat what we’ve done above +using pipes.


R +

+year_country_gdp <- gapminder %>% select(year, country, gdpPercap)

To help you understand why we wrote that in that way, let’s walk through it step by step. First we summon the gapminder data frame and pass it on, using the pipe symbol %>%, to the next step, which is the select() function. In this case we don’t specify which data object we use in the select() function since it gets that from the previous pipe. Fun Fact: There is a good chance you have encountered pipes before in the shell. In R, a pipe symbol is %>% while in the shell it is | but the concept is the same!
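A small sketch of that equivalence, assuming dplyr is loaded; the toy data frame is made up. The pipe simply supplies its left-hand side as the first argument of the function on its right:

```r
library(dplyr)

toy <- data.frame(year = c(1952, 1957), country = c("A", "B"),
                  pop = c(10, 20))

# These two calls are equivalent: the pipe passes `toy`
# as the first argument of select()
nested <- select(toy, year, country)
piped  <- toy %>% select(year, country)

identical(nested, piped)
```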

+ +

Tip: Renaming data frame columns in dplyr +


In Chapter 4 we covered how you can rename columns with base R by +assigning a value to the output of the names() function. +Just like select, this is a bit cumbersome, but thankfully dplyr has a +rename() function.


Within a pipeline, the syntax is +rename(new_name = old_name). For example, we may want to +rename the gdpPercap column name from our select() +statement above.


R +

+tidy_gdp <- year_country_gdp %>% rename(gdp_per_capita = gdpPercap)
+
+head(tidy_gdp)


  year     country gdp_per_capita
+1 1952 Afghanistan       779.4453
+2 1957 Afghanistan       820.8530
+3 1962 Afghanistan       853.1007
+4 1967 Afghanistan       836.1971
+5 1972 Afghanistan       739.9811
+6 1977 Afghanistan       786.1134

Using filter() +


If we now want to move forward with the above, but only with European countries, we can combine select and filter:


R +

+year_country_gdp_euro <- gapminder %>%
+    filter(continent == "Europe") %>%
+    select(year, country, gdpPercap)

If we now want to show life expectancy of European countries but only +for a specific year (e.g., 2007), we can do as below.


R +

+europe_lifeExp_2007 <- gapminder %>%
+  filter(continent == "Europe", year == 2007) %>%
+  select(country, lifeExp)
+ +

Challenge 1 +


Write a single command (which can span multiple lines and includes +pipes) that will produce a data frame that has the African values for +lifeExp, country and year, but +not for other Continents. How many rows does your data frame have and +why?

+ +

R +

+year_country_lifeExp_Africa <- gapminder %>%
+                           filter(continent == "Africa") %>%
+                           select(year, country, lifeExp)

As with last time, first we pass the gapminder data frame to the +filter() function, then we pass the filtered version of the +gapminder data frame to the select() function. +Note: The order of operations is very important in this +case. If we used ‘select’ first, filter would not be able to find the +variable continent since we would have removed it in the previous +step.
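The sketch below shows the ordering problem on a tiny made-up data frame, assuming dplyr is loaded. The `tryCatch()` wrapper is only there so the failing pipeline does not stop the script:

```r
library(dplyr)

toy <- data.frame(
  continent = c("Africa", "Europe"),
  country = c("Kenya", "France"),
  lifeExp = c(54, 80)
)

# filter() first, then select(): works
ok <- toy %>%
  filter(continent == "Africa") %>%
  select(country, lifeExp)

# select() first drops `continent`, so filter() raises an error;
# tryCatch() captures it here so the script keeps running
failed <- tryCatch(
  toy %>% select(country, lifeExp) %>% filter(continent == "Africa"),
  error = function(e) "error"
)
```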


Using group_by() +


Now, we were supposed to be reducing the error-prone repetitiveness of what can be done with base R, but up to now we haven’t done that since we would have to repeat the above for each continent. Instead of filter(), which will only pass observations that meet your criteria (in the above: continent=="Europe"), we can use group_by(), which will essentially use every unique value of the grouping variable that you could have used in filter().


R +

+str(gapminder)



'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...

R +

+str(gapminder %>% group_by(continent))


gropd_df [1,704 × 6] (S3: grouped_df/tbl_df/tbl/data.frame)
+ $ country  : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num [1:1704] 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr [1:1704] "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
+ - attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
+  ..$ continent: chr [1:5] "Africa" "Americas" "Asia" "Europe" ...
+  ..$ .rows    : list<int> [1:5] 
+  .. ..$ : int [1:624] 25 26 27 28 29 30 31 32 33 34 ...
+  .. ..$ : int [1:300] 49 50 51 52 53 54 55 56 57 58 ...
+  .. ..$ : int [1:396] 1 2 3 4 5 6 7 8 9 10 ...
+  .. ..$ : int [1:360] 13 14 15 16 17 18 19 20 21 22 ...
+  .. ..$ : int [1:24] 61 62 63 64 65 66 67 68 69 70 ...
+  .. ..@ ptype: int(0) 
+  ..- attr(*, ".drop")= logi TRUE

You will notice that the structure of the data frame where we used group_by() (grouped_df) is not the same as the original gapminder (data.frame). A grouped_df can be thought of as a list where each item in the list is a data.frame which contains only the rows that correspond to a particular value of continent (at least in the example above).

Diagram illustrating how the group by function organizes a data frame into groups
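A few dplyr helpers make the grouping visible without reading the full `str()` output. This is a sketch on a made-up data frame, assuming dplyr is loaded:

```r
library(dplyr)

toy <- data.frame(
  continent = c("Africa", "Africa", "Asia"),
  lifeExp = c(50, 54, 70)
)

grouped <- toy %>% group_by(continent)

n_groups(grouped)     # number of groups
group_keys(grouped)   # one row per distinct continent
group_size(grouped)   # number of rows in each group

# The data itself is unchanged: same rows, same columns
dim(grouped)
```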

Using summarize() +


The above was a bit on the uneventful side but +group_by() is much more exciting in conjunction with +summarize(). This will allow us to create new variable(s) +by using functions that repeat for each of the continent-specific data +frames. That is to say, using the group_by() function, we +split our original data frame into multiple pieces, then we can run +functions (e.g. mean() or sd()) within +summarize().


R +

+gdp_bycontinents <- gapminder %>%
+    group_by(continent) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap))
Diagram illustrating the use of group by and summarize together to create a new variable

R +

continent mean_gdpPercap
+     <fctr>          <dbl>
+1    Africa       2193.755
+2  Americas       7136.110
+3      Asia       7902.150
+4    Europe      14469.476
+5   Oceania      18621.609

That allowed us to calculate the mean gdpPercap for each continent, +but it gets even better.

+ +

Challenge 2 +


Calculate the average life expectancy per country. Which has the +longest average life expectancy and which has the shortest average life +expectancy?

+ +

R +

+lifeExp_bycountry <- gapminder %>%
+   group_by(country) %>%
+   summarize(mean_lifeExp = mean(lifeExp))
+lifeExp_bycountry %>%
+   filter(mean_lifeExp == min(mean_lifeExp) | mean_lifeExp == max(mean_lifeExp))


# A tibble: 2 × 2
+  country      mean_lifeExp
+  <chr>               <dbl>
+1 Iceland              76.5
+2 Sierra Leone         36.8

Another way to do this is to use the dplyr function +arrange(), which arranges the rows in a data frame +according to the order of one or more variables from the data frame. It +has similar syntax to other functions from the dplyr +package. You can use desc() inside arrange() +to sort in descending order.


R +

+lifeExp_bycountry %>%
+   arrange(mean_lifeExp) %>%
+   head(1)


# A tibble: 1 × 2
+  country      mean_lifeExp
+  <chr>               <dbl>
+1 Sierra Leone         36.8

R +

+lifeExp_bycountry %>%
+   arrange(desc(mean_lifeExp)) %>%
+   head(1)


# A tibble: 1 × 2
+  country mean_lifeExp
+  <chr>          <dbl>
+1 Iceland         76.5

Alphabetical order works too:


R +

+lifeExp_bycountry %>%
+   arrange(desc(country)) %>%
+   head(1)


# A tibble: 1 × 2
+  country  mean_lifeExp
+  <chr>           <dbl>
+1 Zimbabwe         52.7

The function group_by() allows us to group by multiple +variables. Let’s group by year and +continent.


R +

+gdp_bycontinents_byyear <- gapminder %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
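The message refers to the `.groups` argument of `summarize()`. As a sketch on made-up data (assuming a dplyr version that prints this message), passing `.groups = "drop"` returns an ungrouped result and silences it:

```r
library(dplyr)

toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  year = c(1952, 1957, 1952, 1957),
  gdpPercap = c(1000, 1200, 800, 900)
)

# .groups = "drop" removes all grouping from the result
by_cont_year <- toy %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap), .groups = "drop")

is_grouped_df(by_cont_year)
```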

That is already quite powerful, but it gets even better! You’re not +limited to defining 1 new variable in summarize().


R +

+gdp_pop_bycontinents_byyear <- gapminder %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.

count() and n() +


A very common operation is to count the number of observations for +each group. The dplyr package comes with two related +functions that help with this.


For instance, if we wanted to check the number of countries included +in the dataset for the year 2002, we can use the count() +function. It takes the name of one or more columns that contain the +groups we are interested in, and we can optionally sort the results in +descending order by adding sort=TRUE:


R +

+gapminder %>%
+    filter(year == 2002) %>%
+    count(continent, sort = TRUE)


  continent  n
+1    Africa 52
+2      Asia 33
+3    Europe 30
+4  Americas 25
+5   Oceania  2

If we need to use the number of observations in calculations, the n() function is useful. It will return the total number of observations in the current group rather than counting the number of observations in each group within a specific column. For instance, if we wanted to get the standard error of the life expectancy per continent:


R +

+gapminder %>%
+    group_by(continent) %>%
+    summarize(se_le = sd(lifeExp)/sqrt(n()))


# A tibble: 5 × 2
+  continent se_le
+  <chr>     <dbl>
+1 Africa    0.366
+2 Americas  0.540
+3 Asia      0.596
+4 Europe    0.286
+5 Oceania   0.775
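To see how the two functions relate, the sketch below counts rows per group both ways on a tiny made-up data frame, assuming dplyr is loaded:

```r
library(dplyr)

toy <- data.frame(
  continent = c("Africa", "Africa", "Asia"),
  lifeExp = c(50, 54, 70)
)

# count() groups, counts and ungroups in one step
with_count <- toy %>% count(continent)

# The same counts, spelled out with group_by() and n()
with_n <- toy %>%
  group_by(continent) %>%
  summarize(n = n())

# Same counts either way
all.equal(with_count$n, with_n$n)
```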

You can also chain together several summary operations; in this case +calculating the minimum, maximum, +mean and se of each continent’s per-country +life-expectancy:


R +

+gapminder %>%
+    group_by(continent) %>%
+    summarize(
+      mean_le = mean(lifeExp),
+      min_le = min(lifeExp),
+      max_le = max(lifeExp),
+      se_le = sd(lifeExp)/sqrt(n()))


# A tibble: 5 × 5
+  continent mean_le min_le max_le se_le
+  <chr>       <dbl>  <dbl>  <dbl> <dbl>
+1 Africa       48.9   23.6   76.4 0.366
+2 Americas     64.7   37.6   80.7 0.540
+3 Asia         60.1   28.8   82.6 0.596
+4 Europe       71.9   43.6   81.8 0.286
+5 Oceania      74.3   69.1   81.2 0.775

Using mutate() +


We can also create new variables prior to (or even after) summarizing +information using mutate().


R +

+gdp_pop_bycontinents_byyear <- gapminder %>%
+    mutate(gdp_billion = gdpPercap*pop/10^9) %>%
+    group_by(continent,year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop),
+              mean_gdp_billion = mean(gdp_billion),
+              sd_gdp_billion = sd(gdp_billion))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
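The message above is informational: after summarizing by two grouping variables, dplyr keeps the first one as a group. A minimal sketch of the `.groups` argument, using a hypothetical `toy` data frame in place of gapminder:

```r
library(dplyr)

# Hypothetical data, for illustration only
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  year      = c(1952, 1957, 1952, 1957),
  gdpPercap = c(1000, 1200, 2000, 2600)
)

# Default: the last grouping variable (year) is dropped, continent is kept,
# and dplyr prints the message shown above
still_grouped <- toy %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap))

# Explicit choice: return an ungrouped result and silence the message
ungrouped <- toy %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap), .groups = "drop")

group_vars(still_grouped)
group_vars(ungrouped)
```

Stating `.groups` explicitly makes it clear to a reader whether later steps in a pipeline operate per continent or on the whole table.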

Connect mutate with logical filtering: ifelse +


When creating new variables, we can combine this with a logical condition. Using ifelse() inside mutate() applies the condition right where it is needed: at the moment of creating something new. This easy-to-read statement is a fast and powerful way of masking certain data (values that fail the condition can be set to NA, while the overall dimension of the data frame does not change) or of updating values depending on the given condition.


R +

+## keeping all data but "filtering" by a certain condition
+# calculate GDP only for observations with a life expectancy above 25
+gdp_pop_bycontinents_byyear_above25 <- gapminder %>%
+    mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop),
+              mean_gdp_billion = mean(gdp_billion),
+              sd_gdp_billion = sd(gdp_billion))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
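One consequence of masking values with NA is worth keeping in mind: by default, mean() and sd() return NA for any group that contains an NA. The sketch below uses a hypothetical `toy` data frame (not gapminder) to show the effect and the `na.rm = TRUE` option.

```r
library(dplyr)

# Hypothetical data, for illustration only
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  lifeExp   = c(24, 50, 60, 70),
  gdp       = c(1, 3, 10, 20)
)

# Mask gdp where life expectancy is 25 or below
masked <- toy %>%
  mutate(gdp_masked = ifelse(lifeExp > 25, gdp, NA))

# Without na.rm, a single NA makes the whole group's mean NA
with_na <- masked %>%
  group_by(continent) %>%
  summarize(mean_gdp = mean(gdp_masked))

# With na.rm = TRUE the masked rows are skipped
without_na <- masked %>%
  group_by(continent) %>%
  summarize(mean_gdp = mean(gdp_masked, na.rm = TRUE))
```

Whether to drop the masked values or let the NA propagate depends on the analysis; the point is that the choice should be made deliberately.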

R +

+## updating only if a certain condition is fulfilled
+# for life expectancies above 40 years, the GDP per capita expected in the future is scaled up
+gdp_future_bycontinents_byyear_high_lifeExp <- gapminder %>%
+    mutate(gdp_futureExpectation = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              mean_gdpPercap_expected = mean(gdp_futureExpectation))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.

Combining dplyr and ggplot2 +


First install and load ggplot2:


R +

+install.packages('ggplot2')

R +

+library("ggplot2")

In the plotting lesson we looked at how to make a multi-panel figure +by adding a layer of facet panels using ggplot2. Here is +the code we used (with some extra comments):


R +

+# Filter countries located in the Americas
+americas <- gapminder[gapminder$continent == "Americas", ]
+# Make the plot
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))

This code makes the right plot but it also creates an intermediate +variable (americas) that we might not have any other uses +for. Just as we used %>% to pipe data along a chain of +dplyr functions we can use it to pass data to +ggplot(). Because %>% replaces the first +argument in a function we don’t need to specify the data = +argument in the ggplot() function. By combining +dplyr and ggplot2 functions we can make the +same figure without creating any new variables or modifying the +data.


R +

+gapminder %>%
+  # Filter countries located in the Americas
+  filter(continent == "Americas") %>%
+  # Make the plot
+  ggplot(mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))

More examples of using the function mutate() and the +ggplot2 package.


R +

+gapminder %>%
+  # extract first letter of country name into new column
+  mutate(startsWith = substr(country, 1, 1)) %>%
+  # only keep countries starting with A or Z
+  filter(startsWith %in% c("A", "Z")) %>%
+  # plot lifeExp into facets
+  ggplot(aes(x = year, y = lifeExp, colour = continent)) +
+  geom_line() +
+  facet_wrap(vars(country)) +
+  theme_minimal()

Advanced Challenge +


Calculate the average life expectancy in 2002 of 2 randomly selected +countries for each continent. Then arrange the continent names in +reverse order. Hint: Use the dplyr +functions arrange() and sample_n(), they have +similar syntax to other dplyr functions.


R +

+lifeExp_2countries_bycontinents <- gapminder %>%
+   filter(year==2002) %>%
+   group_by(continent) %>%
+   sample_n(2) %>%
+   summarize(mean_lifeExp=mean(lifeExp)) %>%
+   arrange(desc(continent))
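Because the countries are drawn at random, the result changes from run to run. A hedged sketch of a reproducible variant follows: it fixes the random seed with set.seed() and uses slice_sample(), which supersedes sample_n() in dplyr 1.0.0 and later. The `toy` data frame and its values are hypothetical.

```r
library(dplyr)

# Hypothetical data, for illustration only
toy <- data.frame(
  continent = rep(c("Africa", "Asia", "Europe"), each = 3),
  country   = paste0("country_", 1:9),
  lifeExp   = c(50, 52, 54, 60, 65, 70, 75, 78, 80)
)

set.seed(42)  # fix the random draw so the result can be reproduced

sampled <- toy %>%
  group_by(continent) %>%
  slice_sample(n = 2) %>%                      # 2 random rows per continent
  summarize(mean_lifeExp = mean(lifeExp)) %>%
  arrange(desc(continent))                     # reverse alphabetical order
```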

Other great resources +


Keypoints +

  • Use the dplyr package to manipulate data frames.
  • Use select() to choose variables from a data frame.
  • Use filter() to choose data based on values.
  • Use group_by() and summarize() to work with subsets of data.
  • Use mutate() to create new variables.
+ + diff --git a/14-tidyr.html b/14-tidyr.html new file mode 100644 index 000000000..74127b3b2 --- /dev/null +++ b/14-tidyr.html @@ -0,0 +1,1160 @@ + +R for Reproducible Scientific Analysis: Data Frame Manipulation with tidyr +

Data Frame Manipulation with tidyr


Last updated on 2023-10-26





  • How can I change the layout of a data frame?


  • To understand the concepts of ‘longer’ and ‘wider’ data frame formats and be able to convert between them with tidyr.

Researchers often want to reshape their data frames from ‘wide’ to ‘longer’ layouts, or vice-versa. The ‘long’ layout or format is where:

  • each column is a variable
  • each row is an observation

In the purely ‘long’ (or ‘longest’) format, you usually have 1 column +for the observed variable and the other columns are ID variables.


For the ‘wide’ format each row is often a site/subject/patient and you have multiple observation variables containing the same type of data. These can be either repeated observations over time, or observation of multiple variables (or a mix of both). Data entry is often simpler in the ‘wide’ format, and some other applications prefer it. However, many of R’s functions have been designed assuming you have ‘longer’ formatted data. This tutorial will help you efficiently transform your data shape regardless of original format.

Diagram illustrating the difference between a wide versus long layout of a data frame

Long and wide data frame layouts mainly affect readability. For +humans, the wide format is often more intuitive since we can often see +more of the data on the screen due to its shape. However, the long +format is more machine readable and is closer to the formatting of +databases. The ID variables in our data frames are similar to the fields +in a database and observed variables are like the database values.
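To make the two layouts concrete, here is a small sketch that stores the same four measurements both ways. The data frames `wide` and `long`, their column names and their values are hypothetical and built by hand in base R.

```r
# Wide: one row per site, one column per repeated measurement
wide <- data.frame(
  site      = c("A", "B"),
  temp_2001 = c(10.1, 12.3),
  temp_2002 = c(10.4, 12.9)
)

# Long: one row per site-year observation; site and year are ID variables
long <- data.frame(
  site = c("A", "A", "B", "B"),
  year = c(2001, 2002, 2001, 2002),
  temp = c(10.1, 10.4, 12.3, 12.9)
)

dim(wide)  # 2 rows, 3 columns
dim(long)  # 4 rows, 3 columns

# In the long layout a per-year summary is a single grouped operation
yearly <- aggregate(temp ~ year, data = long, FUN = mean)
```

The wide table is easier to scan by eye, while the long table lets `year` be used directly as a grouping variable.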


Getting started +


First install the packages if you haven’t already done so (you +probably installed dplyr in the previous lesson):


R +

+#install.packages("tidyr")
+#install.packages("dplyr")


Load the packages:


R +

+library("tidyr")
+library("dplyr")

First, let’s look at the structure of our original gapminder data frame:


R +

+str(gapminder)


'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...

Challenge 1 +


Is gapminder a purely long, purely wide, or some intermediate +format?


The original gapminder data.frame is in an intermediate format. It is not purely long since it has multiple observation variables (pop, lifeExp, gdpPercap).


Sometimes, as with the gapminder dataset, we have multiple types of observed data. It is somewhere in between the purely ‘long’ and ‘wide’ data formats. We have 3 “ID variables” (continent, country, year) and 3 “Observation variables” (pop, lifeExp, gdpPercap). This intermediate format can be preferred despite not having ALL observations in 1 column, given that all 3 observation variables have different units. Few operations would need us to make this data frame any longer (i.e. 4 ID variables and 1 Observation variable).


While using many of the functions in R, which are often vector based, you usually do not want to do mathematical operations on values with different units. For example, using the purely long format, a single mean for all of the values of population, life expectancy, and GDP would not be meaningful since it would return the mean of values with 3 incompatible units. The solution is that we first manipulate the data either by grouping (see the lesson on dplyr), or we change the structure of the data frame. Note: Some plotting functions in R actually work better with data in the wide format.


From wide to long format with pivot_longer() +


Until now, we’ve been using the nicely formatted original gapminder dataset, but ‘real’ data (i.e. our own research data) is rarely so well organized. Here let’s start with the wide formatted version of the gapminder dataset.


Download the wide version of the gapminder data from here and save it in your data +folder.


We’ll load the data file and look at it. Note: we don’t want our continent and country columns to be factors, so we set the stringsAsFactors argument of read.csv() to FALSE. (Since R 4.0.0 this is already the default, so the argument only matters on older versions of R.)


R +

+gap_wide <- read.csv("data/gapminder_wide.csv", stringsAsFactors = FALSE)


'data.frame':	142 obs. of  38 variables:
+ $ continent     : chr  "Africa" "Africa" "Africa" "Africa" ...
+ $ country       : chr  "Algeria" "Angola" "Benin" "Botswana" ...
+ $ gdpPercap_1952: num  2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num  3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num  2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num  3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num  4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num  4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num  5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num  5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num  5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num  4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num  5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num  6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num  43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num  45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num  48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num  51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num  54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num  58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num  61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num  65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num  67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num  69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num  71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num  72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num  9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num  10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num  11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num  12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num  14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num  17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num  20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num  23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num  26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num  29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : int  31287142 10866106 7026113 1630347 12251209 7021078 15929988 4048013 8835739 614382 ...
+ $ pop_2007      : int  33333216 12420476 8078314 1639131 14326203 8390505 17696293 4369038 10238807 710960 ...
Diagram illustrating the wide format of the gapminder data frame

To change this very wide data frame layout back to our nice, +intermediate (or longer) layout, we will use one of the two available +pivot functions from the tidyr package. To +convert from wide to a longer format, we will use the +pivot_longer() function. pivot_longer() makes +datasets longer by increasing the number of rows and decreasing the +number of columns, or ‘lengthening’ your observation variables into a +single variable.

Diagram illustrating how pivot longer reorganizes a data frame from a wide to long format

R +

+gap_long <- gap_wide %>%
+  pivot_longer(
+    cols = c(starts_with('pop'), starts_with('lifeExp'), starts_with('gdpPercap')),
+    names_to = "obstype_year", values_to = "obs_values"
+  )


tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ continent   : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
+ $ country     : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
+ $ obstype_year: chr [1:5112] "pop_1952" "pop_1957" "pop_1962" "pop_1967" ...
+ $ obs_values  : num [1:5112] 9279525 10270856 11000948 12760499 14760787 ...

Here we have used piping syntax which is similar to what we were +doing in the previous lesson with dplyr. In fact, these are compatible +and you can use a mix of tidyr and dplyr functions by piping them +together.


We first provide to pivot_longer() a vector of column names that will be pivoted into longer format. We could type out all the observation variables, but as in the select() function (see dplyr lesson), we can use the starts_with() helper to select all variables that start with the desired character string. pivot_longer() also allows the alternative syntax of using the - symbol to identify which variables are not to be pivoted (i.e. ID variables).


The next arguments to pivot_longer() are names_to for naming the column that will contain the new ID variable (obstype_year) and values_to for naming the new amalgamated observation variable (obs_values). We supply these new column names as strings.

Diagram illustrating the long format of the gapminder data

R +

+gap_long <- gap_wide %>%
+  pivot_longer(
+    cols = c(-continent, -country),
+    names_to = "obstype_year", values_to = "obs_values"
+  )


tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ continent   : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
+ $ country     : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
+ $ obstype_year: chr [1:5112] "gdpPercap_1952" "gdpPercap_1957" "gdpPercap_1962" "gdpPercap_1967" ...
+ $ obs_values  : num [1:5112] 2449 3014 2551 3247 4183 ...

That may seem trivial with this particular data frame, but sometimes +you have 1 ID variable and 40 observation variables with irregular +variable names. The flexibility is a huge time saver!


Now obstype_year actually contains 2 pieces of information, the observation type (pop, lifeExp, or gdpPercap) and the year. We can use the separate() function to split the character strings into multiple variables:


R +

+gap_long <- gap_long %>% separate(obstype_year, into = c('obs_type', 'year'), sep = "_")
+gap_long$year <- as.integer(gap_long$year)
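As an alternative sketch, pivot_longer() can do the pivot, the split and the type conversion in one call through its `names_sep` and `names_transform` arguments. The `toy_wide` data frame below is a hypothetical miniature of gap_wide, not the real file.

```r
library(tidyr)
library(dplyr)

# Hypothetical miniature of the wide gapminder layout
toy_wide <- data.frame(
  country      = c("X", "Y"),
  pop_1952     = c(100, 200),
  pop_1957     = c(110, 220),
  lifeExp_1952 = c(40, 50),
  lifeExp_1957 = c(42, 53)
)

toy_long <- toy_wide %>%
  pivot_longer(
    cols = -country,
    names_to = c("obs_type", "year"),            # two new columns per old name
    names_sep = "_",                             # split old column names at "_"
    names_transform = list(year = as.integer),   # convert year while pivoting
    values_to = "obs_values"
  )
```

The two-step version in the lesson is easier to follow when learning; the one-call version is a design choice that avoids the intermediate obstype_year column.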

Challenge 2 +


Using gap_long, calculate the mean life expectancy, population, and gdpPercap for each continent. Hint: use the group_by() and summarize() functions we learned in the dplyr lesson.


R +

+gap_long %>% group_by(continent, obs_type) %>%
+   summarize(means=mean(obs_values))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.


# A tibble: 15 × 3
+# Groups:   continent [5]
+   continent obs_type       means
+   <chr>     <chr>          <dbl>
+ 1 Africa    gdpPercap     2194. 
+ 2 Africa    lifeExp         48.9
+ 3 Africa    pop        9916003. 
+ 4 Americas  gdpPercap     7136. 
+ 5 Americas  lifeExp         64.7
+ 6 Americas  pop       24504795. 
+ 7 Asia      gdpPercap     7902. 
+ 8 Asia      lifeExp         60.1
+ 9 Asia      pop       77038722. 
+10 Europe    gdpPercap    14469. 
+11 Europe    lifeExp         71.9
+12 Europe    pop       17169765. 
+13 Oceania   gdpPercap    18622. 
+14 Oceania   lifeExp         74.3
+15 Oceania   pop        8874672. 

From long to intermediate format with pivot_wider() +


It is always good to check work. So, let’s use the second +pivot function, pivot_wider(), to ‘widen’ our +observation variables back out. pivot_wider() is the +opposite of pivot_longer(), making a dataset wider by +increasing the number of columns and decreasing the number of rows. We +can use pivot_wider() to pivot or reshape our +gap_long to the original intermediate format or the widest +format. Let’s start with the intermediate format.


The pivot_wider() function takes names_from +and values_from arguments.


To names_from we supply the column name whose contents +will be pivoted into new output columns in the widened data frame. The +corresponding values will be added from the column named in the +values_from argument.
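The roles of the two arguments can be sketched on a tiny long table. `toy_long` and its values are hypothetical; only tidyr and dplyr are assumed.

```r
library(tidyr)
library(dplyr)

# Hypothetical long-format data, for illustration only
toy_long <- data.frame(
  country    = c("X", "X", "Y", "Y"),
  obs_type   = c("pop", "lifeExp", "pop", "lifeExp"),
  obs_values = c(100, 40, 200, 50)
)

# Each distinct value of obs_type becomes a column,
# filled with the matching entries of obs_values
toy_normal <- toy_long %>%
  pivot_wider(names_from = obs_type, values_from = obs_values)
```

Here the four rows collapse to two (one per country), and the `pop` and `lifeExp` columns appear in the order their names first occur in `obs_type`.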


R +

+gap_normal <- gap_long %>%
+  pivot_wider(names_from = obs_type, values_from = obs_values)


[1] 1704    6

R +



[1] 1704    6

R +



[1] "continent" "country"   "year"      "gdpPercap" "lifeExp"   "pop"      

R +



[1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"

Now we’ve got an intermediate data frame gap_normal with +the same dimensions as the original gapminder, but the +order of the variables is different. Let’s fix that before checking if +they are all.equal().


R +

+gap_normal <- gap_normal[, names(gapminder)]
+all.equal(gap_normal, gapminder)


[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                
+[3] "Component \"country\": 1704 string mismatches"                                         
+[4] "Component \"pop\": Mean relative difference: 1.634504"                                 
+[5] "Component \"continent\": 1212 string mismatches"                                       
+[6] "Component \"lifeExp\": Mean relative difference: 0.203822"                             
+[7] "Component \"gdpPercap\": Mean relative difference: 1.162302"                           

R +



# A tibble: 6 × 6
+  country  year      pop continent lifeExp gdpPercap
+  <chr>   <int>    <dbl> <chr>       <dbl>     <dbl>
+1 Algeria  1952  9279525 Africa       43.1     2449.
+2 Algeria  1957 10270856 Africa       45.7     3014.
+3 Algeria  1962 11000948 Africa       48.3     2551.
+4 Algeria  1967 12760499 Africa       51.4     3247.
+5 Algeria  1972 14760787 Africa       54.5     4183.
+6 Algeria  1977 17152804 Africa       58.0     4910.

R +



      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134

We’re almost there: the original was sorted by country, then year, so let’s sort gap_normal the same way.


R +

+gap_normal <- gap_normal %>% arrange(country, year)
+all.equal(gap_normal, gapminder)


[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                

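That’s great! The only differences still reported concern the class attribute: gap_normal is a tibble, while gapminder is a plain data.frame. The values themselves now match, so we’ve gone from the longest format back to the intermediate and we didn’t introduce any errors in our code.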
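If a clean TRUE is wanted from the comparison, one option is to convert the tibble back to a plain data frame first. This is a sketch on hypothetical toy objects (`df`, `tb`); the exact text all.equal() prints for a tibble can vary between package versions, so only the converted comparison is relied on here.

```r
library(dplyr)

# Hypothetical data, for illustration only
df <- data.frame(country = c("X", "Y"), pop = c(100, 200))
tb <- as_tibble(df)   # same values, but class tbl_df/tbl/data.frame

class(df)
class(tb)

# Comparing like with like removes the class mismatch
same_values <- isTRUE(all.equal(as.data.frame(tb), df))
```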


Now let’s convert the long all the way back to the wide. In the wide +format, we will keep country and continent as ID variables and pivot the +observations across the 3 metrics +(pop,lifeExp,gdpPercap) and time +(year). First we need to create appropriate labels for all +our new variables (time*metric combinations) and we also need to unify +our ID variables to simplify the process of defining +gap_wide.


R +

+gap_temp <- gap_long %>% unite(var_ID, continent, country, sep = "_")


tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ var_ID    : chr [1:5112] "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" ...
+ $ obs_type  : chr [1:5112] "gdpPercap" "gdpPercap" "gdpPercap" "gdpPercap" ...
+ $ year      : int [1:5112] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ obs_values: num [1:5112] 2449 3014 2551 3247 4183 ...

R +

+gap_temp <- gap_long %>%
+    unite(ID_var, continent, country, sep = "_") %>%
+    unite(var_names, obs_type, year, sep = "_")


tibble [5,112 × 3] (S3: tbl_df/tbl/data.frame)
+ $ ID_var    : chr [1:5112] "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" ...
+ $ var_names : chr [1:5112] "gdpPercap_1952" "gdpPercap_1957" "gdpPercap_1962" "gdpPercap_1967" ...
+ $ obs_values: num [1:5112] 2449 3014 2551 3247 4183 ...

Using unite() we now have a single ID variable which is a combination of continent and country, and we have defined variable names. We’re now ready to pipe in pivot_wider():


R +

+gap_wide_new <- gap_long %>%
+  unite(ID_var, continent, country, sep = "_") %>%
+  unite(var_names, obs_type, year, sep = "_") %>%
+  pivot_wider(names_from = var_names, values_from = obs_values)


tibble [142 × 37] (S3: tbl_df/tbl/data.frame)
+ $ ID_var        : chr [1:142] "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
+ $ gdpPercap_1952: num [1:142] 2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num [1:142] 3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num [1:142] 2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num [1:142] 3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num [1:142] 4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num [1:142] 4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num [1:142] 5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num [1:142] 5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num [1:142] 5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num [1:142] 4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num [1:142] 5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num [1:142] 6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num [1:142] 43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num [1:142] 45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num [1:142] 48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num [1:142] 51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num [1:142] 54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num [1:142] 58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num [1:142] 61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num [1:142] 65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num [1:142] 67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num [1:142] 69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num [1:142] 71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num [1:142] 72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num [1:142] 9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num [1:142] 10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num [1:142] 11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num [1:142] 12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num [1:142] 14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num [1:142] 17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num [1:142] 20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num [1:142] 23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num [1:142] 26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num [1:142] 29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : num [1:142] 31287142 10866106 7026113 1630347 12251209 ...
+ $ pop_2007      : num [1:142] 33333216 12420476 8078314 1639131 14326203 ...
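As a hedged aside, pivot_wider() also accepts several columns in `names_from` and joins their values with `names_sep` (an underscore by default), which can stand in for the second unite() step. The sketch uses a hypothetical `toy_long` table rather than gap_long.

```r
library(tidyr)
library(dplyr)

# Hypothetical long-format data, for illustration only
toy_long <- data.frame(
  country    = rep(c("X", "Y"), each = 4),
  obs_type   = rep(c("pop", "pop", "lifeExp", "lifeExp"), times = 2),
  year       = rep(c(1952L, 1957L), times = 4),
  obs_values = c(100, 110, 40, 42, 200, 220, 50, 53)
)

# New column names are built as <obs_type>_<year>
toy_wide <- toy_long %>%
  pivot_wider(names_from = c(obs_type, year), values_from = obs_values)

names(toy_wide)
```

Using unite() first, as in the lesson, makes the intermediate names visible for inspection; passing two columns to `names_from` is shorter when that inspection is not needed.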

Challenge 3 +


Take this 1 step further and create a gap_ludicrously_wide data frame by pivoting over countries, year and the 3 metrics. Hint: this new data frame should only have 5 rows.


R +

+gap_ludicrously_wide <- gap_long %>%
+   unite(var_names, obs_type, year, country, sep = "_") %>%
+   pivot_wider(names_from = var_names, values_from = obs_values)

Now we have a great ‘wide’ format data frame, but the ID_var could be more usable. Let’s separate it into 2 variables with separate():


R +

+# Option 1: split the ID column of the wide data frame we already built
+gap_wide_betterID <- separate(gap_wide_new, ID_var, c("continent", "country"), sep="_")
+# Option 2: the same result in a single pipeline starting from gap_long
+gap_wide_betterID <- gap_long %>%
+    unite(ID_var, continent, country, sep = "_") %>%
+    unite(var_names, obs_type, year, sep = "_") %>%
+    pivot_wider(names_from = var_names, values_from = obs_values) %>%
+    separate(ID_var, c("continent","country"), sep = "_")


tibble [142 × 38] (S3: tbl_df/tbl/data.frame)
+ $ continent     : chr [1:142] "Africa" "Africa" "Africa" "Africa" ...
+ $ country       : chr [1:142] "Algeria" "Angola" "Benin" "Botswana" ...
+ $ gdpPercap_1952: num [1:142] 2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num [1:142] 3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num [1:142] 2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num [1:142] 3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num [1:142] 4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num [1:142] 4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num [1:142] 5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num [1:142] 5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num [1:142] 5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num [1:142] 4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num [1:142] 5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num [1:142] 6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num [1:142] 43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num [1:142] 45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num [1:142] 48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num [1:142] 51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num [1:142] 54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num [1:142] 58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num [1:142] 61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num [1:142] 65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num [1:142] 67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num [1:142] 69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num [1:142] 71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num [1:142] 72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num [1:142] 9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num [1:142] 10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num [1:142] 11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num [1:142] 12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num [1:142] 14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num [1:142] 17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num [1:142] 20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num [1:142] 23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num [1:142] 26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num [1:142] 29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : num [1:142] 31287142 10866106 7026113 1630347 12251209 ...
+ $ pop_2007      : num [1:142] 33333216 12420476 8078314 1639131 14326203 ...
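Splitting on a separator works cleanly here because the separator does not occur inside the values themselves. If it did, separate() offers the `extra` argument to control what happens to the surplus pieces. The sketch below uses made-up ID strings (the second one is hypothetical and does not come from the gapminder data).

```r
library(tidyr)
library(dplyr)

# Hypothetical IDs; the second contains the separator inside the country name
ids <- data.frame(ID_var = c("Africa_Algeria", "Asia_Hong_Kong"))

# extra = "merge" splits only at the first "_" and keeps the rest together
split_ids <- ids %>%
  separate(ID_var, into = c("continent", "country"),
           sep = "_", extra = "merge")
```

Choosing a separator in unite() that cannot appear in the data avoids the issue altogether, which is a reasonable default when the values are under your control.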

R +

+all.equal(gap_wide, gap_wide_betterID)


[1] "Attributes: < Component \"class\": Lengths (1, 3) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                

There and back again!


Other great resources +


Keypoints +

  • Use the tidyr package to change the layout of data frames.
  • Use pivot_longer() to go from wide to longer layout.
  • Use pivot_wider() to go from long to wider layout.
+ + diff --git a/15-knitr-markdown.html b/15-knitr-markdown.html new file mode 100644 index 000000000..b8c0f399d --- /dev/null +++ b/15-knitr-markdown.html @@ -0,0 +1,939 @@ + +R for Reproducible Scientific Analysis: Producing Reports With knitr +

Producing Reports With knitr


Last updated on 2023-10-26





  • How can I integrate software and reports?


  • Understand the value of writing reproducible reports
  • Learn how to recognise and compile the basic components of an R Markdown file
  • Become familiar with R code chunks, and understand their purpose, structure and options
  • Demonstrate the use of inline chunks for weaving R outputs into text blocks, for example when discussing the results of some calculations
  • Be aware of alternative output formats to which an R Markdown file can be exported

Data analysis reports +


Data analysts tend to write a lot of reports, describing their +analyses and results, for their collaborators or to document their work +for future reference.


Many new users begin by first writing a single R script containing all of their work, and then share the analysis by emailing the script and various graphs as attachments. But this can be cumbersome, requiring a lengthy discussion to explain which attachment shows which result.


Writing formal reports with Word or LaTeX can simplify this +process by incorporating both the analysis report and output graphs into +a single document. But tweaking formatting to make figures look correct +and fixing obnoxious page breaks can be tedious and lead to a lengthy +“whack-a-mole” game of fixing new mistakes resulting from a single +formatting change.


Creating a report as a web page (which is an html file) using R Markdown makes things easier. The report can be one long stream, so tall figures that wouldn’t ordinarily fit on one page can be kept at full size and are easier to read, since the reader can simply keep scrolling. Additionally, the formatting of an R Markdown document is simple and easy to modify, allowing you to spend more time on your analyses instead of writing reports.


Literate programming +


Ideally, such analysis reports are reproducible documents: +If an error is discovered, or if some additional subjects are added to +the data, you can just re-compile the report and get the new or +corrected results rather than having to reconstruct figures, paste them +into a Word document, and hand-edit various detailed results.


The key R package here is knitr. It allows you +to create a document that is a mixture of text and chunks of code. When +the document is processed by knitr, chunks of code will be +executed, and graphs or other results will be inserted into the final +document.


This sort of idea has been called “literate programming”.
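The mixing of text and executed code can be sketched from the R console as well. The example below is an illustration under stated assumptions, not the workflow the lesson uses (which is the Knit button in RStudio): it writes a tiny R Markdown source to a temporary file and runs knitr::knit() on it, producing Markdown rather than html. The file contents and the names `rmd_lines`, `rmd_file` and `md_file` are made up for the example.

```r
library(knitr)

fence <- strrep("`", 3)  # three backticks, the delimiter of a code chunk

# A minimal R Markdown source: text, one code chunk, one inline expression
rmd_lines <- c(
  "# A tiny report",
  "",
  "Some text, then a code chunk:",
  "",
  paste0(fence, "{r}"),
  "x <- c(2, 4, 6)",
  "mean(x)",
  fence,
  "",
  "The mean of x is `r mean(x)`."
)

rmd_file <- tempfile(fileext = ".Rmd")
md_file  <- tempfile(fileext = ".md")
writeLines(rmd_lines, rmd_file)

# knit() runs each chunk and weaves the results into a Markdown file
knit(rmd_file, output = md_file, quiet = TRUE)
md <- readLines(md_file)
```

The resulting lines in `md` contain the original text, the chunk's code together with its printed result, and the inline expression replaced by its value, so changing `x` and knitting again updates every number in the report.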


knitr allows you to mix basically any type of text with +code from different programming languages, but we recommend that you use +R Markdown, which mixes Markdown with R. Markdown is a light-weight +mark-up language for creating web pages.


Creating an R Markdown file +


Within RStudio, click File → New File → R Markdown and you’ll get a +dialog box like this:

Screenshot of the New R Markdown file dialogue box in RStudio

You can stick with the default (HTML output), but give it a +title.


Basic components of R Markdown +


The initial chunk of text (header) contains instructions for R to +specify what kind of document will be created, and the options chosen. +You can use the header to give your document a title, author, date, and +tell it what type of output you want to produce. In this case, we’re +creating an html document.

---
+title: "Initial R Markdown document"
+author: "Karl Broman"
+date: "April 23, 2015"
+output: html_document
+---

You can delete any of those fields if you don’t want them included. +The double-quotes aren’t strictly necessary in this case. +They’re mostly needed if you want to include a colon in the title.


RStudio creates the document with some example text to get you started. Note that the example contains chunks that begin with a line of three backticks followed by {r} and end with a line of three backticks.


These are chunks of R code that will be executed by knitr and replaced by their results. More on this later.


Markdown +


Markdown is a system for writing web pages by marking up the text +much as you would in an email rather than writing html code. The +marked-up text gets converted to html, replacing the marks with +the proper html code.


For now, let’s delete all of the stuff that’s there and write a bit +of markdown.


You make things bold using two asterisks, like this: +**bold**, and you make things italics by using +underscores, like this: _italics_.


You can make a bulleted list by starting each item with a hyphen or an asterisk, leaving a blank line between the list and any other text, like this:

A list:

* bold with double-asterisks
* italics with underscores
* code-type font with backticks

or like this:

A second list:

- bold with double-asterisks
- italics with underscores
- code-type font with backticks

Each will appear as:

  • bold with double-asterisks
  • italics with underscores
  • code-type font with backticks

You can use whatever method you prefer, but be consistent. +This maintains the readability of your code.


You can make a numbered list by just using numbers. You can even use +the same number over and over if you want:

1. bold with double-asterisks
1. italics with underscores
1. code-type font with backticks

This will appear as:

  1. bold with double-asterisks
  2. italics with underscores
  3. code-type font with backticks

You can make section headers of different sizes by initiating a line +with some number of # symbols:

# Title
## Main section
### Sub-section
#### Sub-sub section

You compile the R Markdown document to an html webpage by +clicking the “Knit” button in the upper-left.

+ +

Challenge 1 +


Create a new R Markdown document. Delete all of the R code chunks and +write a bit of Markdown (some sections, some italicized text, and an +itemized list).


Convert the document to a webpage.

+ +

In RStudio, select File > New file > R Markdown…


Delete the placeholder text and add the following:

# Introduction
## Background on Data
This report uses the *gapminder* dataset, which has columns that include:

* country
* continent
* year
* lifeExp
* pop
* gdpPercap

## Background on Methods

Then click the ‘Knit’ button on the toolbar to generate an html +document (webpage).


A bit more Markdown +


You can make a hyperlink like this: +[Carpentries Home Page](https://carpentries.org/).


You can include an image file like this: +![The Carpentries Logo](https://carpentries.org/assets/img/TheCarpentries.svg)


You can do subscripts (e.g., F₂) with F~2~ and superscripts (e.g., F²) with F^2^.


If you know how to write equations in LaTeX, you can use +$ $ and $$ $$ to insert math equations, like +$E = mc^2$ and

$$y = \mu + \sum_{i=1}^p \beta_i x_i + \epsilon$$

You can review Markdown syntax by navigating to the “Markdown Quick +Reference” under the “Help” field in the toolbar at the top of +RStudio.


R code chunks +


The real power of Markdown comes from mixing markdown with chunks of code. This is R Markdown. When processed, the R code will be executed; if it produces figures, the figures will be inserted in the final document.


The main code chunks look like this:

+```{r load_data}

That is, you place a chunk of R code between ```{r chunk_name} and ```. You should give each chunk a unique name: names help you locate errors and, if any graphs are produced, the figure file names are based on the name of the code chunk that produced them. You can create code chunks quickly in RStudio using the shortcuts Ctrl+Alt+I on Windows and Linux, or Cmd+Option+I on Mac.
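As a minimal sketch of how the header and a named chunk fit together, the R code below writes a small R Markdown file to a temporary location. The chunk name `summarise_cars`, the use of the built-in `cars` data set, and the temporary file are illustrative assumptions rather than part of the lesson's files; the commented-out last line assumes the rmarkdown package is installed.

```r
# Build a tiny R Markdown document as a character vector (one element per line).
fence <- strrep("`", 3)  # three backticks, built here to avoid nesting fences
rmd_lines <- c(
  "---",
  'title: "Initial R Markdown document"',
  "output: html_document",
  "---",
  "",
  "Some **bold** text, then a named code chunk:",
  "",
  paste0(fence, "{r summarise_cars}"),
  "summary(cars)",
  fence
)

# Write it to a temporary .Rmd file (a stand-in for a file in your project).
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)

# Compiling is what the Knit button does; uncomment if rmarkdown is installed.
# rmarkdown::render(rmd_path)
```

Opening the written file in RStudio should show the same three parts described above: the header, some Markdown text, and one named chunk.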

+ +

Challenge 2 +


Add code chunks to:

  • Load the ggplot2 package
  • Read the gapminder data
  • Create a plot
+ +
+```{r load-ggplot2}
+```{r read-gapminder-data}
+```{r make-plot}
+plot(lifeExp ~ year, data = gapminder)
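The bodies of the first two chunks are not shown above. As a hedged sketch of what the three chunks might contain, the code below uses a three-row stand-in data frame with made-up values in place of the real gapminder file, so the file path is deliberately left out; the null PDF device is only there so the example writes no plot file.

```r
# Chunk 1: load ggplot2 if it is installed (the lesson assumes it is).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
}

# Chunk 2: read the data. A made-up three-row stand-in is parsed from text;
# in the lesson this would be read.csv() on the gapminder file.
gapminder <- read.csv(text = "country,year,lifeExp
Senegal,1952,37.3
Senegal,1977,48.9
Senegal,2007,63.1")

# Chunk 3: make the plot (sent to a null PDF device so nothing is written).
pdf(NULL)
plot(lifeExp ~ year, data = gapminder)
invisible(dev.off())
```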

How things get compiled +


When you press the “Knit” button, the R Markdown document is +processed by knitr +and a plain Markdown document is produced (as well as, potentially, a +set of figure files): the R code is executed and replaced by both the +input and the output; if figures are produced, links to those figures +are included.


The Markdown and figure documents are then processed by the tool pandoc, which converts the +Markdown file into an html file, with the figures embedded.


Chunk options +


There are a variety of options to affect how the code chunks are +treated. Here are some examples:

  • Use echo=FALSE to avoid having the code itself shown.
  • Use results="hide" to avoid having any results printed.
  • Use eval=FALSE to have the code shown but not evaluated.
  • Use warning=FALSE and message=FALSE to hide any warnings or messages produced.
  • Use fig.height and fig.width to control the size of the figures produced (in inches).

So you might write:

+```{r load_libraries, echo=FALSE, message=FALSE}

Often there will be particular options that you’ll want to use +repeatedly; for this, you can set global chunk options, like +so:

+```{r global_options, echo=FALSE}
+knitr::opts_chunk$set(fig.path="Figs/", message=FALSE, warning=FALSE,
+                      echo=FALSE, results="hide", fig.width=11)

The fig.path option defines where the figures will be +saved. The / here is really important; without it, the +figures would be saved in the standard place but just with names that +begin with Figs.


If you have multiple R Markdown files in a common directory, you +might want to use fig.path to define separate prefixes for +the figure file names, like fig.path="Figs/cleaning-" and +fig.path="Figs/analysis-".
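To see why the trailing slash matters, it can help to think of fig.path as a plain prefix that is pasted in front of the chunk name. The helper below is a hypothetical illustration of that idea, not knitr's own code, and the chunk label `make-plot` is an assumed example.

```r
# Hypothetical helper mimicking how a figure file name is assembled:
# the fig.path prefix is pasted directly in front of the chunk label.
figure_file <- function(fig_path, chunk_label, index = 1, ext = "png") {
  paste0(fig_path, chunk_label, "-", index, ".", ext)
}

figure_file("Figs/", "make-plot")           # "Figs/make-plot-1.png"
figure_file("Figs", "make-plot")            # "Figsmake-plot-1.png"
figure_file("Figs/cleaning-", "make-plot")  # "Figs/cleaning-make-plot-1.png"
```

With the slash the figure lands inside a Figs directory; without it, Figs simply becomes the start of the file name.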

+ +

Challenge 3 +


Use chunk options to control the size of a figure and to hide the +code.

+ +
+```{r echo = FALSE, fig.width = 3}

You can review all of the R chunk options by navigating +to the “R Markdown Cheat Sheet” under the “Cheatsheets” section of the +“Help” field in the toolbar at the top of RStudio.


Inline R code +


You can make every number in your report reproducible. Use +`r and ` for an in-line code chunk, like so: +`r round(some_value, 2)`. The code will be executed and +replaced with the value of the result.


Don’t let these in-line chunks get split across lines.


Perhaps precede the paragraph with a larger code chunk that does +calculations and defines variables, with include=FALSE for +that larger chunk (which is the same as echo=FALSE and +results="hide").


Rounding can produce differences in output in such situations. You +may want 2.0, but round(2.03, 1) will give +just 2.


The myround +function in the R/broman +package handles this.
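If you would rather not depend on an extra package, base R can also keep the trailing zero. This is a small sketch of two base R alternatives, not a description of how myround itself is implemented; note that both return character strings rather than numbers.

```r
x <- 2.03

round(x, 1)                      # prints as 2: the trailing zero is dropped
format(round(x, 1), nsmall = 1)  # "2.0"
sprintf("%.1f", x)               # "2.0"
```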

+ +

Challenge 4 +


Try out a bit of in-line R code.

+ +

Here’s some inline code to determine that 2 + 2 = 4.


Other output options +


You can also convert R Markdown to a PDF or a Word document. Click +the little triangle next to the “Knit” button to get a drop-down menu. +Or you could put pdf_document or word_document +in the initial header of the file.

+ +

Tip: Creating PDF documents +


Creating .pdf documents may require installation of some extra +software. The R package tinytex provides some tools to help +make this process easier for R users. With tinytex +installed, run tinytex::install_tinytex() to install the +required software (you’ll only need to do this once) and then when you +knit to pdf tinytex will automatically detect and install +any additional LaTeX packages that are needed to produce the pdf +document. Visit the tinytex +website for more information.

+ +

Tip: Visual markdown editing in RStudio +


RStudio versions 1.4 and later include visual markdown editing mode. +In visual editing mode, markdown expressions (like +**bold words**) are transformed to the formatted appearance +(bold words) as you type. This mode also includes a +toolbar at the top with basic formatting buttons, similar to what you +might see in common word processing software programs. You can turn +visual editing on and off by pressing the button in the top right corner of your +R Markdown document.


Resources +

+ +

Keypoints +

  • Mix reporting written in R Markdown with software written in R.
  • Specify chunk options to control formatting.
  • Use knitr to convert these documents into PDF and other formats.
+ + diff --git a/16-wrap-up.html b/16-wrap-up.html new file mode 100644 index 000000000..9bed07855 --- /dev/null +++ b/16-wrap-up.html @@ -0,0 +1,587 @@ + +R for Reproducible Scientific Analysis: Writing Good Software +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Writing Good Software


Last updated on 2023-10-26

+ + + +
+ +
+ + + +




  • How can I write software that other people can use?


  • Describe best practices for writing R and explain the justification for each.

Structure your project folder +


Keep your project folder structured, organized and tidy, by creating +subfolders for your code files, manuals, data, binaries, output plots, +etc. It can be done completely manually, or with the help of RStudio’s +New Project functionality, or a designated package, such as +ProjectTemplate.

+ +

Tip: ProjectTemplate - a possible +solution +


One way to automate the management of projects is to install the +third-party package, ProjectTemplate. This package will set +up an ideal directory structure for project management. This is very +useful as it enables you to have your analysis pipeline/workflow +organised and structured. Together with the default RStudio project +functionality and Git you will be able to keep track of your work as +well as be able to share your work with collaborators.

  1. Install ProjectTemplate.
  2. Load the library
  3. Initialise the project:

R +

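library("ProjectTemplate")  # load the package (install it first with install.packages("ProjectTemplate"))
create.project("../my_project_2", merge.strategy = "allow.non.conflict")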

For more information on ProjectTemplate and its functionality visit +the home page ProjectTemplate


Make code readable +


The most important part of writing code is making it readable and understandable. You want someone else to be able to pick up your code and understand what it does: more often than not this someone will be you, six months down the line, who will otherwise be cursing your past self.


Documentation: tell us what and why, not how +


When you first start out, your comments will often describe what a +command does, since you’re still learning yourself and it can help to +clarify concepts and remind you later. However, these comments aren’t +particularly useful later on when you don’t remember what problem your +code is trying to solve. Try to also include comments that tell you +why you’re solving a problem, and what problem that +is. The how can come after that: it’s an implementation detail +you ideally shouldn’t have to worry about.


Keep your code modular +


Our recommendation is that you should separate your functions from +your analysis scripts, and store them in a separate file that you +source when you open the R session in your project. This +approach is nice because it leaves you with an uncluttered analysis +script, and a repository of useful functions that can be loaded into any +analysis script in your project. It also lets you group related +functions together easily.
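A minimal sketch of this layout follows. A temporary file stands in for a functions file in your project, and the function `fahr_to_celsius` is an illustrative assumption.

```r
# Write a small functions file (a temporary file stands in for a
# hypothetical functions file kept alongside your analysis script).
functions_file <- tempfile(fileext = ".R")
writeLines(c(
  "# Convert a temperature from Fahrenheit to Celsius",
  "fahr_to_celsius <- function(temp_f) {",
  "  (temp_f - 32) * 5 / 9",
  "}"
), functions_file)

# In the analysis script: load the functions first, then use them.
source(functions_file)
boiling_c <- fahr_to_celsius(212)
boiling_c
```

The analysis script then contains only the `source()` call and the steps of the analysis, while the function definitions live in one place.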


Break down problem into bite size pieces +


When you first start out, problem solving and function writing can be daunting tasks, and hard to separate from coding inexperience. Try to break down your problem into digestible chunks and worry about the implementation details later: keep breaking down the problem into smaller and smaller functions until you reach a point where you can code a solution, and build back up from there.


Know that your code is doing the right thing +


Make sure to test your functions!
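One lightweight way to do this in base R is a small function of checks that you can re-run whenever the code changes. The function `center` below is a hypothetical example; dedicated packages such as testthat offer a fuller framework.

```r
# Function under test (hypothetical example): shift values to a desired mean.
center <- function(values, desired_mean) {
  values - mean(values) + desired_mean
}

# Simple re-runnable checks: stopifnot() is silent when every condition
# holds and raises an error naming the first one that fails.
test_center <- function() {
  centered <- center(c(1, 2, 3), 0)
  stopifnot(
    isTRUE(all.equal(mean(centered), 0)),
    isTRUE(all.equal(centered, c(-1, 0, 1))),
    length(centered) == 3
  )
  invisible(TRUE)
}

test_center()
```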


Don’t repeat yourself +


Functions enable easy reuse within a project. If you see blocks of +similar lines of code through your project, those are usually candidates +for being moved into functions.


If your calculations are performed through a series of functions, then the project becomes more modular and easier to change. This is especially true for functions where a particular input always gives a particular output.
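As an illustration with made-up data, the sketch below shows the same rescaling written out twice and then moved into a function; the name `rescale01` is an assumption made for this example.

```r
heights_cm <- c(150, 160, 170)
weights_kg <- c(50, 60, 70)

# Repetitive version: the same rescaling written out twice.
heights_scaled <- (heights_cm - min(heights_cm)) / (max(heights_cm) - min(heights_cm))
weights_scaled <- (weights_kg - min(weights_kg)) / (max(weights_kg) - min(weights_kg))

# Refactored version: the logic lives in one place and is reused.
rescale01 <- function(values) {
  (values - min(values)) / (max(values) - min(values))
}

heights_scaled_2 <- rescale01(heights_cm)
weights_scaled_2 <- rescale01(weights_kg)
```

If the rescaling rule ever changes, only `rescale01` needs editing.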


Remember to be stylish +


Apply consistent style to your code.

+ +

Keypoints +

  • Keep your project folder structured, organized and tidy.
  • Document what and why, not how.
  • Break programs into short single-purpose functions.
  • Write re-runnable tests.
  • Don’t repeat yourself.
  • Be consistent in naming, indentation, and other aspects of style.
+ + diff --git a/404.html b/404.html new file mode 100644 index 000000000..2c0bde5ad --- /dev/null +++ b/404.html @@ -0,0 +1,451 @@ + +R for Reproducible Scientific Analysis: Page not found +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Page not found

+ +

Our apologies! +


We cannot seem to find the page you are looking for. Here are some +tips that may help:

  1. try going back to the previous page or
  2. navigate to any other page using the navigation bar on the left.
  3. if the URL ends with /index.html, try removing that.
  4. head over to the home page of this lesson

If you came here from a link in this lesson, please contact the +lesson maintainers using the links at the foot of this page.

+ + diff --git a/CODE_OF_CONDUCT.html b/CODE_OF_CONDUCT.html new file mode 100644 index 000000000..f2d43ce19 --- /dev/null +++ b/CODE_OF_CONDUCT.html @@ -0,0 +1,450 @@ + +R for Reproducible Scientific Analysis: Contributor Code of Conduct +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Contributor Code of Conduct


Last updated on 2023-10-26

+ + + +
+ +
+ + + +

As contributors and maintainers of this project, we pledge to follow The Carpentries Code of Conduct.


Instances of abusive, harassing, or otherwise unacceptable behavior +may be reported by following our reporting +guidelines.

+ + + +
+ + diff --git a/LICENSE.html b/LICENSE.html new file mode 100644 index 000000000..fd0be828c --- /dev/null +++ b/LICENSE.html @@ -0,0 +1,501 @@ + +R for Reproducible Scientific Analysis: Licenses +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +



Last updated on 2023-10-26

+ + + +
+ +
+ + + +

Instructional Material +


All Carpentries (Software Carpentry, Data Carpentry, and Library +Carpentry) instructional material is made available under the Creative Commons +Attribution license. The following is a human-readable summary of +(and not a substitute for) the full legal +text of the CC BY 4.0 license.


You are free:

  • to Share: copy and redistribute the material in any medium or format
  • to Adapt: remix, transform, and build upon the material

for any purpose, even commercially.


The licensor cannot revoke these freedoms as long as you follow the +license terms.


Under the following terms:

  • Attribution: You must give appropriate credit (mentioning that your work is derived from work that is Copyright (c) The Carpentries and, where practical, linking to https://carpentries.org/), provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
  • No additional restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

With the understanding that:

  • You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation.
  • No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.

Software +


Except where otherwise noted, the example programs and other software +provided by The Carpentries are made available under the OSI-approved MIT +license.


Permission is hereby granted, free of charge, to any person obtaining +a copy of this software and associated documentation files (the +“Software”), to deal in the Software without restriction, including +without limitation the rights to use, copy, modify, merge, publish, +distribute, sublicense, and/or sell copies of the Software, and to +permit persons to whom the Software is furnished to do so, subject to +the following conditions:


The above copyright notice and this permission notice shall be +included in all copies or substantial portions of the Software.




Trademark +


“The Carpentries”, “Software Carpentry”, “Data Carpentry”, and +“Library Carpentry” and their respective logos are registered trademarks +of Community Initiatives.

+ + diff --git a/aio.html b/aio.html new file mode 100644 index 000000000..ea9c08cd6 --- /dev/null +++ b/aio.html @@ -0,0 +1,12657 @@ + + + + + +R for Reproducible Scientific Analysis: All in One View + + + + + + + + + + + +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + + +
+ + +

Content from Introduction to R and RStudio


Last updated on 2023-10-26

+ +




  • How to find your way around RStudio?
  • How to interact with R?
  • How to manage your environment?
  • How to install packages?


  • Describe the purpose and use of each pane in the RStudio IDE
  • Locate buttons and options in the RStudio IDE
  • Define a variable
  • Assign data to a variable
  • Manage a workspace in an interactive R session
  • Use mathematical and comparison operators
  • Call functions
  • Manage packages

Motivation +


Science is a multi-step process: once you’ve designed an experiment +and collected data, the real fun begins! This lesson will teach you how +to start this process using R and RStudio. We will begin with raw data, +perform exploratory analyses, and learn how to plot results graphically. +This example starts with a dataset from gapminder.org containing population +information for many countries through time. Can you read the data into +R? Can you plot the population for Senegal? Can you calculate the +average income for countries on the continent of Asia? By the end of +these lessons you will be able to do things like plot the populations +for all of these countries in under a minute!


Before Starting The Workshop +


Please ensure you have the latest version of R and RStudio installed +on your machine. This is important, as some packages used in the +workshop may not install correctly (or at all) if R is not up to +date.


Introduction to RStudio +


Welcome to the R portion of the Software Carpentry workshop.


Throughout this lesson, we’re going to teach you some of the +fundamentals of the R language as well as some best practices for +organizing code for scientific projects that will make your life +easier.


We’ll be using RStudio: a free, open-source R Integrated Development +Environment (IDE). It provides a built-in editor, works on all platforms +(including on servers) and provides many advantages such as integration +with version control and project management.


Basic layout


When you first open RStudio, you will be greeted by three panels:

  • The interactive R console/Terminal (entire left)
  • Environment/History/Connections (tabbed in upper right)
  • Files/Plots/Packages/Help/Viewer (tabbed in lower right)
RStudio layout

Once you open files, such as R scripts, an editor panel will also +open in the top left.

RStudio layout with .R file open
+ +

R scripts +


Any commands that you write in the R console can be saved to a file to be run again. Files containing R code to be run in this way are called R scripts. R scripts have .R at the end of their names to let you know what they are.


Workflow within RStudio +


There are two main ways one can work within RStudio:

  1. Test and play within the interactive R console then copy code into a .R file to run later.
     • This works well when doing small tests and initially starting off.
     • It quickly becomes laborious
  2. Start writing in a .R file and use RStudio’s short cut keys for the Run command to push the current line, selected lines or modified lines to the interactive R console.
     • This is a great way to start; all your code is saved for later
     • You will be able to run the file you create from within RStudio or using R’s source() function.
+ +

Tip: Running segments of your code +


RStudio offers you great flexibility in running code from within the editor window. There are buttons, menu choices, and keyboard shortcuts. To run the current line, you can

  1. click on the Run button above the editor panel, or
  2. select “Run Lines” from the “Code” menu, or
  3. hit Ctrl+Return in Windows or Linux or Cmd+Return on OS X. (This shortcut can also be seen by hovering the mouse over the button.)

To run a block of code, select it and then Run. If you have modified a line of code within a block of code you have just run, there is no need to reselect the section and Run; you can use the next button along, Re-run the previous region. This will run the previous code block including the modifications you have made.

Introduction to R +


Much of your time in R will be spent in the R interactive console. +This is where you will run all of your code, and can be a useful +environment to try out ideas before adding them to an R script file. +This console in RStudio is the same as the one you would get if you +typed in R in your command-line environment.


The first thing you will see in the R interactive session is a bunch +of information, followed by a “>” and a blinking cursor. In many ways +this is similar to the shell environment you learned about during the +shell lessons: it operates on the same idea of a “Read, evaluate, print +loop”: you type in commands, R tries to execute them, and then returns a +result.


Using R as a calculator +


The simplest thing you could do with R is to do arithmetic:


R +

+1 + 100


[1] 101

And R will print out the answer, with a preceding “[1]”. [1] is the +index of the first element of the line being printed in the console. For +more information on indexing vectors, see Episode +6: Subsetting Data.


If you type in an incomplete command, R will wait for you to complete it. If you are familiar with Unix Shell’s bash, you may recognize this behavior from bash.


R +

> 1 +



Any time you hit return and the R session shows a “+” instead of a +“>”, it means it’s waiting for you to complete the command. If you +want to cancel a command you can hit Esc and RStudio will +give you back the “>” prompt.

+ +

Tip: Canceling commands +


If you’re using R from the command line instead of from within +RStudio, you need to use Ctrl+C instead of +Esc to cancel the command. This applies to Mac users as +well!


Canceling a command isn’t only useful for killing incomplete +commands: you can also use it to tell R to stop running code (for +example if it’s taking much longer than you expect), or to get rid of +the code you’re currently writing.


When using R as a calculator, the order of operations is the same as +you would have learned back in school.


From highest to lowest precedence:

  • Parentheses: (, )
  • Exponents: ^ or **
  • Multiply: *
  • Divide: /
  • Add: +
  • Subtract: -

R +

+3 + 5 * 2


[1] 13

Use parentheses to group operations in order to force the order of +evaluation if it differs from the default, or to make clear what you +intend.


R +

+(3 + 5) * 2


[1] 16

This can get unwieldy when not needed, but clarifies your intentions. +Remember that others may later read your code.


R +

(3 + (5 * (2 ^ 2))) # hard to read
3 + 5 * 2 ^ 2       # clear, if you remember the rules
3 + 5 * (2 ^ 2)     # if you forget some rules, this might help

The text after each line of code is called a “comment”. Anything that +follows after the hash (or octothorpe) symbol # is ignored +by R when it executes code.


Really small or large numbers get a scientific notation:


R +



[1] 2e-04

The e is shorthand for “multiplied by 10^XX”, so 2e-4 is shorthand for 2 * 10^(-4).
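The command that produced the output above is not shown. As a small sketch, a division such as the one below gives a value that R prints in scientific notation, and the two ways of writing the number can be compared directly; the specific expressions are assumptions chosen to match the printed value.

```r
small <- 2 / 10000   # a small number, printed by R as 2e-04
small

# 2e-4 and 2 * 10^(-4) describe the same value (compared with a tolerance).
same_value <- isTRUE(all.equal(2e-4, 2 * 10^(-4)))
same_value

# Scientific notation also works for large numbers.
5e3 == 5000
```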


You can write numbers in scientific notation too:


R +

+5e3  # Note the lack of minus here


[1] 5000

Mathematical functions +


R has many built-in mathematical functions. To call a function, we can type its name, followed by open and closing parentheses. Functions take arguments as inputs; anything we type inside the parentheses of a function is considered an argument. Depending on the function, the number of arguments can vary from none to multiple. For example:


R +

+getwd() #returns an absolute filepath

doesn’t require an argument, whereas for the next set of mathematical +functions we will need to supply the function a value in order to +compute the result.


R +

+sin(1)  # trigonometry functions


[1] 0.841471

R +

+log(1)  # natural logarithm


[1] 0

R +

+log10(10) # base-10 logarithm


[1] 1

R +

+exp(0.5) # e^(1/2)


[1] 1.648721

Don’t worry about trying to remember every function in R. You can +look them up on Google, or if you can remember the start of the +function’s name, use the tab completion in RStudio.


This is one advantage that RStudio has over R on its own: it has auto-completion abilities that allow you to more easily look up functions, their arguments, and the values that they take.


Typing a ? before the name of a command will open the help page for that command. When using RStudio, this will open the ‘Help’ pane; if using R in the terminal, the help page will open in the terminal's pager or in your browser, depending on your platform and settings. The help page will include a detailed description of the command and how it works. Scrolling to the bottom of the help page will usually show a collection of code examples which illustrate command usage. We’ll go through an example later.


Comparing things +


We can also do comparisons in R:


R +

+1 == 1  # equality (note two equals signs, read as "is equal to")


[1] TRUE

R +

+1 != 2  # inequality (read as "is not equal to")


[1] TRUE

R +

+1 < 2  # less than


[1] TRUE

R +

+1 <= 1  # less than or equal to


[1] TRUE

R +

+1 > 0  # greater than


[1] TRUE

R +

+1 >= -9 # greater than or equal to


[1] TRUE
+ +

Tip: Comparing Numbers +


A word of warning about comparing numbers: you should never use +== to compare two numbers unless they are integers (a data +type which can specifically represent only whole numbers).


Computers may only represent decimal numbers with a certain degree of +precision, so two numbers which look the same when printed out by R, may +actually have different underlying representations and therefore be +different by a small margin of error (called Machine numeric +tolerance).


Instead you should use the all.equal function.
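A short sketch of the difference, using the classic example of adding 0.1 and 0.2. Because all.equal returns either TRUE or a description of the difference, it is wrapped in isTRUE when a plain TRUE/FALSE is needed.

```r
a <- 0.1 + 0.2
b <- 0.3

a == b                   # FALSE: the two stored values differ very slightly
isTRUE(all.equal(a, b))  # TRUE: equal within a small numeric tolerance
```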


Further reading: http://floating-point-gui.de/


Variables and assignment +


We can store values in variables using the assignment operator +<-, like this:


R +

+x <- 1/40

Notice that assignment does not print a value. Instead, we stored it +for later in something called a variable. +x now contains the value +0.025:


R +



[1] 0.025

More precisely, the stored value is a decimal approximation +of this fraction called a floating point +number.


Look for the Environment tab in the top right panel of +RStudio, and you will see that x and its value have +appeared. Our variable x can be used in place of a number +in any calculation that expects a number:


R +



[1] -3.688879
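The commands that produced the two outputs above are not shown. The printed values are consistent with the following sketch, which assumes the natural logarithm was the calculation used:

```r
x <- 1/40   # assignment prints nothing
x           # typing the name prints the stored value, 0.025
log(x)      # natural logarithm of x, about -3.688879
```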

Notice also that variables can be reassigned:


R +

+x <- 100

x used to contain the value 0.025 and now it has the +value 100.


Assignment values can contain the variable being assigned to:


R +

+x <- x + 1 #notice how RStudio updates its description of x on the top right tab
+y <- x * 2

The right hand side of the assignment can be any valid R expression. +The right hand side is fully evaluated before the assignment +occurs.


Variable names can contain letters, numbers, underscores and periods but no spaces. They must start with a letter or a period followed by a letter (they cannot start with a number or an underscore). Variables beginning with a period are hidden variables. Different people use different conventions for long variable names; these include:

  • periods.between.words
  • underscores_between_words
  • camelCaseToSeparateWords

What you use is up to you, but be consistent.
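If you are unsure whether a name follows these rules, one way to check is with the base R function make.names, which rewrites a string into a valid name: a string that comes back unchanged was already valid. The candidate names below are made up for illustration.

```r
candidates <- c("total_mass", "total.mass", "totalMass", ".hidden",
                "2nd_try", "_tmp", "my var")

# A name is valid when make.names() has nothing to fix.
is_valid <- make.names(candidates) == candidates

data.frame(name = candidates, valid = is_valid)
```

Here the first four names pass, while the ones starting with a number or an underscore, or containing a space, do not.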


It is also possible to use the = operator for +assignment:


R +

+x = 1/40

But this is much less common among R users. The most important thing +is to be consistent with the operator you use. There +are occasionally places where it is less confusing to use +<- than =, and it is the most common symbol +used in the community. So the recommendation is to use +<-.

+ +

Challenge 1 +


Which of the following are valid R variable names?


R +

+ +

The following can be used as R variables:


R +


The following creates a hidden variable:


R +


The following will not be able to be used to create a variable


R +


Vectorization +


One final thing to be aware of is that R is vectorized, meaning that variables and functions can have vectors as values. In contrast to physics and mathematics, a vector in R is an ordered set of values that all have the same data type. For example:


R +



[1] 1 2 3 4 5

R +



[1]  2  4  8 16 32

R +

x <- 1:5
2^x


[1]  2  4  8 16 32

This is incredibly powerful; we will discuss this further in an +upcoming lesson.
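As a brief sketch of what vectorization means in practice, each operation below is applied to every element of the vector at once, with no loop required. The expressions are illustrative and chosen to be consistent with the outputs shown above.

```r
x <- 1:5

powers <- 2^x        # 2 4 8 16 32: the exponent is applied element by element
scaled <- x * 10     # 10 20 30 40 50
shifted <- x + x     # 2 4 6 8 10: two vectors are combined element-wise

powers
scaled
shifted
```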


Managing your environment +


There are a few useful commands you can use to interact with the R +session.


ls will list all of the variables and functions stored +in the global environment (your working R session):


R +



[1] "x" "y"
+ +

Tip: hidden objects +


Like in the shell, ls will hide any variables or +functions starting with a “.” by default. To list all objects, type +ls(all.names=TRUE) instead


Note here that we didn’t give any arguments to ls, but +we still needed to give the parentheses to tell R to call the +function.
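The sketch below shows the effect of all.names. It works in a separate scratch environment, an assumption made so that the example does not depend on what is already in your session; the object names are made up.

```r
# A scratch environment keeps this example independent of your session.
scratch <- new.env()
assign("x", 1:5, envir = scratch)
assign("y", 10, envir = scratch)
assign(".secret", "hidden", envir = scratch)

visible_names <- ls(scratch)                 # names starting with "." are skipped
all_names <- ls(scratch, all.names = TRUE)   # hidden names are included

visible_names
all_names
```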


If we type ls by itself, R prints a bunch of code +instead of a listing of objects.


R +



function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE, 
    pattern, sorted = TRUE) 
{
    if (!missing(name)) {
        pos <- tryCatch(name, error = function(e) e)
        if (inherits(pos, "error")) {
            name <- substitute(name)
            if (!is.character(name)) 
                name <- deparse(name)
            warning(gettextf("%s converted to character string", 
                sQuote(name)), domain = NA)
            pos <- name
        }
    }
    all.names <- .Internal(ls(envir, all.names, sorted))
    if (!missing(pattern)) {
        if ((ll <- length(grep("[", pattern, fixed = TRUE))) && 
            ll != length(grep("]", pattern, fixed = TRUE))) {
            if (pattern == "[") {
                pattern <- "\\["
                warning("replaced regular expression pattern '[' by  '\\\\['")
            }
            else if (length(grep("[^\\\\]\\[<-", pattern))) {
                pattern <- sub("\\[<-", "\\\\\\[<-", pattern)
                warning("replaced '[<-' by '\\\\[<-' in regular expression pattern")
            }
        }
        grep(pattern, all.names, value = TRUE)
    }
    else all.names
}
<bytecode: 0x557b0600c360>
<environment: namespace:base>
What’s going on here?


Like everything in R, ls is the name of an object, and +entering the name of an object by itself prints the contents of the +object. The object x that we created earlier contains 1, 2, +3, 4, 5:


R +



[1] 1 2 3 4 5

The object ls contains the R code that makes the +ls function work! We’ll talk more about how functions work +and start writing our own later.


You can use rm to delete objects you no longer need:


R +


If you have lots of things in your environment and want to delete all +of them, you can pass the results of ls to the +rm function:


R +

+rm(list = ls())

In this case we’ve combined the two. Like the order of operations, +anything inside the innermost parentheses is evaluated first, and so +on.
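The same nesting can be tried out safely in a scratch environment, so that your real workspace is left alone. This is only a sketch: the environment `scratch` and the object names in it are made up for illustration.

```r
# Hypothetical scratch environment so the global workspace is untouched
scratch <- new.env()
assign("x", 1:5, envir = scratch)
assign("y", "hello", envir = scratch)

# The inner ls() call is evaluated first; its result becomes
# the `list` argument of rm()
objects_before <- ls(envir = scratch)
rm(list = ls(envir = scratch), envir = scratch)
objects_after <- ls(envir = scratch)

print(objects_before)
print(objects_after)
```

After the call to rm, the scratch environment should be empty.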


In this case we’ve specified that the results of ls should be used for the list argument in rm. When assigning values to arguments by name, you must use the = operator!


If instead we use <-, there will be unintended side +effects, or you may get an error message:


R +

+rm(list <- ls())


Error in rm(list <- ls()): ... must contain names or character strings
+ +

Tip: Warnings vs. Errors +


Pay attention when R does something unexpected! Errors, like above, +are thrown when R cannot proceed with a calculation. Warnings on the +other hand usually mean that the function has run, but it probably +hasn’t worked as expected.


In both cases, the message that R prints out usually gives you clues about how to fix the problem.
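The difference can be seen in a small sketch that uses tryCatch to capture each kind of message instead of printing it. The inputs are arbitrary examples chosen to trigger one warning and one error.

```r
# as.numeric() on non-numeric text signals a warning (it would return NA);
# tryCatch() intercepts the warning and keeps its message
warned <- tryCatch(
  as.numeric("not a number"),
  warning = function(w) conditionMessage(w)
)

# Adding a number to a string is an error: R cannot produce any result;
# tryCatch() intercepts the error and keeps its message
failed <- tryCatch(
  1 + "a",
  error = function(e) conditionMessage(e)
)

print(warned)
print(failed)
```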


R Packages +


It is possible to add functions to R by writing a package, or by obtaining a package written by someone else. As of this writing, there are over 10,000 packages available on CRAN (the Comprehensive R Archive Network). R and RStudio have functionality for managing packages:

  • You can see what packages are installed by typing installed.packages()
  • You can install packages by typing install.packages("packagename"), where packagename is the package name, in quotes.
  • You can update installed packages by typing update.packages()
  • You can remove a package with remove.packages("packagename")
  • You can make a package available for use with library(packagename)

Packages can also be viewed, loaded, and detached in the Packages tab +of the lower right panel in RStudio. Clicking on this tab will display +all of the installed packages with a checkbox next to them. If the box +next to a package name is checked, the package is loaded and if it is +empty, the package is not loaded. Click an empty box to load that +package and click a checked box to detach that package.


Packages can be installed and updated from the Packages tab with the Install and Update buttons at the top of the tab.
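As a minimal sketch, the functions above can also be used to check for a package before relying on it. Here requireNamespace is used because it returns TRUE or FALSE rather than stopping with an error; ggplot2 is just an example package name.

```r
# "stats" ships with R, so it should always appear in the list
installed <- rownames(installed.packages())
has_stats <- "stats" %in% installed

# requireNamespace() reports whether a package can be loaded
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)

if (!has_ggplot2) {
  message("ggplot2 is not installed; install.packages(\"ggplot2\") would fetch it")
}
```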

+ +

Challenge 2 +


What will be the value of each variable after each statement in the +following program?


R +

+mass <- 47.5
+age <- 122
+mass <- mass * 2.3
+age <- age - 20
+ +

R +

+mass <- 47.5

This will give a value of 47.5 for the variable mass


R +

+age <- 122

This will give a value of 122 for the variable age


R +

+mass <- mass * 2.3

This will multiply the existing value of 47.5 by 2.3 to give a new +value of 109.25 to the variable mass.


R +

+age <- age - 20

This will subtract 20 from the existing value of 122 to give a new +value of 102 to the variable age.

+ +

Challenge 3 +


Run the code from the previous challenge, and write a command to +compare mass to age. Is mass larger than age?

+ +

One way of answering this question in R is to use the > operator to set up the following comparison:


R +

+mass > age


[1] TRUE

This should yield a boolean value of TRUE since 109.25 is greater +than 102.

+ +

Challenge 4 +


Clean up your working environment by deleting the mass and age +variables.

+ +

We can use the rm command to accomplish this task:


R +

+rm(age, mass)
+ +

Challenge 5 +


Install the following packages: ggplot2, +plyr, gapminder

+ +

We can use the install.packages() command to install the +required packages.


R +


An alternative solution, which installs multiple packages with a single install.packages() command, is:


R +

+install.packages(c("ggplot2", "plyr", "gapminder"))
+ +

Keypoints +

  • Use RStudio to write and run R programs.
  • R has the usual arithmetic operators and mathematical functions.
  • Use <- to assign values to variables.
  • Use ls() to list the variables in a program.
  • Use rm() to delete objects in a program.
  • Use install.packages() to install packages (libraries).

Content from Project Management With RStudio


Last updated on 2023-10-26

+ +




  • How can I manage my projects in R?
  • +


  • Create self-contained projects in RStudio
  • +

Introduction +


The scientific process is naturally incremental, and many projects +start life as random notes, some code, then a manuscript, and eventually +everything is a bit mixed together.

+ +

Most people tend to organize their projects like this:

Screenshot of file manager demonstrating bad project organisation

There are many reasons why we should ALWAYS avoid this:

  1. It is really hard to tell which version of your data is the original and which is the modified;
  2. It gets really messy because it mixes files with various extensions together;
  3. It probably takes you a lot of time to actually find things, and relate the correct figures to the exact code that has been used to generate them.

A good project layout will ultimately make your life easier:

  • It will help ensure the integrity of your data;
  • It makes it simpler to share your code with someone else (a lab-mate, collaborator, or supervisor);
  • It allows you to easily upload your code with your manuscript submission;
  • It makes it easier to pick the project back up after a break.

A possible solution +


Fortunately, there are tools and packages which can help you manage +your work effectively.


One of the most powerful and useful aspects of RStudio is its project +management functionality. We’ll be using this today to create a +self-contained, reproducible project.

+ +

Challenge 1: Creating a self-contained +project +


We’re going to create a new project in RStudio:

  1. Click the “File” menu button, then “New Project”.
  2. Click “New Directory”.
  3. Click “New Project”.
  4. Type in the name of the directory to store your project, e.g. “my_project”.
  5. If available, select the checkbox for “Create a git repository.”
  6. Click the “Create Project” button.

The simplest way to open an RStudio project once it has been created +is to click through your file system to get to the directory where it +was saved and double click on the .Rproj file. This will +open RStudio and start your R session in the same directory as the +.Rproj file. All your data, plots and scripts will now be +relative to the project directory. RStudio projects have the added +benefit of allowing you to open multiple projects at the same time each +open to its own project directory. This allows you to keep multiple +projects open without them interfering with each other.

+ +

Challenge 2: Opening an RStudio project +through the file system +

  1. Exit RStudio.
  2. Navigate to the directory where you created a project in Challenge 1.
  3. Double click on the .Rproj file in that directory.

Best practices for project organization +


Although there is no “best” way to lay out a project, there are some +general principles to adhere to that will make project management +easier:


Treat data as read only +


This is probably the most important goal of setting up a project. Data are typically time-consuming and/or expensive to collect. Working with them interactively (e.g., in Excel), where they can be modified, means you are never sure where the data came from or how they have been modified since collection. It is therefore a good idea to treat your data as “read-only”.


Data Cleaning +


In many cases your data will be “dirty”: it will need significant preprocessing to get into a format R (or any other programming language) will find useful. This task is sometimes called “data munging”. Storing the cleaning scripts in a separate folder, and creating a second “read-only” data folder to hold the “cleaned” data sets, can prevent confusion between the two sets.


Treat generated output as disposable +


Anything generated by your scripts should be treated as disposable: +it should all be able to be regenerated from your scripts.


There are lots of different ways to manage this output. Having an output folder with a separate sub-directory for each analysis makes things easier later, since many analyses are exploratory and don’t end up being used in the final project, and some analyses get shared between projects.

+ +

Tip: Good Enough Practices for Scientific +Computing +


Good +Enough Practices for Scientific Computing gives the following +recommendations for project organization:

  1. Put each project in its own directory, which is named after the project.
  2. Put text documents associated with the project in the doc directory.
  3. Put raw data and metadata in the data directory, and files generated during cleanup and analysis in a results directory.
  4. Put source for the project’s scripts and programs in the src directory, and programs brought in from elsewhere or compiled locally in the bin directory.
  5. Name all files to reflect their content or function.
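The layout recommended above can be created from R itself. The sketch below builds it inside a temporary directory; the project name `my_project` and the use of tempdir() are assumptions made so the example does not touch your real files.

```r
# Hypothetical project root inside a temporary directory
project_dir <- file.path(tempdir(), "my_project")

# Sub-directories named in the recommendations above
sub_dirs <- c("doc", "data", "results", "src", "bin")
for (d in sub_dirs) {
  dir.create(file.path(project_dir, d), recursive = TRUE, showWarnings = FALSE)
}

print(list.files(project_dir))
```

In a real project you would replace `project_dir` with the directory that holds your .Rproj file.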

Separate function definition and application +


One of the more effective ways to work with R is to start by writing +the code you want to run directly in a .R script, and then running the +selected lines (either using the keyboard shortcuts in RStudio or +clicking the “Run” button) in the interactive R console.


When your project is in its early stages, the initial .R script file usually contains many lines of directly executed code. As it matures, reusable chunks get pulled into their own functions. It’s a good idea to separate your code into two folders: one to store useful functions that you’ll reuse across analyses and projects, and one to store the analysis scripts.
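One possible way to keep the two apart is to put function definitions in their own file and load them with source() at the top of an analysis script. In this sketch the file name `functions.R` and the function `fahr_to_celsius` are hypothetical, and the file is written to a temporary directory only so that the example is self-contained.

```r
# Hypothetical file holding reusable function definitions
functions_file <- file.path(tempdir(), "functions.R")

# Write a small reusable function to its own file
writeLines(
  "fahr_to_celsius <- function(temp) (temp - 32) * 5 / 9",
  functions_file
)

# An analysis script would start by loading the definitions ...
source(functions_file)

# ... and then apply them to data
boiling <- fahr_to_celsius(212)
print(boiling)
```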


Save the data in the data directory +


Now that we have a good directory structure, we will save the data file in the data/ directory.

+ +

Challenge 3 +


Download the gapminder data from here.

  1. Download the file (right mouse click on the link above -> “Save link as” / “Save file as”, or click on the link and after the page loads, press Ctrl+S or choose File -> “Save page as”)
  2. Make sure it’s saved under the name gapminder_data.csv
  3. Save the file in the data/ folder within your project.

We will load and inspect these data later.

+ +

Challenge 4 +


It is useful to get some general idea about the dataset, directly +from the command line, before loading it into R. Understanding the +dataset better will come in handy when making decisions on how to load +it in R. Use the command-line shell to answer the following +questions:

  1. What is the size of the file?
  2. How many rows of data does it contain?
  3. What kinds of values are stored in this file?
+ +

By running these commands in the shell:


SH +

ls -lh data/gapminder_data.csv


-rw-r--r-- 1 runner docker 80K Oct 26 09:54 data/gapminder_data.csv

The file size is 80K.


SH +

wc -l data/gapminder_data.csv


1705 data/gapminder_data.csv

There are 1705 lines. The data looks like:


SH +

head data/gapminder_data.csv


+ +

Tip: command line in RStudio +


The Terminal tab in the console pane provides a convenient place within RStudio to interact directly with the command line.


Working directory +


Knowing R’s current working directory is important because when you +need to access other files (for example, to import a data file), R will +look for them relative to the current working directory.


Each time you create a new RStudio Project, it will create a new +directory for that project. When you open an existing +.Rproj file, it will open that project and set R’s working +directory to the folder that file is in.

+ +

Challenge 5 +


You can check the current working directory with the +getwd() command, or by using the menus in RStudio.

  1. In the console, type getwd() (“wd” is short for “working directory”) and hit Enter.
  2. In the Files pane, double click on the data folder to open it (or navigate to any other folder you wish). To get the Files pane back to the current working directory, click “More” and then select “Go To Working Directory”.

You can change the working directory with setwd(), or by +using RStudio menus.

  1. In the console, type setwd("data") and hit Enter. Type getwd() and hit Enter to see the new working directory.
  2. In the menus at the top of the RStudio window, click the “Session” menu button, and then select “Set Working Directory” and then “Choose Directory”. Next, in the file browser window that opens, navigate back to the project directory, and click “Open”. Note that a setwd command will automatically appear in the console.
+ +

Tip: File does not exist errors +


When you’re attempting to reference a file in your R code and you’re +getting errors saying the file doesn’t exist, it’s a good idea to check +your working directory. You need to either provide an absolute path to +the file, or you need to make sure the file is saved in the working +directory (or a subfolder of the working directory) and provide a +relative path.
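A quick way to investigate such an error is to ask R where it is looking and whether the file is there. The sketch below assumes the gapminder file name used earlier in this episode; it only reports what it finds and does not read the file.

```r
# Relative paths are resolved against the working directory
wd <- getwd()
relative_path <- file.path("data", "gapminder_data.csv")
absolute_path <- file.path(wd, relative_path)

# file.exists() returns TRUE or FALSE rather than raising an error
if (file.exists(relative_path)) {
  message("Found: ", absolute_path)
} else {
  message("Not found relative to ", wd)
}
```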


Version Control +


It is important to use version control with projects. Go here +for a good lesson which describes using Git with RStudio.

+ +

Keypoints +

  • Use RStudio to create and manage projects with consistent layout.
  • Treat raw data as read-only.
  • Treat generated output as disposable.
  • Separate function definition and application.

Content from Seeking Help


Last updated on 2023-10-26

+ +




  • How can I get help in R?
  • +


  • To be able to read R help files for functions and special operators.
  • To be able to use CRAN task views to identify packages to solve a problem.
  • To be able to seek help from your peers.

Reading Help Files +


R, and every package, provide help files for functions. The general syntax to look up help for a function, “function_name”, that is in a package loaded into your namespace (your interactive R session) is:


R +


For example, take a look at the help file for write.table(); we will be using a similar function in an upcoming episode.


R +


This will load up a help page in RStudio (or as plain text in R +itself).
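As a small sketch of what happens behind the scenes: help() returns an object, and it is printing that object that opens the page. The variable name `page` is only for illustration.

```r
# help() locates the documentation and returns an object describing it
page <- help("write.table")
print(class(page))

# At the console you would normally just type one of these
# and let R display the page:
#   ?write.table
#   help(write.table)
```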


Each help page is broken down into sections:

  • Description: An extended description of what the function does.
  • Usage: The arguments of the function and their default values (which can be changed).
  • Arguments: An explanation of the data each argument is expecting.
  • Details: Any important details to be aware of.
  • Value: The data the function returns.
  • See Also: Any related functions you might find useful.
  • Examples: Some examples for how to use the function.

Different functions might have different sections, but these are the +main ones you should be aware of.


Notice how related functions might call for the same help file:


R +


This is because these functions have very similar applicability and +often share the same arguments as inputs to the function, so package +authors often choose to document them together in a single help +file.

+ +

Tip: Running Examples +


From within the function help page, you can highlight code in the Examples and hit Ctrl+Return to run it in the RStudio console. This gives you a quick way to get a feel for how a function works.

+ +

Tip: Reading Help Files +


One of the most daunting aspects of R is the large number of functions available. It would be prohibitive, if not impossible, to remember the correct usage for every function you use. Luckily, using the help files means you don’t have to remember that!


Special Operators +


To seek help on special operators, use quotes or backticks:


R +
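A minimal sketch of the quoting rule: an operator name has to be passed as a string (or in backticks) so that R does not try to parse it as code. The variable names are made up for illustration.

```r
# Quoted operator names are looked up like any other help topic
assign_help <- help("<-")
plus_help <- help("+")

# At the console, the short forms are:
#   ?"<-"
#   ?`<-`
```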


Getting Help with Packages +


Many packages come with “vignettes”: tutorials and extended example +documentation. Without any arguments, vignette() will list +all vignettes for all installed packages; +vignette(package="package-name") will list all available +vignettes for package-name, and +vignette("vignette-name") will open the specified +vignette.


If a package doesn’t have any vignettes, you can usually find help by +typing help("package-name").


RStudio also has a set of excellent cheatsheets for +many packages.


When You Remember Part of the Function Name +


If you’re not sure what package a function is in or how it’s +specifically spelled, you can do a fuzzy search:


R +


A fuzzy search is when you search for an approximate string match. +For example, you may remember that the function to set your working +directory includes “set” in its name. You can do a fuzzy search to help +you identify the function:


R +
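As a sketch, the search for “set” can be done in two ways. The ?? operator (short for help.search) searches the installed documentation, while apropos, a related base function that this lesson has not introduced, matches the names of objects currently available in your session.

```r
# apropos() matches object names against a regular expression;
# "^set" keeps only names that start with "set"
matches <- apropos("^set")
print(head(matches))

# At the console, a search of the documentation would be:
#   ??set
#   help.search("set")
```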


When You Have No Idea Where to Begin +


If you don’t know what function or package you need to use, CRAN Task Views is a specially maintained list of packages grouped into fields. This can be a good starting point.


When Your Code Doesn’t Work: Seeking Help from Your Peers +


If you’re having trouble using a function, 9 times out of 10 your question has already been answered on Stack Overflow. You can search using the [r] tag. Please make sure to see their page on how to ask a good question.


If you can’t find the answer, there are a few useful functions to +help you ask your peers:


R +


The dput() function will dump the data you’re working with into a format that can be copied and pasted by others into their own R session.
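A minimal sketch of that round trip, using a small made-up data frame `df` in place of your own data: dput prints R code, and evaluating that code rebuilds the object.

```r
# Hypothetical small data frame standing in for your own data
df <- data.frame(id = 1:3, label = c("a", "b", "c"))

# dput() prints R code that recreates the object
dput(df)

# Round trip: capture that code as text, then evaluate it again,
# as a colleague would after pasting it into their session
code <- paste(capture.output(dput(df)), collapse = "\n")
rebuilt <- eval(parse(text = code))
```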


R +



R version 4.3.1 (2023-06-16)
+Platform: x86_64-pc-linux-gnu (64-bit)
+Running under: Ubuntu 22.04.3 LTS
+Matrix products: default
+BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
+LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
+ [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
+ [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
+time zone: UTC
+tzcode source: system (glibc)
+attached base packages:
+[1] stats     graphics  grDevices utils     datasets  methods   base     
+loaded via a namespace (and not attached):
+[1] compiler_4.3.1    tools_4.3.1       rstudioapi_0.15.0 yaml_2.3.7       
+[5] knitr_1.43        xfun_0.40         renv_1.0.3        evaluate_0.21    

The sessionInfo() function will print out your current version of R, as well as any packages you have loaded. This can be useful for others to help reproduce and debug your issue.

+ +

Challenge 1 +


Look at the help page for the c function. What kind of +vector do you expect will be created if you evaluate the following:


R +

+c(1, 2, 3)
+c('d', 'e', 'f')
+c(1, 2, 'f')
+ +

The c() function creates a vector, in which all elements +are of the same type. In the first case, the elements are numeric, in +the second, they are characters, and in the third they are also +characters: the numeric values are “coerced” to be characters.

+ +

Challenge 2 +


Look at the help for the paste function. You will need +to use it later. What’s the difference between the sep and +collapse arguments?

+ +

To look at the help for the paste() function, use:


R +


The difference between sep and collapse is +a little tricky. The paste function accepts any number of +arguments, each of which can be a vector of any length. The +sep argument specifies the string used between concatenated +terms — by default, a space. The result is a vector as long as the +longest argument supplied to paste. In contrast, +collapse specifies that after concatenation the elements +are collapsed together using the given separator, the result +being a single string.


It is important to call the arguments explicitly by typing out the argument name, e.g. sep = ",", so the function understands to use the “,” as a separator and not a term to concatenate. For example:


R +

+paste(c("a","b"), "c")


[1] "a c" "b c"

R +

+paste(c("a","b"), "c", ",")


[1] "a c ," "b c ,"

R +

+paste(c("a","b"), "c", sep = ",")


[1] "a,c" "b,c"

R +

+paste(c("a","b"), "c", collapse = "|")


[1] "a c|b c"

R +

+paste(c("a","b"), "c", sep = ",", collapse = "|")


[1] "a,c|b,c"

(For more information, scroll to the bottom of the +?paste help page and look at the examples, or try +example('paste').)

+ +

Challenge 3 +


Use help to find a function (and its associated parameters) that you +could use to load data from a tabular file in which columns are +delimited with “\t” (tab) and the decimal point is a “.” (period). This +check for decimal separator is important, especially if you are working +with international colleagues, because different countries have +different conventions for the decimal point (i.e. comma vs period). +Hint: use ??"read table" to look up functions related to +reading in tabular data.

+ +

The standard R function for reading tab-delimited files with a period +decimal separator is read.delim(). You can also do this with +read.table(file, sep="\t") (the period is the +default decimal separator for read.table()), +although you may have to change the comment.char argument +as well if your data file contains hash (#) characters.
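The two approaches can be compared on a tiny made-up file. This is only a sketch: the file contents and the temporary file are invented for the example.

```r
# Hypothetical tab-delimited file written to a temporary location
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c("name\tvalue", "a\t1.5", "b\t2.25"), tsv_file)

# read.delim() assumes tabs, a header row, and "." as the decimal mark
from_delim <- read.delim(tsv_file)

# The same result with read.table(), spelling out the settings
from_table <- read.table(tsv_file, sep = "\t", header = TRUE, dec = ".")

print(from_delim)
```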


Other Resources +

+ +
+ +

Keypoints +

  • Use help() to get online help in R.
  • +

Content from Data Structures


Last updated on 2023-10-26

+ +




  • How can I read data in R?
  • +
  • What are the basic data types in R?
  • +
  • How do I represent categorical information in R?
  • +


  • To be able to identify the 5 main data types.
  • To begin exploring data frames, and understand how they are related to vectors and lists.
  • To be able to ask questions from R about the type, class, and structure of an object.
  • To understand the information stored in the attributes “names”, “class”, and “dim”.

One of R’s most powerful features is its ability to deal with tabular +data - such as you may already have in a spreadsheet or a CSV file. +Let’s start by making a toy dataset in your data/ +directory, called feline-data.csv:


R +

+cats <- data.frame(coat = c("calico", "black", "tabby"),
+                    weight = c(2.1, 5.0, 3.2),
+                    likes_string = c(1, 0, 1))

We can now save cats as a CSV file. It is good practice +to call the argument names explicitly so the function knows what default +values you are changing. Here we are setting +row.names = FALSE. Recall you can use +?write.csv to pull up the help file to check out the +argument names and their default values.


R +

+write.csv(x = cats, file = "data/feline-data.csv", row.names = FALSE)

The contents of the new file, feline-data.csv:


R +

+ +

Tip: Editing Text files in R +


Alternatively, you can create data/feline-data.csv using +a text editor (Nano), or within RStudio with the File -> New +File -> Text File menu item.


We can load this into R via the following:


R +

+cats <- read.csv(file = "data/feline-data.csv")


    coat weight likes_string
+1 calico    2.1            1
+2  black    5.0            0
+3  tabby    3.2            1

The read.table function is used for reading in tabular data stored in a text file where the columns of data are separated by punctuation characters, as in CSV files (csv = comma-separated values). Tabs and commas are the most common punctuation characters used to separate or delimit data points in such files. For convenience R provides two other versions of read.table: read.csv for files where the data are separated with commas and read.delim for files where the data are separated with tabs. Of these three functions read.csv is the most commonly used. If needed, it is possible to override the default delimiting punctuation marks for both read.csv and read.delim.
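As a sketch of overriding the delimiter, the example below reads semicolon-separated text with read.csv by setting its sep argument. The data are a made-up variant of the cats table, supplied inline through the text argument instead of a file.

```r
# Semicolon-separated text, supplied inline through the `text` argument
raw <- "coat;weight\ncalico;2.1\nblack;5.0"

# read.csv() defaults to sep = ",", but the separator can be overridden
cats_semicolon <- read.csv(text = raw, sep = ";")
str(cats_semicolon)
```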

+ +

Check your data for factors +


In recent versions, the default way R handles textual data has changed. Text data used to be converted automatically into a format called “factors”. But there is a simpler format called “character”. We will hear about factors later, and what to use them for. For now, remember that in most cases, they are not needed and only complicate your life, which is why newer R versions read in text as “character”. Check now if your version of R has automatically created factors and convert them to “character” format:

  1. Check the data types of your input by typing str(cats)
  2. In the output, look at the type abbreviations after the colons: if you see only “num”, “int” and “chr”, you can continue with the lesson and skip this box. If you find “Factor”, continue to step 3.
  3. Prevent R from automatically creating “factor” data. That can be done by the following code: options(stringsAsFactors = FALSE). Then, re-read the cats table for the change to take effect.
  4. You must set this option every time you restart R. To not forget this, include it in your analysis script before you read in any data, for example in one of the first lines.
  5. From R version 4.0.0 onward, text data is no longer converted to factors by default. So you can install this or a newer version to avoid this problem. If you are working on an institute or company computer, ask your administrator to do it.

We can begin exploring our dataset right away, pulling out columns by +specifying them using the $ operator:


R +



[1] 2.1 5.0 3.2

R +



[1] "calico" "black"  "tabby" 

We can do other operations on the columns:


R +

+## Say we discovered that the scale weighs two kg light:
+cats$weight + 2


[1] 4.1 7.0 5.2

R +

+paste("My cat is", cats$coat)


[1] "My cat is calico" "My cat is black"  "My cat is tabby" 

But what about


R +

+cats$weight + cats$coat


Error in cats$weight + cats$coat: non-numeric argument to binary operator

Understanding what happened here is key to successfully analyzing +data in R.


Data Types +


If you guessed that the last command will return an error because +2.1 plus "black" is nonsense, you’re right - +and you already have some intuition for an important concept in +programming called data types. We can ask what type of data +something is:


R +



[1] "double"

There are 5 main types: double, integer, +complex, logical and character. +For historic reasons, double is also called +numeric.


R +



[1] "double"

R +

+typeof(1L) # The L suffix forces the number to be an integer, since by default R stores numbers as doubles


[1] "integer"

R +



[1] "complex"

R +



[1] "logical"

R +



[1] "character"

No matter how complicated our analyses become, all data in R is +interpreted as one of these basic data types. This strictness has some +really important consequences.
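The five types can be checked in one go with typeof. This is a small sketch; the example values are arbitrary.

```r
# One example value per basic type (the values are arbitrary)
examples <- list(3.14, 1L, 1 + 1i, TRUE, "banana")

# Ask for the type of each value in turn
types <- vapply(examples, typeof, character(1))
print(types)
```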


A user has added details of another cat. This information is in the +file data/feline-data_v2.csv.


R +


R +

+tabby,2.3 or 2.4,1

Load the new cats data like before, and check what type of data we +find in the weight column:


R +

+cats <- read.csv(file="data/feline-data_v2.csv")
+typeof(cats$weight)


[1] "character"

Oh no, our weights aren’t the double type anymore! If we try to do +the same math we did on them before, we run into trouble:


R +

+cats$weight + 2


Error in cats$weight + 2: non-numeric argument to binary operator

What happened? The cats data we are working with is something called a data frame. Data frames are one of the most common and versatile types of data structures we will work with in R. A given column in a data frame cannot be composed of different data types. In this case, R cannot read everything in the data frame column weight as a double, so the entire column’s data type changes to something that is suitable for everything in the column.


When R reads a csv file, it reads it in as a data frame. Thus, when we loaded the cats csv file, it was stored as a data frame. We can recognize data frames by the first row that is written by the str() function:


R +



'data.frame':	4 obs. of  3 variables:
+ $ coat        : chr  "calico" "black" "tabby" "tabby"
+ $ weight      : chr  "2.1" "5" "3.2" "2.3 or 2.4"
+ $ likes_string: int  1 0 1 1

Data frames are composed of rows and columns, where each +column has the same number of rows. Different columns in a data frame +can be made up of different data types (this is what makes them so +versatile), but everything in a given column needs to be the same type +(e.g., vector, factor, or list).
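The effect of a single unparseable entry on a whole column can be reproduced without a file. In this sketch the CSV text is reconstructed from the str() output shown above, and `cats_v2` is a made-up name so that your cats object is not overwritten.

```r
# The fourth weight cannot be parsed as a number,
# so the whole weight column is read as character
raw <- paste(
  "coat,weight,likes_string",
  "calico,2.1,1",
  "black,5.0,0",
  "tabby,3.2,1",
  "tabby,2.3 or 2.4,1",
  sep = "\n"
)
cats_v2 <- read.csv(text = raw, stringsAsFactors = FALSE)

# Each column has exactly one type
column_types <- vapply(cats_v2, typeof, character(1))
print(column_types)
```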


Let’s explore more about different data structures and how they +behave. For now, let’s remove that extra line from our cats data and +reload it, while we investigate this behavior further:




And back in RStudio:


R +

+cats <- read.csv(file="data/feline-data.csv")

Vectors and Type Coercion +


To better understand this behavior, let’s meet another of the data +structures: the vector.


R +

+my_vector <- vector(length = 3)



A vector in R is essentially an ordered list of things, with the +special condition that everything in the vector must be the same +basic data type. If you don’t choose the datatype, it’ll default to +logical; or, you can declare an empty vector of whatever +type you like.


R +

+another_vector <- vector(mode='character', length=3)
+another_vector


[1] "" "" ""

You can check if something is a vector:


R +



 chr [1:3] "" "" ""

The somewhat cryptic output from this command indicates the basic +data type found in this vector - in this case chr, +character; an indication of the number of things in the vector - +actually, the indexes of the vector, in this case [1:3]; +and a few examples of what’s actually in the vector - in this case empty +character strings. If we similarly do


R +



 num [1:3] 2.1 5 3.2

we see that cats$weight is a vector, too - the +columns of data we load into R data.frames are all vectors, and +that’s the root of why R forces everything in a column to be the same +basic data type.

+ +

Discussion 1 +


Why is R so opinionated about what we put in our columns of data? How +does this help us?

+ +

By keeping everything in a column the same, we allow ourselves to +make simple assumptions about our data; if you can interpret one entry +in the column as a number, then you can interpret all of them +as numbers, so we don’t have to check every time. This consistency is +what people mean when they talk about clean data; in the long +run, strict consistency goes a long way to making our lives easier in +R.


Coercion by combining vectors +


You can also make vectors with explicit contents with the combine +function:


R +

+combine_vector <- c(2,6,3)
+combine_vector


[1] 2 6 3

Given what we’ve learned so far, what do you think the following will +produce?


R +

+quiz_vector <- c(2,6,'3')

This is something called type coercion, and it is the source +of many surprises and the reason why we need to be aware of the basic +data types and how R will interpret them. When R encounters a mix of +types (here double and character) to be combined into a single vector, +it will force them all to be the same type. Consider:


R +

+coercion_vector <- c('a', TRUE)
+coercion_vector


[1] "a"    "TRUE"

R +

+another_coercion_vector <- c(0, TRUE)
+another_coercion_vector


[1] 0 1

The type hierarchy +


The coercion rules go: logical -> +integer -> double (“numeric”) +-> complex -> character, where -> can +be read as are transformed into. For example, combining +logical and character transforms the result to +character:


R +

+c('a', TRUE)


[1] "a"    "TRUE"

A quick way to recognize character vectors is by the +quotes that enclose them when they are printed.
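Each step of the chain can be checked by combining two neighbouring types and asking for the type of the result. This is a small sketch with arbitrary values.

```r
# Each pair mixes two neighbouring types in the chain;
# the result takes the type further to the right
steps <- c(
  typeof(c(TRUE, 1L)),   # logical with integer
  typeof(c(1L, 2.5)),    # integer with double
  typeof(c(2.5, 1i)),    # double with complex
  typeof(c(1i, "a"))     # complex with character
)
print(steps)
```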


You can try to force coercion against this flow using the +as. functions:


R +

+character_vector_example <- c('0','2','4')
+character_vector_example


[1] "0" "2" "4"

R +

+character_coerced_to_double <- as.double(character_vector_example)
+character_coerced_to_double


[1] 0 2 4

R +

+double_coerced_to_logical <- as.logical(character_coerced_to_double)



As you can see, some surprising things can happen when R forces one +basic data type into another! Nitty-gritty of type coercion aside, the +point is: if your data doesn’t look like what you thought it was going +to look like, type coercion may well be to blame; make sure everything +is the same type in your vectors and your columns of data.frames, or you +will get nasty surprises!
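Two of these surprises, sketched with small made-up inputs: numbers coerced to logical, and text that cannot be read as a number. The warning in the second case is suppressed here only to keep the output short.

```r
# Numbers to logical: zero becomes FALSE, any other number becomes TRUE
flags <- as.logical(c(0, 2, 4))
print(flags)

# Text that is not a number becomes NA (R also signals a warning)
weights <- suppressWarnings(as.numeric(c("2.1", "2.3 or 2.4")))
print(weights)
```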


But coercion can also be very useful! For example, in our +cats data likes_string is numeric, but we know +that the 1s and 0s actually represent TRUE and +FALSE (a common way of representing them). We should use +the logical datatype here, which has two states: +TRUE or FALSE, which is exactly what our data +represents. We can ‘coerce’ this column to be logical by +using the as.logical function:


R +

+cats$likes_string

[1] 1 0 1

R +

+cats$likes_string <- as.logical(cats$likes_string)


+ +

Challenge 1 +


An important part of every data analysis is cleaning the input data. +If you know that the input data is all of the same format, +(e.g. numbers), your analysis is much easier! Clean the cat data set +from the chapter about type coercion.


Copy the code template +


Create a new script in RStudio and copy and paste the following code. +Then move on to the tasks below, which help you to fill in the gaps +(______).

# Read data
+cats <- read.csv("data/feline-data_v2.csv")
+# 1. Print the data
+# 2. Show an overview of the table with all data types
+# 3. The "weight" column has the incorrect data type __________.
+#    The correct data type is: ____________.
+# 4. Correct the 4th weight data point with the mean of the two given values
+cats$weight[4] <- 2.35
+#    print the data again to see the effect
+# 5. Convert the weight to the right data type
+cats$weight <- ______________(cats$weight)
+#    Calculate the mean to test yourself
+# If you see the correct mean value (and not NA), you did the exercise
+# correctly!

Instructions for the tasks +

+ +

Execute the first statement (read.csv(...)). Then print +the data to the console

+ +

Show the content of any variable by typing its name.


Solution to Challenge 1.1 +


Two correct solutions:


R +

+cats
+print(cats)

+ +

2. Overview of the data types +


The data type of your data is as important as the data itself. Use a +function we saw earlier to print out the data types of all columns of +the cats table.

+ +

In the chapter “Data types” we saw two functions that can show data +types. One printed just a single word, the data type name. The other +printed a short form of the data type, and the first few values. We need +the second here.

+ +

Challenge 1 (continued) +


Solution to Challenge 1.2


R +

+str(cats)


3. Which data type do we need? +


The shown data type is not the right one for this data (weight of a +cat). Which data type do we need?

  • Why did the read.csv() function not choose the correct +data type?
  • +
  • Fill in the gap in the comment with the correct data type for cat +weight!
  • +
+ +

Scroll up to the section about the type +hierarchy to review the available data types

+ +
  • Weight is expressed on a continuous scale (real numbers). The R data type for this is “double” (also known as “numeric”).
  • The fourth row has the value “2.3 or 2.4”. That is not a single number but two numbers and an English word. Therefore, the “character” data type is chosen. The whole column is now text, because all values in the same column have to be the same data type.
+ +

4. Correct the problematic value +


The code to assign a new weight value to the problematic fourth row +is given. Think first and then execute it: What will be the data type +after assigning a number like in this example? You can check the data +type after executing to see if you were right.

+ +

Revisit the hierarchy of data types when two different data types are +combined.

+ +

Challenge 1 (continued) +


Solution to challenge 1.4


The data type of the column “weight” is “character”. The assigned +data type is “double”. Combining two data types yields the data type +that is higher in the following hierarchy:

logical < integer < double < complex < character

Therefore, the column is still of type character! We need to manually convert it to “double”.
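This can be reproduced without the CSV file. The sketch below uses a stand-alone character vector, `weights`, as an assumed stand-in for the `weight` column:

```r
# Stand-in for the weight column as read from the file (all text)
weights <- c("2.1", "5.0", "3.2", "2.3 or 2.4")

weights[4] <- 2.35        # the number is coerced to the string "2.35"
typeof(weights)           # still "character"

# Only an explicit conversion gives us numbers we can calculate with
mean(as.numeric(weights)) # 3.1625
```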


5. Convert the column “weight” to the correct data type +


Cat weights are numbers. But the column does not have this data type yet. Coerce the column to floating point numbers.

+ +

The functions to convert data types start with as.. You +can look for the function further up in the manuscript or use the +RStudio auto-complete function: Type “as.” and then press +the TAB key.

+ +

Challenge 1 (continued) +


Solution to Challenge 1.5


There are two functions that are synonymous for historic reasons:

cats$weight <- as.double(cats$weight)
+cats$weight <- as.numeric(cats$weight)

Some basic vector functions +


The combine function, c(), will also append things to an +existing vector:


R +

+ab_vector <- c('a', 'b')


[1] "a" "b"

R +

+combine_example <- c(ab_vector, 'SWC')


[1] "a"   "b"   "SWC"

You can also make series of numbers:


R +

+mySeries <- 1:10


 [1]  1  2  3  4  5  6  7  8  9 10

R +

+seq(10)

 [1]  1  2  3  4  5  6  7  8  9 10

R +

+seq(1,10, by=0.1)


 [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4
+[16]  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9
+[31]  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4
+[46]  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
+[61]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4
+[76]  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9
+[91] 10.0

We can ask a few questions about vectors:


R +

+sequence_example <- 20:25
+head(sequence_example, n=2)


[1] 20 21

R +

+tail(sequence_example, n=4)


[1] 22 23 24 25

R +

+length(sequence_example)


[1] 6

R +

+typeof(sequence_example)

[1] "integer"

We can get individual elements of a vector by using the bracket +notation:


R +

+first_element <- sequence_example[1]


[1] 20

To change a single element, use the bracket on the other side of the +arrow:


R +

+sequence_example[1] <- 30


[1] 30 21 22 23 24 25
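Replacing an element can also change the type of the whole vector, following the coercion rules from earlier. A short sketch, using the illustrative names `counts` and `integer_counts`:

```r
counts <- 20:25
typeof(counts)          # "integer"

counts[1] <- 30         # 30 is a double, so the whole vector is promoted
typeof(counts)          # "double"

# Writing the value with the L suffix keeps the vector an integer vector
integer_counts <- 20:25
integer_counts[1] <- 30L
typeof(integer_counts)  # "integer"
```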
+ +

Challenge 2 +


Start by making a vector with the numbers 1 through 26. Then, +multiply the vector by 2.

+ +

R +

+x <- 1:26
+x <- x * 2

Lists +


Another data structure you’ll want in your bag of tricks is the +list. A list is simpler in some ways than the other types, +because you can put anything you want in it. Remember everything in +the vector must be of the same basic data type, but a list can have +different data types:


R +

+list_example <- list(1, "a", TRUE, 1+4i)


+[1] 1
+[1] "a"
+[1] TRUE
+[1] 1+4i

When printing the object structure with str(), we see +the data types of all elements:


R +

+str(list_example)

List of 4
+ $ : num 1
+ $ : chr "a"
+ $ : logi TRUE
+ $ : cplx 1+4i

What is the use of lists? They can organize data of different +types. For example, you can organize different tables that +belong together, similar to spreadsheets in Excel. But there are many +other uses, too.


We will see another example that will maybe surprise you in the next +chapter.


To retrieve one of the elements of a list, use the double +bracket:


R +

+list_example[[2]]

[1] "a"

The elements of lists can also have names. Names are given by prepending them to the values, separated by an equals sign:


R +

+another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE )


+[1] "Numbers"
+ [1]  1  2  3  4  5  6  7  8  9 10
+[1] TRUE

This results in a named list. Naming the elements gives us an additional way to access single elements:


R +

+another_list$title

[1] "Numbers"
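The difference between single brackets, double brackets and `$` on a list is easy to mix up. This sketch rebuilds the named list from above and compares the three:

```r
another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE)

class(another_list["title"])     # "list": single brackets keep the list wrapper
class(another_list[["title"]])   # "character": double brackets extract the element

# $name is shorthand for [["name"]]
identical(another_list$title, another_list[["title"]])   # TRUE
```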

Names +


With names, we can give meaning to elements. This is the first time that we have not only the data, but also information that explains it. It is metadata that can be stuck to the object like a label. In R, this is called an attribute. Some attributes enable us to do more with our object, for example, as here, accessing an element by a self-defined name.
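Attributes can be inspected directly. A minimal sketch, using a made-up named vector `scores`:

```r
scores <- c(alice = 9.5, bob = 7.0)   # hypothetical example data

attributes(scores)        # a list with a single entry, $names
attr(scores, "names")     # "alice" "bob"

# Removing the names removes the attribute; the data itself is unchanged
unnamed <- unname(scores)
attributes(unnamed)       # NULL
```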


Accessing vectors and lists by name +


We have already seen how to generate a named list. The way to +generate a named vector is very similar. You have seen this function +before:


R +

+pizza_price <- c( pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50 )

The way to retrieve elements is different, though:


R +

+pizza_price["pizzasubito"]


pizzasubito 
+       5.64 

The approach used for the list does not work:


R +

+pizza_price$pizzafresh

Error in pizza_price$pizzafresh: $ operator is invalid for atomic vectors

It will pay off if you remember this error message, because you will meet it in your own analyses. It means that you have just tried accessing an element as if it were in a list, but it is actually in a vector.
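A sketch of how the error arises and two ways around it. `tryCatch()` is used here only so that the example keeps running after the error:

```r
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)

# Capture the error message instead of stopping
result <- tryCatch(pizza_price$pizzafresh,
                   error = function(e) conditionMessage(e))
result                            # the text of the error message

# Brackets work on named vectors
pizza_price[["pizzafresh"]]       # 6.6

# Converting to a list makes $ available
as.list(pizza_price)$pizzafresh   # 6.6
```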


Accessing and changing names +


If you are only interested in the names, use the names() +function:


R +

+names(pizza_price)

[1] "pizzasubito" "pizzafresh"  "callapizza" 

We have seen how to access and change single elements of a vector. +The same is possible for names:


R +

+names(pizza_price)[3]

[1] "callapizza"

R +

+names(pizza_price)[3] <- "call-a-pizza"


 pizzasubito   pizzafresh call-a-pizza 
+        5.64         6.60         4.50 
+ +

Challenge 3 +

  • What is the data type of the names of pizza_price? You +can find out using the str() or typeof() +functions.
  • +
+ +

You get the names of an object by wrapping the object name inside names(...). Similarly, you get the data type of the names by again wrapping the whole code in typeof(...):


R +

+typeof(names(pizza_price))


[1] "character"

Alternatively, use a new variable if this is easier for you to read:

n <- names(pizza_price)
+typeof(n)
+ +

Challenge 4 +


Instead of just changing some of the names a vector/list already has, you can also set all names of an object by writing code like (replace ALL CAPS text):

names( OBJECT ) <- CHARACTER_VECTOR


Create a vector that gives the number for each letter in the +alphabet!

  1. Generate a vector called letter_no with the sequence of +numbers from 1 to 26!
  2. +
  3. R has a built-in object called LETTERS. It is a 26-character vector, from A to Z. Set the names of the number sequence to these 26 letters.
  4. +
  5. Test yourself by calling letter_no["B"], which should +give you the number 2!
  6. +
+ +
letter_no <- 1:26   # or seq(1,26)
+names(letter_no) <- LETTERS

Data frames +


We saw data frames at the very beginning of this lesson; they represent a table of data. We didn’t go into much detail with our example cat data frame:


R +

+cats

    coat weight likes_string
+1 calico    2.1         TRUE
+2  black    5.0        FALSE
+3  tabby    3.2         TRUE

We can now understand something a bit surprising in our data.frame; +what happens if we run:


R +

+typeof(cats)

[1] "list"

We see that data.frames look like lists ‘under the hood’. Think again +what we heard about what lists can be used for:


Lists organize data of different types


Columns of a data frame are vectors of different types, that are +organized by belonging to the same table.


A data.frame is really a list of vectors. It is a special list in +which all the vectors must have the same length.
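Both halves of that statement can be tested. The sketch below builds a small stand-in for the cat data, named `cats_sketch` to keep it apart from the lesson's `cats` object:

```r
cats_sketch <- data.frame(coat = c("calico", "black", "tabby"),
                          weight = c(2.1, 5.0, 3.2),
                          likes_string = c(TRUE, FALSE, TRUE),
                          stringsAsFactors = FALSE)

is.list(cats_sketch)          # TRUE
sapply(cats_sketch, typeof)   # one type per column
sapply(cats_sketch, length)   # every column has length 3

# Columns whose lengths do not fit together are rejected
msg <- tryCatch(data.frame(a = 1:3, b = 1:2),
                error = function(e) conditionMessage(e))
msg
```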


How is this “special”-ness written into the object, so that R does +not treat it like any other list, but as a table?


R +

+class(cats)

[1] "data.frame"

A class, just like names, is an attribute attached +to the object. It tells us what this object means for humans.


You might wonder: Why do we need another +what-type-of-object-is-this-function? We already have +typeof()? That function tells us how the object is +constructed in the computer. The class is +the meaning of the object for humans. Consequently, +what typeof() returns is fixed in R (mainly the +five data types), whereas the output of class() is +diverse and extendable by R packages.
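A few objects make the distinction concrete. In this sketch the names `today`, `m` and `df` are arbitrary:

```r
today <- as.Date("2023-10-26")
typeof(today)   # "double": stored as a number of days
class(today)    # "Date": interpreted as a calendar date

m <- matrix(0, nrow = 2, ncol = 2)
typeof(m)       # "double"
class(m)        # "matrix" "array" in R 4.0 and later

df <- data.frame(x = 1:2)
typeof(df)      # "list"
class(df)       # "data.frame"
```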


In our cats example, we have a character, a double and a logical variable. As we have seen already, each column of a data.frame is a vector.


R +

+cats$coat


[1] "calico" "black"  "tabby" 

R +

+cats[,1]


[1] "calico" "black"  "tabby" 

R +

+typeof(cats[,1])


[1] "character"

R +

+str(cats[,1])

 chr [1:3] "calico" "black" "tabby"

Each row is an observation of different variables, itself a +data.frame, and thus can be composed of elements of different types.


R +

+cats[1,]


    coat weight likes_string
+1 calico    2.1         TRUE

R +

+typeof(cats[1,])


[1] "list"

R +

+str(cats[1,])

'data.frame':	1 obs. of  3 variables:
+ $ coat        : chr "calico"
+ $ weight      : num 2.1
+ $ likes_string: logi TRUE
+ +

Challenge 5 +


There are several subtly different ways to call variables, +observations and elements from data.frames:

  • cats[1]
  • +
  • cats[[1]]
  • +
  • cats$coat
  • +
  • cats["coat"]
  • +
  • cats[1, 1]
  • +
  • cats[, 1]
  • +
  • cats[1, ]
  • +

Try out these examples and explain what is returned by each one.


Hint: Use the function typeof() to examine what +is returned in each case.

+ +

R +

+cats[1]


    coat
+1 calico
+2  black
+3  tabby

We can think of a data frame as a list of vectors. The single bracket [1] returns the first slice of the list, as another list. In this case it is the first column of the data frame.


R +

+cats[[1]]


[1] "calico" "black"  "tabby" 

The double bracket [[1]] returns the contents of the list item. In this case it is the contents of the first column, a vector of type character.


R +

+cats$coat


[1] "calico" "black"  "tabby" 

This example uses the $ character to address items by name. coat is the first column of the data frame, again a vector of type character.


R +

+cats["coat"]


    coat
+1 calico
+2  black
+3  tabby

Here we are using a single bracket ["coat"] replacing the index number with the column name. Like example 1, the returned object is a list.


R +

+cats[1, 1]


[1] "calico"

This example uses a single bracket, but this time we provide row and column coordinates. The returned object is the value in row 1, column 1. The object is a vector of type character.


R +

+cats[, 1]


[1] "calico" "black"  "tabby" 

Like the previous example, we use single brackets and provide row and column coordinates. The row coordinate is not specified, so R interprets this missing value as all the elements in this column and returns them as a vector.


R +

+cats[1, ]


    coat weight likes_string
+1 calico    2.1         TRUE

Again we use the single bracket with row and column coordinates. The column coordinate is not specified. The return value is a list containing all the values in the first row.
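As a compact summary, `class()` can be asked what each form returns. The sketch uses a stand-in data frame called `cats_sketch`. The `drop = FALSE` argument in the last line is a standard option of `[` that keeps a single column as a data frame:

```r
cats_sketch <- data.frame(coat = c("calico", "black", "tabby"),
                          weight = c(2.1, 5.0, 3.2),
                          likes_string = c(TRUE, FALSE, TRUE),
                          stringsAsFactors = FALSE)

class(cats_sketch[1])        # "data.frame"
class(cats_sketch[[1]])      # "character"
class(cats_sketch$coat)      # "character"
class(cats_sketch["coat"])   # "data.frame"
class(cats_sketch[1, 1])     # "character"
class(cats_sketch[, 1])      # "character"
class(cats_sketch[1, ])      # "data.frame"

# Keep the table structure even when selecting one column
class(cats_sketch[, 1, drop = FALSE])   # "data.frame"
```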

+ +

Tip: Renaming data frame columns +


Data frames have column names, which can be accessed with the +names() function.


R +

+names(cats)

[1] "coat"         "weight"       "likes_string"

If you want to rename the second column of cats, you can +assign a new name to the second element of names(cats).


R +

+names(cats)[2] <- "weight_kg"


    coat weight_kg likes_string
+1 calico       2.1         TRUE
+2  black       5.0        FALSE
+3  tabby       3.2         TRUE

Matrices +


Last but not least is the matrix. We can declare a matrix full of +zeros:


R +

+matrix_example <- matrix(0, ncol=6, nrow=3)


     [,1] [,2] [,3] [,4] [,5] [,6]
+[1,]    0    0    0    0    0    0
+[2,]    0    0    0    0    0    0
+[3,]    0    0    0    0    0    0

What makes it special is the dim() attribute:


R +

+dim(matrix_example)

[1] 3 6

And similar to other data structures, we can ask things about our +matrix:


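R +

+typeof(matrix_example)


[1] "double"

R +

+class(matrix_example)


[1] "matrix" "array" 

R +

+str(matrix_example)


 num [1:3, 1:6] 0 0 0 0 0 0 0 0 0 0 ...

R +

+nrow(matrix_example)


[1] 3

R +

+ncol(matrix_example)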

[1] 6
+ +

Challenge 6 +


What do you think will be the result of +length(matrix_example)? Try it. Were you right? Why / why +not?

+ +

What do you think will be the result of +length(matrix_example)?


R +

+matrix_example <- matrix(0, ncol=6, nrow=3)
+length(matrix_example)


[1] 18

Because a matrix is a vector with added dimension attributes, +length gives you the total number of elements in the +matrix.
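The same idea can be seen in reverse: giving a plain vector a `dim` attribute turns it into a matrix. A minimal sketch with an arbitrary vector `v`:

```r
v <- 1:6
is.matrix(v)      # FALSE

dim(v) <- c(2, 3) # attach dimensions: 2 rows, 3 columns
is.matrix(v)      # TRUE
length(v)         # 6, the number of elements has not changed

# The elements can still be reached by their position in the vector
v[2, 3]           # 6
v[6]              # 6
```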

+ +

Challenge 7 +


Make another matrix, this time containing the numbers 1:50, with 5 +columns and 10 rows. Did the matrix function fill your +matrix by column, or by row, as its default behaviour? See if you can +figure out how to change this. (hint: read the documentation for +matrix!)

+ +



R +

+x <- matrix(1:50, ncol=5, nrow=10)
+x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row
+ +

Challenge 8 +


Create a list of length two containing a character vector for each of +the sections in this part of the workshop:

  • Data types
  • +
  • Data structures
  • +

Populate each character vector with the names of the data types and +data structures we’ve seen so far.

+ +

R +

+dataTypes <- c('double', 'complex', 'integer', 'character', 'logical')
+dataStructures <- c('data.frame', 'vector', 'list', 'matrix')
+answer <- list(dataTypes, dataStructures)

Note: it’s nice to make a list in big writing on the board or taped +to the wall listing all of these types and structures - leave it up for +the rest of the workshop to remind people of the importance of these +basics.

+ +

Challenge 9 +


Consider the R output of the matrix below:



     [,1] [,2]
+[1,]    4    1
+[2,]    9    5
+[3,]   10    7

What was the correct command used to write this matrix? Examine each +command and try to figure out the correct one before typing them. Think +about what matrices the other commands will produce.

  1. matrix(c(4, 1, 9, 5, 10, 7), nrow = 3)
  2. +
  3. matrix(c(4, 9, 10, 1, 5, 7), ncol = 2, byrow = TRUE)
  4. +
  5. matrix(c(4, 9, 10, 1, 5, 7), nrow = 2)
  6. +
  7. matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)
  8. +
+ +



R +

+matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)
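One way to check the answer is to build the target matrix row by row and compare. This is only a sketch, and the names `target`, `by_row` and `by_col` are illustrative:

```r
# The matrix from the challenge, assembled one row at a time
target <- rbind(c(4, 1), c(9, 5), c(10, 7))

by_row <- matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)  # option 4
by_col <- matrix(c(4, 1, 9, 5, 10, 7), nrow = 3)                # option 1

all(by_row == target)   # TRUE
all(by_col == target)   # FALSE: the values were filled column by column
by_col[, 1]             # 4 1 9
```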
+ +

Keypoints +

  • Use read.csv to read tabular data in R.
  • +
  • The basic data types in R are double, integer, complex, logical, and +character.
  • +
  • Data structures such as data frames or matrices are built on top of +lists and vectors, with some added attributes.
  • +

Content from Exploring Data Frames


Last updated on 2023-10-26

+ +




  • How can I manipulate a data frame?
  • +


  • Add and remove rows or columns.
  • +
  • Append two data frames.
  • +
  • Display basic properties of data frames including size and class of +the columns, names, and first few rows.
  • +

At this point, you’ve seen it all: in the last lesson, we toured all +the basic data types and data structures in R. Everything you do will be +a manipulation of those tools. But most of the time, the star of the +show is the data frame—the table that we created by loading information +from a csv file. In this lesson, we’ll learn a few more things about +working with data frames.


Adding columns and rows in data frames +


We already learned that the columns of a data frame are vectors, so +that our data are consistent in type throughout the columns. As such, if +we want to add a new column, we can start by making a new vector:


R +

+age <- c(2, 3, 5)
+cats


    coat weight likes_string
+1 calico    2.1            1
+2  black    5.0            0
+3  tabby    3.2            1

We can then add this as a column via:


R +

+cbind(cats, age)


    coat weight likes_string age
+1 calico    2.1            1   2
+2  black    5.0            0   3
+3  tabby    3.2            1   5

Note that if we tried to add a vector of ages with a different number +of entries than the number of rows in the data frame, it would fail:


R +

+age <- c(2, 3, 5, 12)
+cbind(cats, age)


Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 4

R +

+age <- c(2, 3)
+cbind(cats, age)


Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 2

Why didn’t this work? Of course, R wants to see one element in our +new column for every row in the table:


R +

+nrow(cats)


[1] 3

R +

+length(age)

[1] 2

So for it to work we need to have nrow(cats) = +length(age). Let’s overwrite the content of cats with our +new data frame.


R +

+age <- c(2, 3, 5)
+cats <- cbind(cats, age)
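The length check can also be made explicit before adding the column. The sketch below works on a stand-in data frame, `cats_sketch`, and adds the column with `$<-`, a common alternative to `cbind()`:

```r
cats_sketch <- data.frame(coat = c("calico", "black", "tabby"),
                          weight = c(2.1, 5.0, 3.2))
age <- c(2, 3, 5)

# Only add the column when there is one value per row
if (length(age) == nrow(cats_sketch)) {
  cats_sketch$age <- age
}
names(cats_sketch)   # "coat" "weight" "age"

# A vector of the wrong length is rejected by cbind()
too_long <- c(2, 3, 5, 12)
outcome <- tryCatch(cbind(cats_sketch, too_long),
                    error = function(e) "cbind failed")
outcome              # "cbind failed"
```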

Now how about adding rows? We already know that the rows of a data +frame are lists:


R +

+newRow <- list("tortoiseshell", 3.3, TRUE, 9)
+cats <- rbind(cats, newRow)

Let’s confirm that our new row was added correctly.


R +

+cats

           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9

Removing rows +


We now know how to add rows and columns to our data frame in R. Now +let’s learn to remove rows.


R +

+cats

           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9

We can ask for a data frame minus the last row:


R +

+cats[-4, ]


    coat weight likes_string age
+1 calico    2.1            1   2
+2  black    5.0            0   3
+3  tabby    3.2            1   5

Notice the comma with nothing after it to indicate that we want to +drop the entire fourth row.


Note: we could also remove several rows at once by putting the row +numbers inside of a vector, for example: +cats[c(-3,-4), ]
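A short sketch of both styles of row removal, on a stand-in data frame `cats_sketch`. Note that nothing is removed permanently unless the result is assigned back:

```r
cats_sketch <- data.frame(coat = c("calico", "black", "tabby", "tortoiseshell"),
                          weight = c(2.1, 5.0, 3.2, 3.3))

cats_sketch[-c(3, 4), ]                 # same as cats_sketch[c(-3, -4), ]
cats_sketch[cats_sketch$weight < 5, ]   # keep rows that meet a condition

nrow(cats_sketch)                       # still 4: the original is untouched
```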


Removing columns +


We can also remove columns in our data frame. What if we want to remove the column “age”? We can remove it in two ways: by column number or by column name.


R +

+cats[,-4]

           coat weight likes_string
+1        calico    2.1            1
+2         black    5.0            0
+3         tabby    3.2            1
+4 tortoiseshell    3.3            1

Notice the comma with nothing before it, indicating we want to keep +all of the rows.


Alternatively, we can drop the column by using the column name and the %in% operator. The %in% operator goes through each element of its left argument, in this case the names of cats, and asks, “Does this element occur in the second argument?”


R +

+drop <- names(cats) %in% c("age")
+cats[,!drop]


           coat weight likes_string
+1        calico    2.1            1
+2         black    5.0            0
+3         tabby    3.2            1
+4 tortoiseshell    3.3            1

We will cover subsetting with logical operators like +%in% in more detail in the next episode. See the section Subsetting through other logical +operations
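To see what `%in%` produces on its own, here is a sketch on a plain character vector (`column_names` stands in for `names(cats)`):

```r
column_names <- c("coat", "weight", "likes_string", "age")

drop <- column_names %in% c("age")
drop                   # FALSE FALSE FALSE  TRUE
column_names[!drop]    # "coat" "weight" "likes_string"

# The result always has the length of the left argument
c("age", "height") %in% column_names   # TRUE FALSE
```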


Appending to a data frame +


The key to remember when adding data to a data frame is that +columns are vectors and rows are lists. We can also glue two +data frames together with rbind:


R +

+cats <- rbind(cats, cats)


           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+5        calico    2.1            1   2
+6         black    5.0            0   3
+7         tabby    3.2            1   5
+8 tortoiseshell    3.3            1   9

If the row names are no longer sequential (as can happen after combining or removing rows), we can remove the rownames, and R will automatically re-name them sequentially:


R +

+rownames(cats) <- NULL


           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+5        calico    2.1            1   2
+6         black    5.0            0   3
+7         tabby    3.2            1   5
+8 tortoiseshell    3.3            1   9
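Resetting row names is most useful after rows have been selected or dropped, because the original row names travel with the rows. A sketch on a stand-in data frame:

```r
cats_sketch <- data.frame(coat = c("calico", "black", "tabby", "tortoiseshell"),
                          weight = c(2.1, 5.0, 3.2, 3.3))

kept <- cats_sketch[c(2, 4), ]
rownames(kept)           # "2" "4": the old row names are retained

rownames(kept) <- NULL   # reset to sequential row names
rownames(kept)           # "1" "2"
```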
+ +

Challenge 1 +


You can create a new data frame right from within R with the +following syntax:


R +

+df <- data.frame(id = c("a", "b", "c"),
+                 x = 1:3,
+                 y = c(TRUE, TRUE, FALSE))

Make a data frame that holds the following information for +yourself:

  • first name
  • +
  • last name
  • +
  • lucky number
  • +

Then use rbind to add an entry for the people sitting +beside you. Finally, use cbind to add a column with each +person’s answer to the question, “Is it time for coffee break?”

+ +

R +

+df <- data.frame(first = c("Grace"),
+                 last = c("Hopper"),
+                 lucky_number = c(0))
+df <- rbind(df, list("Marie", "Curie", 238) )
+df <- cbind(df, coffeetime = c(TRUE,TRUE))

Realistic example +


So far, you have seen the basics of manipulating data frames with our +cat data; now let’s use those skills to digest a more realistic dataset. +Let’s read in the gapminder dataset that we downloaded +previously:


R +

+gapminder <- read.csv("data/gapminder_data.csv")
+ +

Miscellaneous Tips +

  • Another type of file you might encounter is the tab-separated value file (.tsv). To specify a tab as the separator, pass sep = "\t" to read.csv(), or use read.delim().

  • +
  • Files can also be downloaded directly from the Internet into a local folder of your choice on your computer using the download.file function. The read.csv function can then be used to read the downloaded file from that location, for example:

  • +

R +

+download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
+gapminder <- read.csv("data/gapminder_data.csv")
  • Alternatively, you can also read in files directly into R from the +Internet by replacing the file paths with a web address in +read.csv. One should note that in doing this no local copy +of the csv file is first saved onto your computer. For example,
  • +

R +

+gapminder <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv")
  • You can read directly from Excel spreadsheets without converting them to plain text first by using the readxl package.

  • +
  • The argument “stringsAsFactors” tells R whether to read strings as factors or as character strings. In R 4.0 and later, strings are read in as characters by default, but in earlier versions of R, strings are read in as factors by default. For more information, see the call-out in the previous episode.

  • +
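The tab-separated and `stringsAsFactors` tips can be tried without downloading anything. This sketch writes a tiny, made-up table to a temporary file and reads it back in different ways:

```r
# Write a small tab-separated file to a temporary location
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("coat\tweight", "calico\t2.1", "black\t5.0"), tsv_path)

# Two ways to read it
from_delim <- read.delim(tsv_path, stringsAsFactors = FALSE)
from_csv   <- read.csv(tsv_path, sep = "\t", stringsAsFactors = FALSE)
all.equal(from_delim, from_csv)

class(from_delim$coat)   # "character"

# Ask explicitly for factors instead
as_factor <- read.delim(tsv_path, stringsAsFactors = TRUE)
class(as_factor$coat)    # "factor"
```

Setting `stringsAsFactors` explicitly makes a script behave the same way on older and newer versions of R.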

Let’s investigate gapminder a bit; the first thing we should always +do is check out what the data looks like with str:


R +

+str(gapminder)

'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...

An additional method for examining the structure of gapminder is to use the summary function. This function can be used on various objects in R. For data frames, summary yields a numeric, tabular, or descriptive summary of each column. Numeric or integer columns are described by descriptive statistics (quartiles and mean), and character columns by their length, class, and mode.


R +

+summary(gapminder)

   country               year           pop             continent        
+ Length:1704        Min.   :1952   Min.   :6.001e+04   Length:1704       
+ Class :character   1st Qu.:1966   1st Qu.:2.794e+06   Class :character  
+ Mode  :character   Median :1980   Median :7.024e+06   Mode  :character  
+                    Mean   :1980   Mean   :2.960e+07                     
+                    3rd Qu.:1993   3rd Qu.:1.959e+07                     
+                    Max.   :2007   Max.   :1.319e+09                     
+    lifeExp        gdpPercap       
+ Min.   :23.60   Min.   :   241.2  
+ 1st Qu.:48.20   1st Qu.:  1202.1  
+ Median :60.71   Median :  3531.8  
+ Mean   :59.47   Mean   :  7215.3  
+ 3rd Qu.:70.85   3rd Qu.:  9325.5  
+ Max.   :82.60   Max.   :113523.1  

Along with the str and summary functions, +we can examine individual columns of the data frame with our +typeof function:


R +

+typeof(gapminder$year)


[1] "integer"

R +

+typeof(gapminder$country)


[1] "character"

R +

+str(gapminder$country)

 chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...

We can also interrogate the data frame for information about its +dimensions; remembering that str(gapminder) said there were +1704 observations of 6 variables in gapminder, what do you think the +following will produce, and why?


R +

+length(gapminder)

[1] 6

A fair guess would have been to say that the length of a data frame +would be the number of rows it has (1704), but this is not the case; +remember, a data frame is a list of vectors and factors:


R +

+typeof(gapminder)

[1] "list"

When length gave us 6, it’s because gapminder is built +out of a list of 6 columns. To get the number of rows and columns in our +dataset, try:


R +

+nrow(gapminder)


[1] 1704

R +

+ncol(gapminder)

[1] 6

Or, both at once:


R +

+dim(gapminder)

[1] 1704    6
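The relationship between these functions is easier to see on a table small enough to count by eye. A sketch with a made-up data frame `small`:

```r
small <- data.frame(id = 1:4, value = c(0.5, 1.5, 2.5, 3.5))

length(small)      # 2: a data frame is a list of columns
ncol(small)        # 2
nrow(small)        # 4
dim(small)         # 4 2: rows first, then columns
length(small$id)   # 4: the length of one column is the number of rows
```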

We’ll also likely want to know what the titles of all the columns +are, so we can ask for them later:


R +

+colnames(gapminder)

[1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"

At this stage, it’s important to ask ourselves if the structure R is +reporting matches our intuition or expectations; do the basic data types +reported for each column make sense? If not, we need to sort any +problems out now before they turn into bad surprises down the road, +using what we’ve learned about how R interprets data, and the importance +of strict consistency in how we record our data.


Once we’re happy that the data types and structures seem reasonable, +it’s time to start digging into our data proper. Check out the first few +lines:


R +

+head(gapminder)

      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134
+ +

Challenge 2 +


It’s good practice to also check the last few lines of your data and +some in the middle. How would you do this?


Searching for ones specifically in the middle isn’t too hard, but we +could ask for a few lines at random. How would you code this?

+ +

Checking the last few lines is relatively simple, as R already has a function for this:


R +

+tail(gapminder, n = 15)

What about a few arbitrary rows just in case something is odd in the +middle?


Tip: There are several ways to achieve this. +


The solution here presents one form of using nested functions, i.e. a +function passed as an argument to another function. This might sound +like a new concept, but you are already using it! Remember +my_dataframe[rows, cols] will print to screen your data frame with the +number of rows and columns you asked for (although you might have asked +for a range or named columns for example). How would you get the last +row if you don’t know how many rows your data frame has? R has a +function for this. What about getting a (pseudorandom) sample? R also +has a function for this.


R +

+gapminder[sample(nrow(gapminder), 5), ]
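Because `sample()` is pseudorandom, each call returns different rows. If the same rows are needed again, the random seed can be fixed first. A sketch on a made-up data frame `demo` (the seed value 42 is arbitrary):

```r
demo <- data.frame(id = 1:100, value = (1:100)^2)

set.seed(42)                                  # fix the random number generator
first_draw <- demo[sample(nrow(demo), 5), ]

set.seed(42)                                  # same seed, same rows
second_draw <- demo[sample(nrow(demo), 5), ]

identical(first_draw, second_draw)            # TRUE
nrow(first_draw)                              # 5
```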

To make sure our analysis is reproducible, we should put the code +into a script file so we can come back to it later.

+ +

Challenge 3 +


Go to file -> new file -> R script, and write an R script to +load in the gapminder dataset. Put it in the scripts/ +directory and add it to version control.


Run the script using the source function, using the file +path as its argument (or by pressing the “source” button in +RStudio).

+ +

The source function can be used to use a script within a +script. Assume you would like to load the same type of file over and +over again and therefore you need to specify the arguments to fit the +needs of your file. Instead of writing the necessary argument again and +again you could just write it once and save it as a script. Then, you +can use source("Your_Script_containing_the_load_function") +in a new script to use the function of that script without writing +everything again. Check out ?source to find out more.


R +

+download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
+gapminder <- read.csv(file = "data/gapminder_data.csv")

To run the script and load the data into the gapminder +variable:


R +

+source(file = "scripts/load-gapminder.R")
+ +

Challenge 4 +


Read the output of str(gapminder) again; this time, use +what you’ve learned about lists and vectors, as well as the output of +functions like colnames and dim to explain +what everything that str prints out for gapminder means. If +there are any parts you can’t interpret, discuss with your +neighbors!

+ +

The object gapminder is a data frame with columns:

  • +country and continent are character +strings.
  • +
  • +year is an integer vector.
  • +
  • +pop, lifeExp, and gdpPercap +are numeric vectors.
  • +
+ +

Keypoints +

  • Use cbind() to add a new column to a data frame.
  • +
  • Use rbind() to add a new row to a data frame.
  • +
  • Remove rows from a data frame.
  • +
  • Use str(), summary(), nrow(), +ncol(), dim(), colnames(), +rownames(), head(), and typeof() +to understand the structure of a data frame.
  • +
  • Read in a csv file using read.csv().
  • +
  • Understand what length() of a data frame +represents.
  • +

Content from Subsetting Data


Last updated on 2023-10-26

+ +




  • How can I work with subsets of data in R?
  • +


  • To be able to subset vectors, factors, matrices, lists, and data +frames
  • +
  • To be able to extract individual and multiple elements: by index, by +name, using comparison operations
  • +
  • To be able to skip and remove elements from various data +structures.
  • +

R has many powerful subset operators. Mastering them will allow you +to easily perform complex operations on any kind of dataset.


There are six different ways we can subset any kind of object, and +three different subsetting operators for the different data +structures.


Let’s start with the workhorse of R: a simple numeric vector.


R +

+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')
+x


  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 
+ +

Atomic vectors +


In R, simple vectors containing character strings, numbers, or +logical values are called atomic vectors because they can’t be +further simplified.


So now that we’ve created a dummy vector to play with, how do we get +at its contents?


Accessing elements using their indices +


To extract elements of a vector we can give their corresponding +index, starting from one:


R +

+x[1]


  a 
+5.4 

R +

+x[4]


  d 
+4.8 

It may look different, but the square brackets operator is a +function. For vectors (and matrices), it means “get me the nth +element”.
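One way to see that [ really is a function is to call it by name, wrapped in backticks. This is only a demonstration; the variable v below is a copy of the example vector (an assumption made so that x is left untouched).

```r
v <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

by_bracket <- v[1]        # the usual notation
by_call    <- `[`(v, 1)   # the same operation written as a function call

# Both forms return the same named element
same_result <- identical(by_bracket, by_call)
```

The function-call form is rarely written by hand, but it explains why [ can be passed to other functions like any other function.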


We can ask for multiple elements at once:


R +

+x[c(1, 3)]


  a   c 
+5.4 7.1 

Or slices of the vector:


R +

+x[1:4]

  a   b   c   d 
+5.4 6.2 7.1 4.8 

The : operator creates a sequence of numbers from the left element to the right.


R +

+1:4

[1] 1 2 3 4

R +

+c(1, 2, 3, 4)


[1] 1 2 3 4

We can ask for the same element multiple times:


R +

+x[c(1,1,3)]

  a   a   c 
+5.4 5.4 7.1 

If we ask for an index beyond the length of the vector, R will return +a missing value:


R +

+x[6]


<NA> 
+  NA 

This is a vector of length one containing an NA, whose +name is also NA.


If we ask for the 0th element, we get an empty vector:


R +

+x[0]

named numeric(0)
+ +

Vector numbering in R starts at 1 +


In many programming languages (C and Python, for example), the first +element of a vector has an index of 0. In R, the first element is 1.


Skipping and removing elements +


If we use a negative number as the index of a vector, R will return +every element except for the one specified:


R +

+x[-2]

  a   c   d   e 
+5.4 7.1 4.8 7.5 

We can skip multiple elements:


R +

+x[c(-1, -5)]  # or x[-c(1,5)]


  b   c   d 
+6.2 7.1 4.8 
+ +

Tip: Order of operations +


A common trip up for novices occurs when trying to skip slices of a +vector. It’s natural to try to negate a sequence like so:


R +

+x[-1:3]

This gives a somewhat cryptic error:



Error in x[-1:3]: only 0's may be mixed with negative subscripts

But remember the order of operations. : is really a +function. It takes its first argument as -1, and its second as 3, so +generates the sequence of numbers: c(-1, 0, 1, 2, 3).


The correct solution is to wrap that function call in brackets, so +that the - operator applies to the result:


R +

+x[-(1:3)]

  d   e 
+4.8 7.5 
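The difference is easy to inspect by building the two index vectors on their own, without subsetting anything. The variable names below are made up for this illustration.

```r
# Read by R as (-1):3, i.e. a mix of negative, zero and positive indices
parsed_as_written <- -1:3

# Build 1:3 first, then negate every element
negated_sequence <- -(1:3)
```

Only the second vector is a valid set of "skip these" indices, which is why x[-(1:3)] works and x[-1:3] does not.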

To remove elements from a vector, we need to assign the result back +into the variable:


R +

+x <- x[-4]
+x


  a   b   c   e 
+5.4 6.2 7.1 7.5 
+ +

Challenge 1 +


Given the following code:


R +

+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')
+x


  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 

Come up with at least 2 different commands that will produce the +following output:



  b   c   d 
+6.2 7.1 4.8 

After you find 2 different commands, compare notes with your +neighbour. Did you have different strategies?

+ +

R +

+x[2:4]

  b   c   d 
+6.2 7.1 4.8 

R +

+x[-c(1,5)]

  b   c   d 
+6.2 7.1 4.8 

R +

+x[c(2,3,4)]

  b   c   d 
+6.2 7.1 4.8 

Subsetting by name +


We can extract elements by using their name, instead of extracting by +index:


R +

+x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we can name a vector 'on the fly'
+x[c("a", "c")]


  a   c 
+5.4 7.1 

This is usually a much more reliable way to subset objects: the +position of various elements can often change when chaining together +subsetting operations, but the names will always remain the same!
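A small sketch of why names are more robust than positions, using a copy of the example vector called v (a name chosen here so that x is not modified):

```r
v <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# After dropping the first element, every position shifts by one
v_trimmed <- v[-1]

by_position <- v_trimmed[3]     # position 3 now holds the element named "d"
by_name     <- v_trimmed["c"]   # the name still points at the value 7.1
```

Code that asked for position 3 would silently pick up a different element after the removal, while code that asked for "c" keeps working.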


Subsetting through other logical operations +


We can also use any logical vector to subset:


R +

+x[c(FALSE, FALSE, TRUE, FALSE, TRUE)]

  c   e 
+7.1 7.5 

Since comparison operators (e.g. >, +<, ==) evaluate to logical vectors, we can +also use them to succinctly subset vectors: the following statement +gives the same result as the previous one.


R +

+x[x > 7]


  c   e 
+7.1 7.5 

Breaking it down, this statement first evaluates x>7, +generating a logical vector +c(FALSE, FALSE, TRUE, FALSE, TRUE), and then selects the +elements of x corresponding to the TRUE +values.
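The two steps can also be run separately to inspect the intermediate logical vector. This is a sketch using a copy of the example vector named v; which() is a base R function that converts the TRUE positions to indices.

```r
v <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# Step 1: the comparison yields one TRUE/FALSE per element
keep <- v > 7

# Step 2: the logical vector selects the elements where it is TRUE
selected <- v[keep]

# which() reports the positions of the TRUE values
positions <- which(keep)
```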


We can use == to mimic the previous method of indexing +by name (remember you have to use == rather than += for comparisons):


R +

+x[names(x) == "a"]


  a 
+5.4 
+ +

Tip: Combining logical conditions +


We often want to combine multiple logical criteria. For example, we +might want to find all the countries that are located in Asia +or Europe and have life expectancies +within a certain range. Several operations for combining logical vectors +exist in R:

  • &, the “logical AND” operator: returns TRUE if both the left and right are TRUE.
  • |, the “logical OR” operator: returns TRUE if either the left or right (or both) are TRUE.

You may sometimes see && and || instead of & and |. These two-character operators work on single values only: older versions of R looked just at the first element of each vector and ignored the remaining elements, while R 4.3.0 and later raise an error if either side has more than one element. In general you should not use the two-character operators in data analysis; save them for programming, i.e. deciding whether to execute a statement.

  • !, the “logical NOT” operator: converts TRUE to FALSE and FALSE to TRUE. It can negate a single logical condition (e.g. !TRUE becomes FALSE), or a whole vector of conditions (e.g. !c(TRUE, FALSE) becomes c(FALSE, TRUE)).

Additionally, you can compare the elements within a single vector +using the all function (which returns TRUE if +every element of the vector is TRUE) and the +any function (which returns TRUE if one or +more elements of the vector are TRUE).
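A short sketch of these operators on the example vector, copied here as v. The thresholds are arbitrary values picked for illustration.

```r
v <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# AND: TRUE only where both comparisons hold
in_range <- v > 5 & v < 7.2

# OR: TRUE where at least one comparison holds
extreme <- v < 5 | v > 7.2

# NOT: flips every element of a logical vector
not_extreme <- !extreme

# any() and all() collapse a logical vector to a single value
some_extreme <- any(extreme)
all_in_range <- all(in_range)
```

Here in_range and not_extreme select the same elements, because the two conditions are exact opposites of each other.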

+ +

Challenge 2 +


Given the following code:


R +

+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')
+x


  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 

Write a subsetting command to return the values in x that are greater +than 4 and less than 7.

+ +

R +

+x_subset <- x[x<7 & x>4]
+print(x_subset)


  a   b   d 
+5.4 6.2 4.8 
+ +

Tip: Non-unique names +


You should be aware that it is possible for multiple elements in a +vector to have the same name. (For a data frame, columns can have the +same name — although R tries to avoid this — but row names must be +unique.) Consider these examples:


R +

+x <- 1:3
+x


[1] 1 2 3

R +

+names(x) <- c('a', 'a', 'a')
+x


a a a 
+1 2 3 

R +

+x['a']  # only returns first value


a 
+1 

R +

+x[names(x) == 'a']  # returns all three values


a a a 
+1 2 3 
+ +

Tip: Getting help for operators +


Remember you can search for help on operators by wrapping them in +quotes: help("%in%") or ?"%in%".


Skipping named elements +


Skipping or removing named elements is a little harder. If we try to +skip one named element by negating the string, R complains (slightly +obscurely) that it doesn’t know how to take the negative of a +string:


R +

+x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we start again by naming a vector 'on the fly'
+x[-"a"]


Error in -"a": invalid argument to unary operator

However, we can use the != (not-equals) operator to +construct a logical vector that will do what we want:


R +

+x[names(x) != "a"]


  b   c   d   e 
+6.2 7.1 4.8 7.5 

Skipping multiple named indices is a little bit harder still. Suppose +we want to drop the "a" and "c" elements, so +we try this:


R +

+x[names(x) != c("a", "c")]

Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length


  b   c   d   e 
+6.2 7.1 4.8 7.5 

R did something, but it gave us a warning that we ought to +pay attention to - and it apparently gave us the wrong answer +(the "c" element is still included in the vector)!


So what does != actually do in this case? That’s an +excellent question.


Recycling +


Let’s take a look at the comparison component of this code:


R +

+names(x) != c("a", "c")


Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length


[1] FALSE  TRUE  TRUE  TRUE  TRUE


Why does R give TRUE as the third element of this +vector, when names(x)[3] != "c" is obviously false? When +you use !=, R tries to compare each element of the left +argument with the corresponding element of its right argument. What +happens when you compare vectors of different lengths?

[Figure: Inequality testing]

When one vector is shorter than the other, it gets +recycled:

[Figure: Inequality testing: results of recycling]

In this case R repeats c("a", "c") as many times as necessary to match names(x), i.e. we get c("a","c","a","c","a"). Since the recycled "a" doesn’t match the third element of names(x), the value of != is TRUE. Because in this case the longer vector length (5) isn’t a multiple of the shorter vector length (2), R printed a warning message. If we had been unlucky and names(x) had contained six elements, R would silently have done the wrong thing (i.e., not what we intended it to do). This recycling rule can introduce subtle, hard-to-find bugs!
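The "unlucky" six-element case can be sketched as follows. The vector six_names is hypothetical, and rep_len() is used only to make the recycled vector visible.

```r
# A hypothetical vector of six names: 6 is a multiple of 2, so R gives no warning
six_names <- c("a", "b", "c", "d", "e", "f")

# What R effectively compares against: "a" "c" "a" "c" "a" "c"
recycled <- rep_len(c("a", "c"), length(six_names))

# The comparison runs silently, pairing element i with recycled[i]
silent_result <- six_names != c("a", "c")
```

The third element of silent_result is TRUE because "c" was compared with the recycled "a", so "c" would still be kept, and this time without any warning.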


The way to get R to do what we really want (match each element of the left argument with all of the elements of the right argument) is to use the %in% operator. The %in% operator goes through each element of its left argument, in this case the names of x, and asks, “Does this element occur in the second argument?”. Here, since we want to exclude values, we also need a ! operator to change “in” to “not in”:


R +

+x[! names(x) %in% c("a","c") ]


  b   d   e 
+6.2 4.8 7.5 
+ +

Challenge 3 +


Selecting elements of a vector that match any of a list of components +is a very common data analysis task. For example, the gapminder data set +contains country and continent variables, but +no information between these two scales. Suppose we want to pull out +information from southeast Asia: how do we set up an operation to +produce a logical vector that is TRUE for all of the +countries in southeast Asia and FALSE otherwise?


Suppose you have these data:


R +

+seAsia <- c("Myanmar","Thailand","Cambodia","Vietnam","Laos")
+## read in the gapminder data that we downloaded in episode 2
+gapminder <- read.csv("data/gapminder_data.csv", header=TRUE)
+## extract the `country` column from a data frame (we'll see this later);
+## convert from a factor to a character;
+## and get just the non-repeated elements
+countries <- unique(as.character(gapminder$country))

There’s a wrong way (using only ==), which will give you +a warning; a clunky way (using the logical operators == and +|); and an elegant way (using %in%). See +whether you can come up with all three and explain how they (don’t) +work.

+ +
  • The wrong way to do this problem is countries==seAsia. This gives a warning ("In countries == seAsia : longer object length is not a multiple of shorter object length") and the wrong answer (a vector of all FALSE values), because none of the recycled values of seAsia happen to line up correctly with matching values in countries.
  • +
  • The clunky (but technically correct) way to do this +problem is
  • +

R +

+ (countries=="Myanmar" | countries=="Thailand" |
+ countries=="Cambodia" | countries == "Vietnam" | countries=="Laos")

(or countries==seAsia[1] | countries==seAsia[2] | ...). +This gives the correct values, but hopefully you can see how awkward it +is (what if we wanted to select countries from a much longer list?).

  • The best way to do this problem is +countries %in% seAsia, which is both correct and easy to +type (and read).
  • +

Handling special values +


At some point you will encounter functions in R that cannot handle +missing, infinite, or undefined data.


There are a number of special functions you can use to filter out +this data:

  • is.na returns a logical vector that is TRUE at every position in a vector, matrix, or data.frame containing NA (or NaN).
  • Likewise, is.nan and is.infinite do the same for NaN and Inf.
  • is.finite returns a logical vector that is TRUE at every position in a vector, matrix, or data.frame that does not contain NA, NaN or Inf.
  • na.omit filters out all missing values from a vector.
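A minimal sketch of these functions on a made-up vector (the name measurements and its values are assumptions for illustration):

```r
measurements <- c(1.5, NA, Inf, -Inf, NaN, 3.2)

missing_flags  <- is.na(measurements)        # TRUE for NA and NaN
nan_flags      <- is.nan(measurements)       # TRUE for NaN only
infinite_flags <- is.infinite(measurements)  # TRUE for Inf and -Inf
finite_flags   <- is.finite(measurements)    # TRUE for ordinary numbers only

# Keep only the ordinary numbers by subsetting with the logical vector
finite_only <- measurements[finite_flags]

# na.omit() drops NA and NaN but keeps the infinite values
no_missing <- na.omit(measurements)
```

Because these functions return logical vectors, they combine with the subsetting techniques above in the same way as any other comparison.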

Factor subsetting +


Now that we’ve explored the different ways to subset vectors, how do +we subset the other data structures?


Factor subsetting works the same way as vector subsetting.


R +

+f <- factor(c("a", "a", "b", "c", "c", "d"))
+f[f == "a"]


[1] a a
+Levels: a b c d

R +

+f[f %in% c("b", "c")]


[1] b c c
+Levels: a b c d

R +

+f[1:3]

[1] a a b
+Levels: a b c d

Skipping elements will not remove the level even if no more of that +category exists in the factor:


R +

+f[-3]

[1] a a c c d
+Levels: a b c d
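If the unused level should go away as well, base R's droplevels() is one option. The sketch below uses a copy of the factor called grp (a hypothetical name) so that f is left as it is.

```r
grp <- factor(c("a", "a", "b", "c", "c", "d"))

# Removing the only "b" keeps "b" in the set of levels
grp_without_b <- grp[-3]
levels_before <- levels(grp_without_b)

# droplevels() discards levels that no longer occur in the data
levels_after <- levels(droplevels(grp_without_b))
```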

Matrix subsetting +


Matrices are also subsetted using the [ function. In +this case it takes two arguments: the first applying to the rows, the +second to its columns:


R +

+m <- matrix(rnorm(6*4), ncol=4, nrow=6)
+m[3:4, c(3,1)]


            [,1]       [,2]
+[1,]  1.12493092 -0.8356286
+[2,] -0.04493361  1.5952808

You can leave the first or second arguments blank to retrieve all the +rows or columns respectively:


R +

+m[, c(3,4)]


            [,1]        [,2]
+[1,] -0.62124058  0.82122120
+[2,] -2.21469989  0.59390132
+[3,]  1.12493092  0.91897737
+[4,] -0.04493361  0.78213630
+[5,] -0.01619026  0.07456498
+[6,]  0.94383621 -1.98935170

If we only access one row or column, R will automatically convert the +result to a vector:


R +

+m[3,]

[1] -0.8356286  0.5757814  1.1249309  0.9189774

If you want to keep the output as a matrix, you need to specify a +third argument; drop = FALSE:


R +

+m[3, , drop=FALSE]


           [,1]      [,2]     [,3]      [,4]
+[1,] -0.8356286 0.5757814 1.124931 0.9189774

Unlike vectors, if we try to access a row or column outside of the +matrix, R will throw an error:


R +

+m[, c(3,6)]


Error in m[, c(3, 6)]: subscript out of bounds
+ +

Tip: Higher dimensional arrays +


When dealing with multi-dimensional arrays, each argument to [ corresponds to a dimension. For example, for a 3D array, the first three arguments correspond to the rows, columns, and depth dimension.
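As a small illustration, assuming a 2 x 3 x 4 array named arr (a made-up example, not from the lesson):

```r
# 24 values arranged as 2 rows, 3 columns and 4 layers
arr <- array(1:24, dim = c(2, 3, 4))

# One index per dimension: row 1, column 2, layer 3
single_value <- arr[1, 2, 3]

# Leaving an argument blank keeps that whole dimension
first_layer <- arr[, , 1]   # a 2 x 3 matrix
second_row  <- arr[2, , ]   # a 3 x 4 matrix
```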


Because matrices are vectors, we can also subset using only one +argument:


R +

+m[5]

[1] 0.3295078

This usually isn’t useful, and is often confusing to read. However, it is useful to note that matrices are laid out in column-major format by default. That is, the elements of the vector are arranged column-wise:


R +

+matrix(1:6, nrow=2, ncol=3)


     [,1] [,2] [,3]
+[1,]    1    3    5
+[2,]    2    4    6

If you wish to populate the matrix by row, use +byrow=TRUE:


R +

+matrix(1:6, nrow=2, ncol=3, byrow=TRUE)


     [,1] [,2] [,3]
+[1,]    1    2    3
+[2,]    4    5    6

Matrices can also be subsetted using their rownames and column names +instead of their row and column indices.
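A brief sketch of name-based matrix subsetting, assuming made-up row and column names supplied through the dimnames argument of matrix():

```r
# Row names r1, r2 and column names A, B, C are invented for this example
named_m <- matrix(1:6, nrow = 2, ncol = 3,
                  dimnames = list(c("r1", "r2"), c("A", "B", "C")))

# Select row "r2" and columns "A" and "C" by name
picked <- named_m["r2", c("A", "C")]

# Names and numeric indices can be mixed across dimensions
whole_column <- named_m[, "B"]
```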

+ +

Challenge 4 +


Given the following code:


R +

+m <- matrix(1:18, nrow=3, ncol=6)
+m


     [,1] [,2] [,3] [,4] [,5] [,6]
+[1,]    1    4    7   10   13   16
+[2,]    2    5    8   11   14   17
+[3,]    3    6    9   12   15   18
  1. Which of the following commands will extract the values 11 and +14?
  2. +

A. m[2,4,2,5]


B. m[2:5]


C. m[4:5,2]


D. m[2,c(4,5)]

+ +

D



List subsetting +


Now we’ll introduce some new subsetting operators. There are three +functions used to subset lists. We’ve already seen these when learning +about atomic vectors and matrices: [, [[, and +$.


Using [ will always return a list. If you want to +subset a list, but not extract an element, then you +will likely use [.


R +

+xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))
+xlist[1]


$a
+[1] "Software Carpentry"

This returns a list with one element.


We can subset elements of a list exactly the same way as atomic vectors using [. Comparison operations, however, won’t work, because they are not recursive: they try to condition on the data structures in each element of the list, not on the individual elements within those data structures.


R +

+xlist[1:2]


$a
+[1] "Software Carpentry"
+
$b
+ [1]  1  2  3  4  5  6  7  8  9 10

To extract individual elements of a list, you need to use the +double-square bracket function: [[.


R +

+xlist[[1]]

[1] "Software Carpentry"

Notice that now the result is a vector, not a list.
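The difference between the two operators can be checked with class(). The list below is a shortened, hypothetical copy of xlist so that the example stands alone.

```r
demo_list <- list(a = "Software Carpentry", b = 1:10)

# Single brackets keep the list wrapper around the selected element
class_single <- class(demo_list[1])

# Double brackets return the element itself
class_double <- class(demo_list[[1]])

# Single brackets can return several elements, still as a list
n_selected <- length(demo_list[1:2])
```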


You can’t extract more than one element at once:


R +

+xlist[[1:2]]

Error in xlist[[1:2]]: subscript out of bounds

Nor use it to skip elements:


R +

+xlist[[-1]]

Error in xlist[[-1]]: invalid negative subscript in get1index <real>

But you can use names to both subset and extract elements:


R +

+xlist[["a"]]

[1] "Software Carpentry"

The $ function is a shorthand way for extracting +elements by name:


R +

+xlist$data

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
+Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
+Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
+Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
+Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
+Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
+Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
+ +

Challenge 5 +


Given the following list:


R +

+xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))

Using your knowledge of both list and vector subsetting, extract the +number 2 from xlist. Hint: the number 2 is contained within the “b” item +in the list.

+ +

R +

+xlist$b[2]

[1] 2

R +

+xlist[[2]][2]

[1] 2

R +

+xlist[["b"]][2]

[1] 2
+ +

Challenge 6 +


Given a linear model:


R +

+mod <- aov(pop ~ lifeExp, data=gapminder)

Extract the residual degrees of freedom (hint: +attributes() will help you)

+ +

R +

+attributes(mod) ## `df.residual` is one of the names of `mod`

R +

+mod$df.residual

Data frames +


Remember that data frames are lists under the hood, so similar rules apply. However, they are also two-dimensional objects:


[ with one argument will act the same way as for lists, +where each list element corresponds to a column. The resulting object +will be a data frame:


R +

+head(gapminder[3])


       pop
+1  8425333
+2  9240934
+3 10267083
+4 11537966
+5 13079460
+6 14880372

Similarly, [[ will act to extract a single +column:


R +

+head(gapminder[["lifeExp"]])

[1] 28.801 30.332 31.997 34.020 36.088 38.438

And $ provides a convenient shorthand to extract columns +by name:


R +

+head(gapminder$year)

[1] 1952 1957 1962 1967 1972 1977

With two arguments, [ behaves the same way as for +matrices:


R +

+gapminder[1:3,]

      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007

If we subset a single row, the result will be a data frame (because +the elements are mixed types):


R +

+gapminder[3,]

      country year      pop continent lifeExp gdpPercap
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007

But for a single column the result will be a vector (this can be +changed with the third argument, drop = FALSE).
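The effect of drop = FALSE can be sketched on a small stand-in data frame. The name toy_df and its values are assumptions for illustration, not the gapminder data.

```r
toy_df <- data.frame(country = c("A", "B", "C"),
                     year = c(1952L, 1957L, 1962L))

# Two-argument form with one column: the result is simplified to a vector
as_vector <- toy_df[, "year"]

# drop = FALSE keeps the data frame structure
as_frame <- toy_df[, "year", drop = FALSE]

# The one-argument (list-style) form never simplifies
list_style <- toy_df["year"]
```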

+ +

Challenge 7 +


Fix each of the following common data frame subsetting errors:

  1. Extract observations collected for the year 1957

R +

+gapminder[gapminder$year = 1957,]

  2. Extract all columns except 1 through to 4

R +

+gapminder[,-1:4]

  3. Extract the rows where the life expectancy is longer than 80 years

R +

+gapminder[gapminder$lifeExp > 80]

  4. Extract the first row, and the fourth and fifth columns (continent and lifeExp).

R +

+gapminder[1, 4, 5]

  5. Advanced: extract rows that contain information for the years 2002 and 2007

R +

+gapminder[gapminder$year == 2002 | 2007,]
+ +

Fix each of the following common data frame subsetting errors:

  1. Extract observations collected for the year 1957

R +

+# gapminder[gapminder$year = 1957,]
+gapminder[gapminder$year == 1957,]

  2. Extract all columns except 1 through to 4

R +

+# gapminder[,-1:4]
+gapminder[,-c(1:4)]

  3. Extract the rows where the life expectancy is longer than 80 years

R +

+# gapminder[gapminder$lifeExp > 80]
+gapminder[gapminder$lifeExp > 80,]

  4. Extract the first row, and the fourth and fifth columns (continent and lifeExp).

R +

+# gapminder[1, 4, 5]
+gapminder[1, c(4, 5)]

  5. Advanced: extract rows that contain information for the years 2002 and 2007

R +

+# gapminder[gapminder$year == 2002 | 2007,]
+gapminder[gapminder$year == 2002 | gapminder$year == 2007,]
+gapminder[gapminder$year %in% c(2002, 2007),]
+ +

Challenge 8 +

  1. Why does gapminder[1:20] return an error? How does it differ from gapminder[1:20, ]?

  2. Create a new data.frame called gapminder_small that only contains rows 1 through 9 and 19 through 23. You can do this in one or two steps.

+ +
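  1. With a single argument, [ treats gapminder as a list of columns, so gapminder[1:20] asks for columns 1 through 20, but gapminder only has 6 columns. A data.frame needs to be subsetted on two dimensions to select rows: gapminder[1:20, ] gives the first 20 rows and all columns.

  2. The second task can be done in one step: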

R +

+gapminder_small <- gapminder[c(1:9, 19:23),]
+ +

Keypoints +

  • Indexing in R starts at 1, not 0.
  • +
  • Access individual values by location using [].
  • +
  • Access slices of data using [low:high].
  • +
  • Access arbitrary sets of data using [c(...)].
  • +
  • Use logical operations and logical vectors to access subsets of +data.
  • +

Content from Control Flow


Last updated on 2023-10-26

+ +




  • How can I make data-dependent choices in R?
  • +
  • How can I repeat operations in R?
  • +


  • Write conditional statements with if...else statements +and ifelse().
  • +
  • Write and understand for() loops.
  • +

Often when we’re coding we want to control the flow of our actions. +This can be done by setting actions to occur only if a condition or a +set of conditions are met. Alternatively, we can also set an action to +occur a particular number of times.


There are several ways you can control flow in R. For conditional +statements, the most commonly used approaches are the constructs:


R +

# if
+if (condition is true) {
+  perform action
+}
+
+# if ... else
+if (condition is true) {
+  perform action
+} else {  # that is, if the condition is false,
+  perform alternative action
+}
Say, for example, that we want R to print a message if a variable +x has a particular value:


R +

+x <- 8
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+}
+x


[1] 8

The print statement does not appear in the console because x is not greater than or equal to 10. To print a different message for numbers less than 10, we can add an else statement.


R +

+x <- 8
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else {
+  print("x is less than 10")
+}


[1] "x is less than 10"

You can also test multiple conditions by using +else if.


R +

+x <- 8
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else if (x > 5) {
+  print("x is greater than 5, but less than 10")
+} else {
+  print("x is less than or equal to 5")
+}


[1] "x is greater than 5, but less than 10"

Important: when R evaluates the condition inside +if() statements, it is looking for a logical element, i.e., +TRUE or FALSE. This can cause some headaches +for beginners. For example:


R +

+x  <-  4 == 3
+if (x) {
+  "4 equals 3"
+} else {
+  "4 does not equal 3"
+}


[1] "4 does not equal 3"

As we can see, the not equal message was printed because the vector x is FALSE:


R +

+x <- 4 == 3
+x


[1] FALSE


+ +

Challenge 1 +


Use an if() statement to print a suitable message +reporting whether there are any records from 2002 in the +gapminder dataset. Now do the same for 2012.

+ +

We will first see a solution to Challenge 1 which does not use the any() function. We first obtain a logical vector describing which elements of gapminder$year are equal to 2002, and use it to select the matching rows:


R +

+gapminder[(gapminder$year == 2002),]

Then, we count the number of rows of the data.frame gapminder that correspond to the year 2002:


R +

+rows2002_number <- nrow(gapminder[(gapminder$year == 2002),])

The presence of any record for the year 2002 is equivalent to the +request that rows2002_number is one or more:


R +

+rows2002_number >= 1

Putting all together, we obtain:


R +

+if(nrow(gapminder[(gapminder$year == 2002),]) >= 1){
+   print("Record(s) for the year 2002 found.")
+}

All this can be done more quickly with any(). The +logical condition can be expressed as:


R +

+if(any(gapminder$year == 2002)){
+   print("Record(s) for the year 2002 found.")
+}

Did anyone get an error message like this?



Error in if (gapminder$year == 2012) {: the condition has length > 1

The if() function only accepts singular (of length 1) inputs, and therefore returns an error when you use it with a vector. (In versions of R before 4.2.0 this was only a warning: if() still ran, but evaluated the condition using only the first element of the vector.) Therefore, to use the if() function, you need to make sure your input is singular (of length 1).
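One way to guarantee a length-1 condition is to collapse the comparison with any() before handing it to if(). The sketch below uses a short made-up vector of years in place of gapminder$year.

```r
# Hypothetical stand-in for gapminder$year
years <- c(1952, 1957, 2002, 2007)

# years == 2012 has length 4; any() reduces it to a single TRUE or FALSE
has_2012 <- any(years == 2012)

if (has_2012) {
  status <- "Record(s) for the year 2012 found."
} else {
  status <- "No records for the year 2012 found."
}
```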

+ +

Tip: Built in ifelse() +function +


R accepts both if() and +else if() statements structured as outlined above, but also +statements using R’s built-in ifelse() +function. This function accepts both singular and vector inputs and is +structured as follows:


R +

# ifelse function
+ifelse(condition is true, perform action, perform alternative action)

where the first argument is the condition or a set of conditions to be met, the second argument is the statement that is evaluated when the condition is TRUE, and the third argument is the statement that is evaluated when the condition is FALSE.


R +

+y <- -3
+ifelse(y < 0, "y is a negative number", "y is either positive or zero")


[1] "y is a negative number"
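Because ifelse() is vectorized, the same call also works on a whole vector at once. A minimal sketch with made-up values:

```r
y_values <- c(-3, 0, 2.5)

# One result per element of the condition
sign_labels <- ifelse(y_values < 0, "negative", "positive or zero")
```

This is the main difference from if(), which would stop with an error when given the length-3 condition y_values < 0.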
+ +

Tip: any() and +all() +


The any() function will return TRUE if at +least one TRUE value is found within a vector, otherwise it +will return FALSE. This can be used in a similar way to the +%in% operator. The function all(), as the name +suggests, will only return TRUE if all values in the vector +are TRUE.
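The relationship between any(), all() and %in% can be sketched on a small made-up vector of years:

```r
years <- c(1952, 1957, 2002, 2007)

# Two ways of asking "does 2002 occur in years?"
found_with_any <- any(years == 2002)
found_with_in  <- 2002 %in% years

# all() requires every element to pass the test
all_after_1950 <- all(years > 1950)
all_after_1955 <- all(years > 1955)
```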


Repeating operations +


If you want to iterate over a set of values, when the order of +iteration is important, and perform the same operation on each, a +for() loop will do the job. We saw for() loops +in the shell +lessons earlier. This is the most flexible of looping operations, +but therefore also the hardest to use correctly. In general, the advice +of many R users would be to learn about for() +loops, but to avoid using for() loops unless the order of +iteration is important: i.e. the calculation at each iteration depends +on the results of previous iterations. If the order of iteration is not +important, then you should learn about vectorized alternatives, such as +the purrr package, as they pay off in computational +efficiency.


The basic structure of a for() loop is:


R +

for (iterator in set of values) {
+  do a thing
+}

For example:


R +

+for (i in 1:10) {
+  print(i)
+}


[1] 1
+[1] 2
+[1] 3
+[1] 4
+[1] 5
+[1] 6
+[1] 7
+[1] 8
+[1] 9
+[1] 10

The 1:10 bit creates a vector on the fly; you can +iterate over any other vector as well.


We can use a for() loop nested within another +for() loop to iterate over two things at once.


R +

+for (i in 1:5) {
+  for (j in c('a', 'b', 'c', 'd', 'e')) {
+    print(paste(i,j))
+  }
+}


[1] "1 a"
+[1] "1 b"
+[1] "1 c"
+[1] "1 d"
+[1] "1 e"
+[1] "2 a"
+[1] "2 b"
+[1] "2 c"
+[1] "2 d"
+[1] "2 e"
+[1] "3 a"
+[1] "3 b"
+[1] "3 c"
+[1] "3 d"
+[1] "3 e"
+[1] "4 a"
+[1] "4 b"
+[1] "4 c"
+[1] "4 d"
+[1] "4 e"
+[1] "5 a"
+[1] "5 b"
+[1] "5 c"
+[1] "5 d"
+[1] "5 e"

We notice in the output that when the first index (i) is +set to 1, the second index (j) iterates through its full +set of indices. Once the indices of j have been iterated +through, then i is incremented. This process continues +until the last index has been used for each for() loop.


Rather than printing the results, we could write the loop output to a +new object.


R +

+output_vector <- c()
+for (i in 1:5) {
+  for (j in c('a', 'b', 'c', 'd', 'e')) {
+    temp_output <- paste(i, j)
+    output_vector <- c(output_vector, temp_output)
+  }
+}
+output_vector


 [1] "1 a" "1 b" "1 c" "1 d" "1 e" "2 a" "2 b" "2 c" "2 d" "2 e" "3 a" "3 b"
+[13] "3 c" "3 d" "3 e" "4 a" "4 b" "4 c" "4 d" "4 e" "5 a" "5 b" "5 c" "5 d"
+[25] "5 e"

This approach can be useful, but ‘growing your results’ (building the +result object incrementally) is computationally inefficient, so avoid it +when you are iterating through a lot of values.

+ +

Tip: don’t grow your results +


One of the biggest things that trips up novices and experienced R users alike is building a results object (vector, list, matrix, data frame) as your for loop progresses. Computers are very bad at handling this, so your calculations can very quickly slow to a crawl. It’s much better to define an empty results object of the appropriate dimensions beforehand, rather than initializing an empty object without dimensions. So if you know the end result will be stored in a matrix like above, create an empty matrix with 5 rows and 5 columns, then at each iteration store the results in the appropriate location.


A better way is to define your (empty) output object before filling +in the values. For this example, it looks more involved, but is still +more efficient.


R +

+output_matrix <- matrix(nrow = 5, ncol = 5)
+j_vector <- c('a', 'b', 'c', 'd', 'e')
+for (i in 1:5) {
+  for (j in 1:5) {
+    temp_j_value <- j_vector[j]
+    temp_output <- paste(i, temp_j_value)
+    output_matrix[i, j] <- temp_output
+  }
+}
+output_vector2 <- as.vector(output_matrix)
+output_vector2


 [1] "1 a" "2 a" "3 a" "4 a" "5 a" "1 b" "2 b" "3 b" "4 b" "5 b" "1 c" "2 c"
+[13] "3 c" "4 c" "5 c" "1 d" "2 d" "3 d" "4 d" "5 d" "1 e" "2 e" "3 e" "4 e"
+[25] "5 e"
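For this particular example a loop is not strictly needed. As a swapped-in alternative (not the approach the lesson uses), base R's outer() can build the whole matrix of pairs in one vectorized call; the variable names below are chosen for illustration.

```r
j_vector <- c("a", "b", "c", "d", "e")

# outer() applies paste() to every combination of 1:5 and j_vector,
# returning a 5 x 5 character matrix with rows for i and columns for j
pair_matrix <- outer(1:5, j_vector, paste)

# Flattening the matrix reads it column by column
pairs_by_column <- as.vector(pair_matrix)
```

The result should match the preallocated loop above element for element, with no explicit iteration.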
+ +

Tip: While loops +


Sometimes you will find yourself needing to repeat an operation as +long as a certain condition is met. You can do this with a +while() loop.


R +

while(this condition is true){
+  do a thing
+}

R will interpret a condition being met as “TRUE”.


As an example, here’s a while loop that generates random numbers from +a uniform distribution (the runif() function) between 0 and +1 until it gets one that’s less than 0.1.


R +

+z <- 1
+while(z > 0.1){
+  z <- runif(1)
+  cat(z, "\n")
+}

while() loops will not always be appropriate. You have +to be particularly careful that you don’t end up stuck in an infinite +loop because your condition is always met and hence the while statement +never terminates.
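A common safeguard against infinite loops is to add an iteration cap to the condition. This is a sketch only; the counter n_iter and the limit max_iter are names and values assumed for the example.

```r
set.seed(1)        # make the random draws repeatable
z <- 1
n_iter <- 0
max_iter <- 1000   # arbitrary upper bound on the number of iterations

# Stop when z is small enough OR when the cap is reached, whichever comes first
while (z > 0.1 && n_iter < max_iter) {
  z <- runif(1)
  n_iter <- n_iter + 1
}
```

The two-character && is appropriate here because both sides are single values used to decide whether the loop continues.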

+ +

Challenge 2 +


Compare the objects output_vector and +output_vector2. Are they the same? If not, why not? How +would you change the last block of code to make +output_vector2 the same as output_vector?

+ +

We can check whether the two vectors are identical using the +all() function:


R +

+all(output_vector == output_vector2)


[1] FALSE

However, all the elements of output_vector can be found +in output_vector2:


R +

+all(output_vector %in% output_vector2)


[1] TRUE

and vice versa:


R +

+all(output_vector2 %in% output_vector)


[1] TRUE

Therefore, the elements in output_vector and output_vector2 are just sorted in a different order. This is because as.vector() outputs the elements of an input matrix column by column. Taking a look at output_matrix, we can see that we want its elements row by row. The solution is to transpose output_matrix. We can do this either by calling the transpose function t() or by filling in the elements in the right order. The first solution requires changing the original


R +

+output_vector2 <- as.vector(output_matrix)

into



R +

+output_vector2 <- as.vector(t(output_matrix))

The second solution requires changing


R +

+output_matrix[i, j] <- temp_output

into



R +

+output_matrix[j, i] <- temp_output
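
The column-by-column behaviour of as.vector() is easier to see on a small matrix. This sketch uses a made-up 2 x 3 matrix m rather than the lesson's output_matrix.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column: (1, 2), (3, 4), (5, 6)
by_column <- as.vector(m)              # reads down each column in turn
by_row <- as.vector(t(m))              # transposing first reads along each row of m
by_column
by_row
```

Here by_column is 1 2 3 4 5 6, while by_row is 1 3 5 2 4 6.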
+ +

Challenge 3 +


Write a script that loops through the gapminder data by +continent and prints out whether the mean life expectancy is smaller or +larger than 50 years.

+ +

Step 1: We want to make sure we can extract all the +unique values of the continent vector


R +

+gapminder <- read.csv("data/gapminder_data.csv")
+unique(gapminder$continent)

Step 2: We also need to loop over each of these +continents and calculate the average life expectancy for each +subset of data. We can do that as follows:

  1. Loop over each of the unique values of ‘continent’
  2. For each value of continent, create a temporary variable storing that subset
  3. Return the calculated life expectancy to the user by printing the output:

R +

+for (iContinent in unique(gapminder$continent)) {
+  tmp <- gapminder[gapminder$continent == iContinent, ]
+  cat(iContinent, mean(tmp$lifeExp, na.rm = TRUE), "\n")
+  rm(tmp)
+}

Step 3: The exercise only wants the output printed +if the average life expectancy is less than 50 or greater than 50. So we +need to add an if() condition before printing, which +evaluates whether the calculated average life expectancy is above or +below a threshold, and prints an output conditional on the result. We +need to amend (3) from above:


3a. If the calculated life expectancy is less than some threshold (50 +years), return the continent and a statement that life expectancy is +less than threshold, otherwise return the continent and a statement that +life expectancy is greater than threshold:


R +

+thresholdValue <- 50
+for (iContinent in unique(gapminder$continent)) {
+   tmp <- mean(gapminder[gapminder$continent == iContinent, "lifeExp"])
+   if (tmp < thresholdValue){
+       cat("Average Life Expectancy in", iContinent, "is less than", thresholdValue, "\n")
+   } else {
+       cat("Average Life Expectancy in", iContinent, "is greater than", thresholdValue, "\n")
+   } # end if else condition
+   rm(tmp)
+} # end for loop
+ +

Challenge 4 +


Modify the script from Challenge 3 to loop over each country. This +time print out whether the life expectancy is smaller than 50, between +50 and 70, or greater than 70.

+ +

We modify our solution to Challenge 3 by now adding two thresholds, +lowerThreshold and upperThreshold and +extending our if-else statements:


R +

+ lowerThreshold <- 50
+ upperThreshold <- 70
+for (iCountry in unique(gapminder$country)) {
+    tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"])
+    if(tmp < lowerThreshold) {
+        cat("Average Life Expectancy in", iCountry, "is less than", lowerThreshold, "\n")
+    } else if(tmp < upperThreshold) {  # reached only when tmp >= lowerThreshold
+        cat("Average Life Expectancy in", iCountry, "is between", lowerThreshold, "and", upperThreshold, "\n")
+    } else {
+        cat("Average Life Expectancy in", iCountry, "is greater than", upperThreshold, "\n")
+    }
+    rm(tmp)
+}
+ +

Challenge 5 - Advanced +


Write a script that loops over each country in the +gapminder dataset, tests whether the country starts with a +‘B’, and graphs life expectancy against time as a line graph if the mean +life expectancy is under 50 years.

+ +

We will use the grep() command that was introduced in the Unix Shell lesson to find countries that start with “B”. Let’s understand how to do this first. Following from the Unix shell section, we may be tempted to try the following:


R +

+grep("^B", unique(gapminder$country))

But when we evaluate this command, it returns the indices of the elements of unique(gapminder$country) that start with “B”, not the country names themselves. To get the values, we must add the value=TRUE option to the grep() command:


R +

+grep("^B", unique(gapminder$country), value = TRUE)
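
The difference between the two calls can be seen on a short vector. The country names below are made up for the example and are not read from the gapminder file.

```r
# Made-up names standing in for unique(gapminder$country).
countries <- c("Belgium", "Albania", "Brazil", "Cuba")
idx <- grep("^B", countries)                  # positions of the matches
vals <- grep("^B", countries, value = TRUE)   # the matching strings themselves
idx
vals
countries[idx]   # indexing with the positions gives the same strings
```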

We will now store these countries in a variable called +candidateCountries, and then loop over each entry in the variable. +Inside the loop, we evaluate the average life expectancy for each +country, and if the average life expectancy is less than 50 we use +base-plot to plot the evolution of average life expectancy using +with() and subset():


R +

+thresholdValue <- 50
+candidateCountries <- grep("^B", unique(gapminder$country), value = TRUE)
+for (iCountry in candidateCountries) {
+    tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"])
+    if (tmp < thresholdValue) {
+        cat("Average Life Expectancy in", iCountry, "is less than", thresholdValue, "plotting life expectancy graph... \n")
+        with(subset(gapminder, country == iCountry),
+                plot(year, lifeExp,
+                     type = "o",
+                     main = paste("Life Expectancy in", iCountry, "over time"),
+                     ylab = "Life Expectancy",
+                     xlab = "Year"
+                     ) # end plot
+             ) # end with
+    } # end if
+    rm(tmp)
+} # end for loop
+ +

Keypoints +

  • Use if and else to make choices.
  • Use for to repeat operations.

Content from Creating Publication-Quality Graphics with ggplot2


Last updated on 2023-10-26

+ +




  • How can I create publication-quality graphics in R?


  • To be able to use ggplot2 to generate publication-quality graphics.
  • To apply geometry, aesthetic, and statistics layers to a ggplot plot.
  • To manipulate the aesthetics of a plot using different colors, shapes, and lines.
  • To improve data visualization through transforming scales and paneling by group.
  • To save a plot created with ggplot to disk.

Plotting our data is one of the best ways to quickly explore it and +the various relationships between variables.


There are three main plotting systems in R, the base plotting +system, the lattice +package, and the ggplot2 +package.


Today we’ll be learning about the ggplot2 package, because it is the +most effective for creating publication-quality graphics.


ggplot2 is built on the grammar of graphics, the idea that any plot +can be built from the same set of components: a data +set, mapping aesthetics, and graphical +layers:

  • Data sets are the data that you, the user, provide.

  • Mapping aesthetics are what connect the data to the graphics. They tell ggplot2 how to use your data to affect how the graph looks, such as changing what is plotted on the X or Y axis, or the size or color of different data points.

  • Layers are the actual graphical output from ggplot2. Layers determine what kinds of plot are shown (scatterplot, histogram, etc.), the coordinate system used (rectangular, polar, others), and other important aspects of the plot. The idea of layers of graphics may be familiar to you if you have used image editing programs like Photoshop, Illustrator, or Inkscape.

Let’s start off building an example using the gapminder data from +earlier. The most basic function is ggplot, which lets R +know that we’re creating a new plot. Any of the arguments we give the +ggplot function are the global options for the +plot: they apply to all layers on the plot.


R +

+library("ggplot2")
+ggplot(data = gapminder)
Blank plot, before adding any mapping aesthetics to ggplot().

Here we called ggplot and told it what data we want to +show on our figure. This is not enough information for +ggplot to actually draw anything. It only creates a blank +slate for other elements to be added to.


Now we’re going to add in the mapping aesthetics +using the aes function. aes tells +ggplot how variables in the data map to +aesthetic properties of the figure, such as which columns of +the data should be used for the x and +y locations.


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
Plotting area with axes for a scatter plot of life expectancy vs GDP, with no data points visible.

Here we told ggplot we want to plot the “gdpPercap” column of the gapminder data frame on the x-axis, and the “lifeExp” column on the y-axis. Notice that we didn’t need to explicitly pass aes these columns (e.g. x = gapminder[, "gdpPercap"]). This is because ggplot looks for those column names in the data argument.


The final part of making our plot is to tell ggplot how +we want to visually represent the data. We do this by adding a new +layer to the plot using one of the +geom functions.


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point()
Scatter plot of life expectancy vs GDP per capita, now showing the data points.

Here we used geom_point, which tells ggplot +we want to visually represent the relationship between +x and y as a scatterplot of +points.

+ +

Challenge 1 +


Modify the example so that the figure shows how life expectancy has +changed over time:


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point()

Hint: the gapminder dataset has a column called “year”, which should +appear on the x-axis.

+ +

Here is one possible solution:


R +

+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) + geom_point()
Binned scatterplot of life expectancy versus year showing how life expectancy has increased over time
+ +

Challenge 2 +


In the previous examples and challenge we’ve used the +aes function to tell the scatterplot geom +about the x and y locations of each +point. Another aesthetic property we can modify is the point +color. Modify the code from the previous challenge to +color the points by the “continent” column. What trends +do you see in the data? Are they what you expected?

+ +

The solution presented below adds color=continent to the +call of the aes function. The general trend seems to +indicate an increased life expectancy over the years. On continents with +stronger economies we find a longer life expectancy.


R +

+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_point()
Binned scatterplot of life expectancy vs year with color-coded continents showing value of 'aes' function

Layers +


Using a scatterplot probably isn’t the best for visualizing change +over time. Instead, let’s tell ggplot to visualize the data +as a line plot:


R +

+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, color=continent)) +
+  geom_line()

Instead of adding a geom_point layer, we’ve added a +geom_line layer.


However, the result doesn’t look quite as we might have expected: it +seems to be jumping around a lot in each continent. Let’s try to +separate the data by country, plotting one line for each country:


R +

+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
+  geom_line()

We’ve added the group aesthetic, which +tells ggplot to draw a line for each country.


But what if we want to visualize both lines and points on the plot? +We can add another layer to the plot:


R +

+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
+  geom_line() + geom_point()

It’s important to note that each layer is drawn on top of the +previous layer. In this example, the points have been drawn on top +of the lines. Here’s a demonstration:


R +

+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
+  geom_line(mapping = aes(color=continent)) + geom_point()

In this example, the aesthetic mapping of +color has been moved from the global plot options in +ggplot to the geom_line layer so it no longer +applies to the points. Now we can clearly see that the points are drawn +on top of the lines.

+ +

Tip: Setting an aesthetic to a value instead +of a mapping +


So far, we’ve seen how to use an aesthetic (such as +color) as a mapping to a variable in the data. +For example, when we use +geom_line(mapping = aes(color=continent)), ggplot will give +a different color to each continent. But what if we want to change the +color of all lines to blue? You may think that +geom_line(mapping = aes(color="blue")) should work, but it +doesn’t. Since we don’t want to create a mapping to a specific variable, +we can move the color specification outside of the aes() +function, like this: geom_line(color="blue").
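
The three variants can be compared side by side. This is a sketch on a tiny made-up data frame (toy) rather than gapminder, and the plot object names are chosen for the example. It assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Tiny made-up data frame standing in for gapminder.
toy <- data.frame(
  year = rep(c(2000, 2005, 2010), times = 2),
  lifeExp = c(60, 62, 65, 70, 71, 73),
  country = rep(c("A", "B"), each = 3)
)

base_plot <- ggplot(data = toy,
                    mapping = aes(x = year, y = lifeExp, group = country))

# Mapping: the color depends on a column, so it goes inside aes().
mapped_plot <- base_plot + geom_line(mapping = aes(color = country))

# Setting: one fixed color for every line, so it goes outside aes().
fixed_plot <- base_plot + geom_line(color = "blue")

# Inside aes(), "blue" is treated as a data value, not as a color name.
not_blue_plot <- base_plot + geom_line(mapping = aes(color = "blue"))
```

Printing not_blue_plot should show lines in a default palette color with a legend entry labelled “blue”, which is why the setting belongs outside aes().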

+ +

Challenge 3 +


Switch the order of the point and line layers from the previous +example. What happened?

+ +

The lines now get drawn over the points!


R +

+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
+ geom_point() + geom_line(mapping = aes(color=continent))

Transformations and statistics +


ggplot2 also makes it easy to overlay statistical models over the +data. To demonstrate we’ll go back to our first example:


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point()

Currently it’s hard to see the relationship between the points due to some strong outliers in GDP per capita. We can change the scale of the x axis using the scale functions. These control the mapping between the data values and visual values of an aesthetic. We can also modify the transparency of the points using the alpha argument, which is especially helpful when you have a large amount of data which is very clustered.


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10()
Scatterplot of GDP vs life expectancy showing logarithmic x-axis data spread

The scale_x_log10 function applied a log10 transformation to the x-axis scale of the plot, so that each power of 10 is evenly spaced from left to right. For example, a GDP per capita of 1,000 is the same horizontal distance away from a value of 10,000 as the 10,000 value is from 100,000. This helps to visualize the spread of the data along the x-axis.
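
The even spacing follows from the arithmetic of logarithms, which can be checked in base R. The vector gdp below simply holds the three example values from the text.

```r
gdp <- c(1000, 10000, 100000)
positions <- log10(gdp)   # where each value lands on a log10 axis
positions
diff(positions)           # the gaps between neighbouring values are equal
```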

+ +

Tip Reminder: Setting an aesthetic to a value +instead of a mapping +


Notice that we used geom_point(alpha = 0.5). As the +previous tip mentioned, using a setting outside of the +aes() function will cause this value to be used for all +points, which is what we want in this case. But just like any other +aesthetic setting, alpha can also be mapped to a variable in +the data. For example, we can give a different transparency to each +continent with +geom_point(mapping = aes(alpha = continent)).


We can fit a simple relationship to the data by adding another layer, +geom_smooth:


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm")


`geom_smooth()` using formula = 'y ~ x'
Scatter plot of life expectancy vs GDP per capita with a blue trend line summarising the relationship between variables, and gray shaded area indicating 95% confidence intervals for that trend line.

We can make the line thicker by setting the linewidth aesthetic in the geom_smooth layer (before ggplot2 3.4.0 the size aesthetic was used for this, and it now triggers a deprecation warning for lines):


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm", linewidth=1.5)


`geom_smooth()` using formula = 'y ~ x'
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The blue trend line is slightly thicker than in the previous figure.

There are two ways an aesthetic can be specified. Here we set the linewidth aesthetic by passing it as an argument to geom_smooth. Previously in the lesson we’ve used the aes function to define a mapping between data variables and their visual representation.

+ +

Challenge 4a +


Modify the color and size of the points on the point layer in the +previous example.


Hint: do not use the aes function.

+ +

Here is a possible solution: Notice that the color argument is supplied outside of the aes() function. This means that it applies to all data points on the graph and is not related to a specific variable.


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+ geom_point(size=3, color="orange") + scale_x_log10() +
+ geom_smooth(method="lm", linewidth=1.5)


`geom_smooth()` using formula = 'y ~ x'
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency.
+ +

Challenge 4b +


Modify your solution to Challenge 4a so that the points are now a +different shape and are colored by continent with new trendlines. Hint: +The color argument can be used inside the aesthetic.

+ +

Here is a possible solution: Notice that supplying the +color argument inside the aes() functions +enables you to connect it to a certain variable. The shape +argument, as you can see, modifies all data points the same way (it is +outside the aes() call) while the color +argument which is placed inside the aes() call modifies a +point’s color based on its continent value.


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
+ geom_point(size=3, shape=17) + scale_x_log10() +
+ geom_smooth(method="lm", linewidth=1.5)


`geom_smooth()` using formula = 'y ~ x'

Multi-panel figures +


Earlier we visualized the change in life expectancy over time across +all countries in one plot. Alternatively, we can split this out over +multiple panels by adding a layer of facet panels.

+ +

Tip +


We start by making a subset of data including only countries located +in the Americas. This includes 25 countries, which will begin to clutter +the figure. Note that we apply a “theme” definition to rotate the x-axis +labels to maintain readability. Nearly everything in ggplot2 is +customizable.


R +

+americas <- gapminder[gapminder$continent == "Americas",]
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))

The facet_wrap layer took a “formula” as its argument, denoted by the tilde (~). This tells R to draw a panel for each unique value in the country column of the americas data frame.


Modifying text +


To clean this figure up for a publication we need to change some of +the text elements. The x-axis is too cluttered, and the y axis should +read “Life expectancy”, rather than the column name in the data +frame.


We can do this by adding a couple of different layers. The +theme layer controls the axis text, and overall text +size. Labels for the axes, plot title and any legend can be set using +the labs function. Legend titles are set using the same +names we used in the aes specification. Thus below the +color legend title is set using color = "Continent", while +the title of a fill legend would be set using +fill = "MyTitle".


R +

+ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_line() + facet_wrap( ~ country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Exporting the plot +


The ggsave() function allows you to export a plot +created with ggplot. You can specify the dimension and resolution of +your plot by adjusting the appropriate arguments (width, +height and dpi) to create high quality +graphics for publication. In order to save the plot from above, we first +assign it to a variable lifeExp_plot, then tell +ggsave to save that plot in png format to a +directory called results. (Make sure you have a +results/ folder in your working directory.)


R +

+lifeExp_plot <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_line() + facet_wrap( ~ country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+ggsave(filename = "results/lifeExp.png", plot = lifeExp_plot, width = 12, height = 10, dpi = 300, units = "cm")

There are two nice things about ggsave. First, it +defaults to the last plot, so if you omit the plot argument +it will automatically save the last plot you created with +ggplot. Secondly, it tries to determine the format you want +to save your plot in from the file extension you provide for the +filename (for example .png or .pdf). If you +need to, you can specify the format explicitly in the +device argument.
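
The extension-based format detection can be tried without touching your project folders. This sketch makes several assumptions for the sake of a self-contained example: it uses a made-up data frame (toy), writes to a temporary file instead of results/, and saves a PDF so that it does not depend on a PNG graphics device being available. It assumes the ggplot2 package is installed.

```r
library(ggplot2)

toy <- data.frame(x = 1:3, y = c(2, 4, 8))   # made-up data
toy_plot <- ggplot(data = toy, mapping = aes(x = x, y = y)) + geom_line()

# tempfile() avoids assuming that a results/ directory exists.
out_file <- tempfile(fileext = ".pdf")

# The .pdf extension tells ggsave() which format to write.
ggsave(filename = out_file, plot = toy_plot, width = 12, height = 10, units = "cm")
file.exists(out_file)
```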


This is a taste of what you can do with ggplot2. RStudio provides a +really useful cheat +sheet of the different layers available, and more extensive +documentation is available on the ggplot2 website. All +RStudio cheat sheets can be found here. Finally, +if you have no idea how to change something, a quick Google search will +usually send you to a relevant question and answer on Stack Overflow +with reusable code to modify!

+ +

Challenge 5 +


Generate boxplots to compare life expectancy between the different +continents during the available years.



  • Rename y axis as Life Expectancy.
  • Remove x axis labels.
+ +

Here is a possible solution: xlab() and ylab() set labels for the x and y axes, respectively. The axis title, text and ticks are attributes of the theme and must be modified within a theme() call.


R +

+ggplot(data = gapminder, mapping = aes(x = continent, y = lifeExp, fill = continent)) +
+ geom_boxplot() + facet_wrap(~year) +
+ ylab("Life Expectancy") +
+ theme(axis.title.x=element_blank(),
+       axis.text.x = element_blank(),
+       axis.ticks.x = element_blank())
+ +

Keypoints +

  • Use ggplot2 to create plots.
  • Think about graphics in layers: aesthetics, geometry, statistics, scale transformation, and grouping.

Content from Vectorization


Last updated on 2023-10-26

+ +




  • How can I operate on all the elements of a vector at once?


  • To understand vectorized operations in R.

Most of R’s functions are vectorized, meaning that the function will +operate on all elements of a vector without needing to loop through and +act on each element one at a time. This makes writing code more concise, +easy to read, and less error prone.


R +

+x <- 1:4
+x * 2


[1] 2 4 6 8

The multiplication happened to each element of the vector.


We can also add two vectors together:


R +

+y <- 6:9
+x + y


[1]  7  9 11 13

Each element of x was added to its corresponding element +of y:


R +

x:  1  2  3  4
+    +  +  +  +
+y:  6  7  8  9
+    7  9 11 13

Here is how we would add two vectors together using a for loop:


R +

+output_vector <- c()
+for (i in 1:4) {
+  output_vector[i] <- x[i] + y[i]
+}
+output_vector


[1]  7  9 11 13

Compare this to the output using vectorised operations.


R +

+sum_xy <- x + y
+sum_xy


[1]  7  9 11 13
+ +

Challenge 1 +


Let’s try this on the pop column of the +gapminder dataset.


Make a new column in the gapminder data frame that +contains population in units of millions of people. Check the head or +tail of the data frame to make sure it worked.

+ +



R +

+gapminder$pop_millions <- gapminder$pop / 1e6
+head(gapminder)


      country year      pop continent lifeExp gdpPercap pop_millions
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453     8.425333
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530     9.240934
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007    10.267083
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971    11.537966
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811    13.079460
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134    14.880372
+ +

Challenge 2 +


On a single graph, plot population, in millions, against year, for +all countries. Do not worry about identifying which country is +which.


Repeat the exercise, graphing only for China, India, and Indonesia. +Again, do not worry about which is which.

+ +

Refresh your plotting skills by plotting population in millions +against year.


R +

+ggplot(gapminder, aes(x = year, y = pop_millions)) +
+ geom_point()
Scatter plot showing populations in the millions against the year for all countries, countries are not labeled.

R +

+countryset <- c("China","India","Indonesia")
+ggplot(gapminder[gapminder$country %in% countryset,],
+       aes(x = year, y = pop_millions)) +
+  geom_point()
Scatter plot showing populations in the millions against the year for China, India, and Indonesia, countries are not labeled.

Comparison operators, logical operators, and many functions are also +vectorized:


Comparison operators


R +

+x > 2


[1] FALSE FALSE  TRUE  TRUE



Logical operators


R +

+a <- x > 3  # or, for clarity, a <- (x > 3)
+a


[1] FALSE FALSE FALSE  TRUE


+ +

Tip: some useful functions for logical +vectors +


any() will return TRUE if any +element of a vector is TRUE.
all() will return TRUE if all +elements of a vector are TRUE.
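
A short sketch of both functions, using the same x as in the examples above:

```r
x <- 1:4
any(x > 3)   # TRUE: at least one element is greater than 3
all(x > 3)   # FALSE: not every element is greater than 3
all(x > 0)   # TRUE: every element is greater than 0
```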


Most functions also operate element-wise on vectors:




R +

+x <- 1:4
+log(x)


[1] 0.0000000 0.6931472 1.0986123 1.3862944

Vectorized operations work element-wise on matrices:


R +

+m <- matrix(1:12, nrow=3, ncol=4)
+m * -1


     [,1] [,2] [,3] [,4]
+[1,]   -1   -4   -7  -10
+[2,]   -2   -5   -8  -11
+[3,]   -3   -6   -9  -12
+ +

Tip: element-wise vs. matrix +multiplication +


Very important: the operator * gives you element-wise +multiplication! To do matrix multiplication, we need to use the +%*% operator:


R +

+m %*% matrix(1, nrow=4, ncol=1)


     [,1]
+[1,]   22
+[2,]   26
+[3,]   30

R +

+matrix(1:4, nrow=1) %*% matrix(1:4, ncol=1)


     [,1]
+[1,]   30

For more on matrix algebra, see the Quick-R +reference guide
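
To see the two operators disagree on the same inputs, here is a small sketch with two made-up 2 x 2 matrices, a and b:

```r
a <- matrix(1:4, nrow = 2)             # columns are (1, 2) and (3, 4)
b <- matrix(c(0, 1, 1, 0), nrow = 2)   # columns are (0, 1) and (1, 0)
elementwise <- a * b                   # multiplies matching positions
matrix_product <- a %*% b              # rows of a times columns of b
elementwise
matrix_product
```

With this choice of b, the element-wise product keeps only the off-diagonal entries of a, while the matrix product swaps the columns of a.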

+ +

Challenge 3 +


Given the following matrix:


R +

+m <- matrix(1:12, nrow=3, ncol=4)


     [,1] [,2] [,3] [,4]
+[1,]    1    4    7   10
+[2,]    2    5    8   11
+[3,]    3    6    9   12

Write down what you think will happen when you run:

  1. m ^ -1
  2. m * c(1, 0, -1)
  3. m > c(0, 20)
  4. m * c(1, 0, -1, 2)

Did you get the output you expected? If not, ask a helper!

+ +

The results of running each expression on m are shown below. In items 2 to 4 the shorter vector is recycled down the columns of m.

  1. m ^ -1


          [,1]      [,2]      [,3]       [,4]
+[1,] 1.0000000 0.2500000 0.1428571 0.10000000
+[2,] 0.5000000 0.2000000 0.1250000 0.09090909
+[3,] 0.3333333 0.1666667 0.1111111 0.08333333
  2. m * c(1, 0, -1)


     [,1] [,2] [,3] [,4]
+[1,]    1    4    7   10
+[2,]    0    0    0    0
+[3,]   -3   -6   -9  -12
  3. m > c(0, 20)


      [,1]  [,2]  [,3]  [,4]
+[1,]  TRUE FALSE  TRUE FALSE
+[2,] FALSE  TRUE FALSE  TRUE
+[3,]  TRUE FALSE  TRUE FALSE
  4. m * c(1, 0, -1, 2)


     [,1] [,2] [,3] [,4]
+[1,]    1    8   -7    0
+[2,]    0    5   16  -11
+[3,]   -3    0    9   24
+ +

Challenge 4 +


We’re interested in looking at the sum of the following sequence of +fractions:


R +

+ x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2)

This would be tedious to type out, and impossible for high values of +n. Use vectorisation to compute x when n=100. What is the sum when +n=10,000?

+ +



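R +

+sum(1/(1:100)^2)


[1] 1.634984

R +

+sum(1/(1:1e04)^2)


[1] 1.644834

R +

+n <- 10000
+sum(1/(1:n)^2)


[1] 1.644834

We can also obtain the same results using a function:


R +

+inverse_sum_of_squares <- function(n) {
+  sum(1/(1:n)^2)
+}
+inverse_sum_of_squares(100)


[1] 1.634984

R +

+inverse_sum_of_squares(10000)


[1] 1.644834

R +

+n <- 10000
+inverse_sum_of_squares(n)


[1] 1.644834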
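
As a sanity check on these numbers, the infinite version of this series is known to converge to pi^2 / 6 (a standard mathematical result that the exercise itself does not state). The sketch below compares a few partial sums with that limit; the names n_values, partial_sums and limit are chosen for the example.

```r
n_values <- c(100, 10000, 1000000)
partial_sums <- sapply(n_values, function(n) sum(1 / (1:n)^2))
limit <- pi^2 / 6
partial_sums
limit - partial_sums   # the gap shrinks as n grows
```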
+ +

Tip: Operations on vectors of unequal +length +


Operations can also be performed on vectors of unequal length, through a process known as recycling. This process automatically repeats the shorter vector until it matches the length of the longer vector. R will provide a warning if the length of the longer vector is not a multiple of the length of the shorter vector.


R +

+x <- c(1, 2, 3)
+y <- c(1, 2, 3, 4, 5, 6, 7)
+x + y


Warning in x + y: longer object length is not a multiple of shorter object
+length


[1] 2 4 6 5 7 9 8

Vector x was recycled to match the length of vector +y


R +

x:  1  2  3  1  2  3  1
+    +  +  +  +  +  +  +
+y:  1  2  3  4  5  6  7
+    2  4  6  5  7  9  8
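
Recycling can also be written out by hand with rep(), which makes the repeated values visible. In this sketch the names x_recycled, explicit_sum and recycled_sum are chosen for the example, and suppressWarnings() is used only to keep the output quiet.

```r
x <- c(1, 2, 3)
y <- c(1, 2, 3, 4, 5, 6, 7)
x_recycled <- rep(x, length.out = length(y))   # x repeated and cut to the length of y
explicit_sum <- x_recycled + y
recycled_sum <- suppressWarnings(x + y)        # same values, warning silenced here
x_recycled
identical(explicit_sum, recycled_sum)
```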
+ +

Keypoints +

  • Use vectorized operations instead of loops.

Content from Functions Explained


Last updated on 2023-10-26

+ +




  • How can I write a new function in R?


  • Define a function that takes arguments.
  • Return a value from a function.
  • Check argument conditions with stopifnot() in functions.
  • Test a function.
  • Set default values for function arguments.
  • Explain why we should divide programs into small, single-purpose functions.

If we only had one data set to analyze, it would probably be faster +to load the file into a spreadsheet and use that to plot simple +statistics. However, the gapminder data is updated periodically, and we +may want to pull in that new information later and re-run our analysis +again. We may also obtain similar data from a different source in the +future.


In this lesson, we’ll learn how to write a function so that we can +repeat several operations with a single command.

+ +

What is a function? +


Functions gather a sequence of operations into a whole, preserving it +for ongoing use. Functions provide:

  • a name we can remember and invoke it by
  • relief from the need to remember the individual operations
  • a defined set of inputs and expected outputs
  • rich connections to the larger programming environment

As the basic building block of most programming languages, +user-defined functions constitute “programming” as much as any single +abstraction can. If you have written a function, you are a computer +programmer.


Defining a function +


Let’s open a new R script file in the functions/ +directory and call it functions-lesson.R.


The general structure of a function is:


R +

+my_function <- function(parameters) {
+  # perform action
+  # return value
+}

Let’s define a function fahr_to_kelvin() that converts +temperatures from Fahrenheit to Kelvin:


R +

+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}

We define fahr_to_kelvin() by assigning it to the output of function. The list of argument names is contained within parentheses. Next, the body of the function (the statements that are executed when it runs) is contained within curly braces ({}). The statements in the body are indented by two spaces. This makes the code easier to read but does not affect how the code operates.


It is useful to think of creating functions like writing a cookbook. +First you define the “ingredients” that your function needs. In this +case, we only need one ingredient to use our function: “temp”. After we +list our ingredients, we then say what we will do with them, in this +case, we are taking our ingredient and applying a set of mathematical +operators to it.


When we call the function, the values we pass to it as arguments are +assigned to those variables so that we can use them inside the function. +Inside the function, we use a return statement to send a +result back to whoever asked for it.

+ +

Tip +


In R, the return statement is not required. R automatically returns the value of the last expression evaluated in the body of the function. But for clarity, we will write the return statement explicitly.
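To make the tip concrete, here is a minimal sketch. The names double_explicit and double_implicit are made up for illustration and are not part of the lesson. Both functions give the same result, because R hands back the value of the last expression evaluated in the body.

```r
# Two hypothetical functions that differ only in how they return.
double_explicit <- function(x) {
  result <- x * 2
  return(result)
}

double_implicit <- function(x) {
  x * 2  # the value of the last expression is returned
}

double_explicit(21)
double_implicit(21)
```

The explicit form is a little longer, but it makes the returned value easy to spot when a function body grows.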


Let’s try running our function. Calling our own function is no +different from calling any other function:


R +

+# freezing point of water
+fahr_to_kelvin(32)


[1] 273.15

R +

+# boiling point of water
+fahr_to_kelvin(212)


[1] 373.15
+ +

Challenge 1 +


Write a function called kelvin_to_celsius() that takes a +temperature in Kelvin and returns that temperature in Celsius.


Hint: To convert from Kelvin to Celsius you subtract 273.15

+ +

Write a function called kelvin_to_celsius that takes a +temperature in Kelvin and returns that temperature in Celsius


R +

+kelvin_to_celsius <- function(temp) {
+ celsius <- temp - 273.15
+ return(celsius)
+}

Combining functions +


The real power of functions comes from mixing, matching and combining +them into ever-larger chunks to get the effect we want.


Let’s define two functions that will convert temperature from +Fahrenheit to Kelvin, and Kelvin to Celsius:


R +

+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+kelvin_to_celsius <- function(temp) {
+  celsius <- temp - 273.15
+  return(celsius)
+}
+ +

Challenge 2 +


Define the function to convert directly from Fahrenheit to Celsius, +by reusing the two functions above (or using your own functions if you +prefer).

+ +

Define the function to convert directly from Fahrenheit to Celsius, +by reusing these two functions above


R +

+fahr_to_celsius <- function(temp) {
+  temp_k <- fahr_to_kelvin(temp)
+  result <- kelvin_to_celsius(temp_k)
+  return(result)
+}

Interlude: Defensive Programming +


Now that we’ve begun to appreciate how writing functions provides an +efficient way to make R code re-usable and modular, we should note that +it is important to ensure that functions only work in their intended +use-cases. Checking function parameters is related to the concept of +defensive programming. Defensive programming encourages us to +frequently check conditions and throw an error if something is wrong. +These checks are referred to as assertion statements because we want to +assert some condition is TRUE before proceeding. They make +it easier to debug because they give us a better idea of where the +errors originate.


Checking conditions with stopifnot() + +


Let’s start by re-examining fahr_to_kelvin(), our +function for converting temperatures from Fahrenheit to Kelvin. It was +defined like so:


R +

+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}

For this function to work as intended, the argument temp +must be a numeric value; otherwise, the mathematical +procedure for converting between the two temperature scales will not +work. To create an error, we can use the function stop(). +For example, since the argument temp must be a +numeric vector, we could check for this condition with an +if statement and throw an error if the condition was +violated. We could augment our function above like so:


R +

+fahr_to_kelvin <- function(temp) {
+  if (!is.numeric(temp)) {
+    stop("temp must be a numeric vector.")
+  }
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}

If we had multiple conditions or arguments to check, it would take many lines of code to check all of them. Luckily R provides the convenience function stopifnot(). We can list as many requirements as we like, each of which should evaluate to TRUE; stopifnot() throws an error if it finds one that is FALSE. Listing these conditions also serves a secondary purpose as extra documentation for the function.


Let’s try out defensive programming with stopifnot() by +adding assertions to check the input to our function +fahr_to_kelvin().


We want to assert the following: temp is a numeric +vector. We may do that like so:


R +

+fahr_to_kelvin <- function(temp) {
+  stopifnot(is.numeric(temp))
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}

It still works when given proper input.


R +

+# freezing point of water
+fahr_to_kelvin(temp = 32)


[1] 273.15

But fails instantly if given improper input.


R +

+# Metric is a factor instead of numeric
+fahr_to_kelvin(temp = as.factor(32))


Error in fahr_to_kelvin(temp = as.factor(32)): is.numeric(temp) is not TRUE
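stopifnot() accepts several conditions at once. The sketch below is an illustration only: the function name fahr_to_kelvin_checked and the second check (no temperature below absolute zero, -459.67 degrees Fahrenheit) are assumptions added here, not part of the lesson's function.

```r
# Sketch: the same conversion, with two assertions instead of one.
fahr_to_kelvin_checked <- function(temp) {
  stopifnot(
    is.numeric(temp),     # must be a numeric vector
    all(temp >= -459.67)  # nothing below absolute zero in Fahrenheit
  )
  ((temp - 32) * (5 / 9)) + 273.15
}

fahr_to_kelvin_checked(32)

# try() captures the expected error so the script can continue
outcome <- try(fahr_to_kelvin_checked(-500), silent = TRUE)
inherits(outcome, "try-error")
```

Each condition is checked in order, and the error message names the first one that is not TRUE.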
+ +

Challenge 3 +


Use defensive programming to ensure that our +fahr_to_celsius() function throws an error immediately if +the argument temp is specified inappropriately.

+ +

Extend our previous definition of the function by adding in an +explicit call to stopifnot(). Since +fahr_to_celsius() is a composition of two other functions, +checking inside here makes adding checks to the two component functions +redundant.


R +

+fahr_to_celsius <- function(temp) {
+  stopifnot(is.numeric(temp))
+  temp_k <- fahr_to_kelvin(temp)
+  result <- kelvin_to_celsius(temp_k)
+  return(result)
+}

More on combining functions +


Now, we’re going to define a function that calculates the Gross +Domestic Product of a nation from the data available in our dataset:


R +

+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat) {
+  gdp <- dat$pop * dat$gdpPercap
+  return(gdp)
+}

We define calcGDP() by assigning it to the output of function. The list of argument names is contained within parentheses. Next, the body of the function (the statements executed when you call the function) is contained within curly braces ({}).


We’ve indented the statements in the body by two spaces. This makes +the code easier to read but does not affect how it operates.


When we call the function, the values we pass to it are assigned to +the arguments, which become variables inside the body of the +function.


Inside the function, we use the return() function to +send back the result. This return() function is optional: R +will automatically return the results of whatever command is executed on +the last line of the function.


R +

+calcGDP(head(gapminder))

[1]  6567086330  7585448670  8758855797  9648014150  9678553274 11697659231

That’s not very informative. Let’s add some more arguments so we can +extract that per year and country.


R +

+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year=NULL, country=NULL) {
+  if(!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country,]
+  }
+  gdp <- dat$pop * dat$gdpPercap
+  new <- cbind(dat, gdp=gdp)
+  return(new)
+}

If you’ve been writing these functions down into a separate R script (a good idea!), you can load the functions into your R session by using the source() function:


R +

+source("functions/functions-lesson.R")

Ok, so there’s a lot going on in this function now. In plain English, +the function now subsets the provided data by year if the year argument +isn’t empty, then subsets the result by country if the country argument +isn’t empty. Then it calculates the GDP for whatever subset emerges from +the previous two steps. The function then adds the GDP as a new column +to the subsetted data and returns this as the final result. You can see +that the output is much more informative than a vector of numbers.


Let’s take a look at what happens when we specify the year:


R +

+head(calcGDP(gapminder, year=2007))


       country year      pop continent lifeExp  gdpPercap          gdp
+12 Afghanistan 2007 31889923      Asia  43.828   974.5803  31079291949
+24     Albania 2007  3600523    Europe  76.423  5937.0295  21376411360
+36     Algeria 2007 33333216    Africa  72.301  6223.3675 207444851958
+48      Angola 2007 12420476    Africa  42.731  4797.2313  59583895818
+60   Argentina 2007 40301927  Americas  75.320 12779.3796 515033625357
+72   Australia 2007 20434176   Oceania  81.235 34435.3674 703658358894

Or for a specific country:


R +

+calcGDP(gapminder, country="Australia")


     country year      pop continent lifeExp gdpPercap          gdp
+61 Australia 1952  8691212   Oceania  69.120  10039.60  87256254102
+62 Australia 1957  9712569   Oceania  70.330  10949.65 106349227169
+63 Australia 1962 10794968   Oceania  70.930  12217.23 131884573002
+64 Australia 1967 11872264   Oceania  71.100  14526.12 172457986742
+65 Australia 1972 13177000   Oceania  71.930  16788.63 221223770658
+66 Australia 1977 14074100   Oceania  73.490  18334.20 258037329175
+67 Australia 1982 15184200   Oceania  74.740  19477.01 295742804309
+68 Australia 1987 16257249   Oceania  76.320  21888.89 355853119294
+69 Australia 1992 17481977   Oceania  77.560  23424.77 409511234952
+70 Australia 1997 18565243   Oceania  78.830  26997.94 501223252921
+71 Australia 2002 19546792   Oceania  80.370  30687.75 599847158654
+72 Australia 2007 20434176   Oceania  81.235  34435.37 703658358894

Or both:


R +

+calcGDP(gapminder, year=2007, country="Australia")


     country year      pop continent lifeExp gdpPercap          gdp
+72 Australia 2007 20434176   Oceania  81.235  34435.37 703658358894

Let’s walk through the body of the function:


R +

calcGDP <- function(dat, year=NULL, country=NULL) {

Here we’ve added two arguments, year, and +country. We’ve set default arguments for both as +NULL using the = operator in the function +definition. This means that those arguments will take on those values +unless the user specifies otherwise.
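Default arguments work the same way in any function. As a small sketch, round_mean and its digits argument below are hypothetical names chosen for illustration:

```r
# Hypothetical example: `digits` has a default value, `x` does not.
round_mean <- function(x, digits = 1) {
  round(mean(x), digits)
}

values <- c(1.234, 2.345, 3.456)
round_mean(values)              # uses the default, digits = 1
round_mean(values, digits = 3)  # overrides the default
```

Arguments without a default, such as x here, must be supplied by the caller.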


R +

+  if(!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country,]
+  }

Here, we check whether each additional argument is NULL, and whenever it is not NULL we overwrite the dataset stored in dat with a subset given by the non-NULL argument.


Building these conditionals into the function makes it more flexible +for later. Now, we can use it to calculate the GDP for:

  • The whole dataset;
  • +
  • A single year;
  • +
  • A single country;
  • +
  • A single combination of year and country.
  • +

Because we used %in% rather than ==, we can also give multiple years or countries to those arguments.
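To see the effect of %in% without loading gapminder, here is a sketch on a toy data frame. The countries "A" and "B" and all the numbers are made up; only the column names match the ones calcGDP() uses.

```r
# Toy data with made-up values, shaped like the gapminder columns used here.
toy <- data.frame(
  country   = c("A", "A", "A", "B", "B", "B"),
  year      = c(1997, 2002, 2007, 1997, 2002, 2007),
  pop       = c(10, 11, 12, 20, 21, 22),
  gdpPercap = c(100, 110, 120, 200, 210, 220)
)

calcGDP <- function(dat, year = NULL, country = NULL) {
  if (!is.null(year)) {
    dat <- dat[dat$year %in% year, ]
  }
  if (!is.null(country)) {
    dat <- dat[dat$country %in% country, ]
  }
  gdp <- dat$pop * dat$gdpPercap
  cbind(dat, gdp = gdp)
}

# A vector of years keeps every row whose year matches any of them
two_years <- calcGDP(toy, year = c(2002, 2007), country = "A")
two_years
```

With == the vector of years would be recycled element by element against the column, which rarely selects the intended rows.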

+ +

Tip: Pass by value +


Functions in R almost always make copies of the data to operate on +inside of a function body. When we modify dat inside the +function we are modifying the copy of the gapminder dataset stored in +dat, not the original variable we gave as the first +argument.


This is called “pass-by-value” and it makes writing code much safer: +you can always be sure that whatever changes you make within the body of +the function, stay inside the body of the function.
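A short sketch of this behaviour, using a hypothetical helper add_one_column() that is not part of the lesson:

```r
# Hypothetical helper that changes its argument inside the function body.
add_one_column <- function(dat) {
  dat$extra <- 1  # modifies the local copy only
  dat
}

original <- data.frame(x = 1:3)
modified <- add_one_column(original)

names(original)  # still only "x"
names(modified)  # "x" and "extra"
```

To keep a change, assign the function's return value to a variable, as done with modified above.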

+ +

Tip: Function scope +


Another important concept is scoping: any variables (or functions!) +you create or modify inside the body of a function only exist for the +lifetime of the function’s execution. When we call +calcGDP(), the variables dat, gdp +and new only exist inside the body of the function. Even if +we have variables of the same name in our interactive R session, they +are not modified in any way when executing a function.
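As an illustration of scoping, the sketch below uses a hypothetical function compute_gdp() that creates a local variable with the same name as one in the session:

```r
gdp <- "defined in the session"

# Hypothetical function that creates its own variable called gdp
compute_gdp <- function(pop, gdpPercap) {
  gdp <- pop * gdpPercap  # local to this call
  gdp
}

compute_gdp(10, 100)
gdp  # the session variable is unchanged
```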


R +

  gdp <- dat$pop * dat$gdpPercap
+  new <- cbind(dat, gdp=gdp)
+  return(new)

Finally, we calculated the GDP on our new subset, and created a new +data frame with that column added. This means when we call the function +later we can see the context for the returned GDP values, which is much +better than in our first attempt where we got a vector of numbers.

+ +

Challenge 4 +


Test out your GDP function by calculating the GDP for New Zealand in +1987. How does this differ from New Zealand’s GDP in 1952?

+ +

R +

+  calcGDP(gapminder, year = c(1952, 1987), country = "New Zealand")

GDP for New Zealand in 1987: 65050008703


GDP for New Zealand in 1952: 21058193787

+ +

Challenge 5 +


The paste() function can be used to combine text +together, e.g:


R +

+best_practice <- c("Write", "programs", "for", "people", "not", "computers")
+paste(best_practice, collapse=" ")


[1] "Write programs for people not computers"

Write a function called fence() that takes two vectors +as arguments, called text and wrapper, and +prints out the text wrapped with the wrapper:


R +

+fence(text=best_practice, wrapper="***")

Note: the paste() function has an argument called sep, which specifies the separator between text. The default is a space: " ". The default for paste0() is no space: "".

+ +

Write a function called fence() that takes two vectors +as arguments, called text and wrapper, and +prints out the text wrapped with the wrapper:


R +

+fence <- function(text, wrapper){
+  text <- c(wrapper, text, wrapper)
+  result <- paste(text, collapse = " ")
+  return(result)
+}
+
+best_practice <- c("Write", "programs", "for", "people", "not", "computers")
+fence(text=best_practice, wrapper="***")


[1] "*** Write programs for people not computers ***"
+ +

Tip +


R has some unique aspects that can be exploited when performing more +complicated operations. We will not be writing anything that requires +knowledge of these more advanced concepts. In the future when you are +comfortable writing functions in R, you can learn more by reading the R +Language Manual or this chapter from Advanced R Programming by Hadley +Wickham.

+ +

Tip: Testing and documenting +


It’s important to both test functions and document them: documentation helps you, and others, understand what the purpose of your function is and how to use it, and it’s important to make sure that your function actually does what you think it does.


When you first start out, your workflow will probably look a lot like +this:

  1. Write a function
  2. Comment parts of the function to document its behaviour
  3. Load in the source file
  4. Experiment with it in the console to make sure it behaves as you expect
  5. Make any necessary bug fixes
  6. Rinse and repeat.

Formal documentation for functions, written in separate +.Rd files, gets turned into the documentation you see in +help files. The roxygen2 +package allows R coders to write documentation alongside the function +code and then process it into the appropriate .Rd files. +You will want to switch to this more formal method of writing +documentation when you start writing more complicated R projects. In +fact, packages are, in essence, bundles of functions with this formal +documentation. Loading your own functions through +source("functions.R") is equivalent to loading someone +else’s functions (or your own one day!) through +library("package").


Formal automated tests can be written using the testthat package.
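Before moving to a testing package, simple checks can be written with base R alone. The following is only a sketch of that idea, not the testthat workflow: it re-defines fahr_to_kelvin() from earlier in this episode and asserts a few known results.

```r
fahr_to_kelvin <- function(temp) {
  stopifnot(is.numeric(temp))
  ((temp - 32) * (5 / 9)) + 273.15
}

# Known reference points: compare with a tolerance rather than with ==
stopifnot(isTRUE(all.equal(fahr_to_kelvin(32), 273.15)))
stopifnot(isTRUE(all.equal(fahr_to_kelvin(212), 373.15)))

# Bad input should raise an error; tryCatch() turns that into TRUE/FALSE
raised <- tryCatch(
  {
    fahr_to_kelvin("hot")
    FALSE
  },
  error = function(e) TRUE
)
stopifnot(raised)
```

If every check passes, the script runs silently; the first failing check stops it with an error naming the condition.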

+ +

Keypoints +

  • Use function to define a new function in R.
  • +
  • Use parameters to pass values into functions.
  • +
  • Use stopifnot() to flexibly check function arguments in +R.
  • +
  • Load functions into programs using source().
  • +

Content from Writing Data


Last updated on 2023-10-26

+ +




  • How can I save plots and data created in R?
  • +


  • To be able to write out plots and data from R.
  • +

Saving plots +


You have already seen how to save the most recent plot you create in +ggplot2, using the command ggsave. As a +refresher:


R +


You can save a plot from within RStudio using the ‘Export’ button in +the ‘Plot’ window. This will give you the option of saving as a .pdf or +as .png, .jpg or other image formats.


Sometimes you will want to save plots without creating them in the +‘Plot’ window first. Perhaps you want to make a pdf document with +multiple pages: each one a different plot, for example. Or perhaps +you’re looping through multiple subsets of a file, plotting data from +each subset, and you want to save each plot, but obviously can’t stop +the loop to click ‘Export’ for each one.


In this case you can use a more flexible approach. The function +pdf creates a new pdf device. You can control the size and +resolution using the arguments to this function.


R +

+pdf("Life_Exp_vs_time.pdf", width=12, height=4)
+ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=country)) +
+  geom_line() +
+  theme(legend.position = "none")
+# You then have to make sure to turn off the pdf device!
+dev.off()

Open up this document and have a look.

+ +

Challenge 1 +


Rewrite your ‘pdf’ command to print a second page in the pdf, showing +a facet plot (hint: use facet_grid) of the same data with +one panel per continent.

+ +

R +

+pdf("Life_Exp_vs_time.pdf", width = 12, height = 4)
+p <- ggplot(data = gapminder, aes(x = year, y = lifeExp, colour = country)) +
+  geom_line() +
+  theme(legend.position = "none")
+p
+p + facet_grid(~continent)
+dev.off()

The commands jpeg, png etc. are used +similarly to produce documents in different formats.
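The looping scenario described earlier (one plot per subset, all saved without clicking 'Export') can be sketched as follows. To stay self-contained, this example uses base graphics and made-up data instead of ggplot2 and gapminder; the file name is hypothetical and the file is written to a temporary directory.

```r
# Made-up data: two groups measured over the same years
toy <- data.frame(
  group = rep(c("A", "B"), each = 3),
  year  = rep(c(1997, 2002, 2007), times = 2),
  value = c(1, 2, 3, 3, 2, 1)
)

out_file <- file.path(tempdir(), "one_page_per_group.pdf")  # hypothetical name

pdf(out_file, width = 6, height = 4)
for (g in unique(toy$group)) {
  subset_g <- toy[toy$group == g, ]
  # each call to plot() starts a new page in the pdf device
  plot(subset_g$year, subset_g$value, type = "l", main = paste("Group", g))
}
dev.off()  # close the device so the file is written

file.exists(out_file)
```

With ggplot2, a plot object created inside a for loop is not drawn automatically, so it needs an explicit print() call inside the loop.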


Writing data +


At some point, you’ll also want to write out data from R.


We can use the write.table function for this, which is +very similar to read.table from before.


Let’s create a data-cleaning script. For this analysis, we only want to focus on the gapminder data for Australia:


R +

+aust_subset <- gapminder[gapminder$country == "Australia",]
+write.table(aust_subset,
+  file="cleaned-data/gapminder-aus.csv",
+  sep=","
+)

Let’s switch back to the shell to take a look at the data to make +sure it looks OK:



head cleaned-data/gapminder-aus.csv



Hmm, that’s not quite what we wanted. Where did all these quotation +marks come from? Also the row numbers are meaningless.


Let’s look at the help file to work out how to change this +behaviour.


R +

+?write.table

By default R will wrap character vectors with quotation marks when +writing out to file. It will also write out the row and column +names.


Let’s fix this:


R +

+write.table(
+  gapminder[gapminder$country == "Australia",],
+  file="cleaned-data/gapminder-aus.csv",
+  sep=",", quote=FALSE, row.names=FALSE
+)

Now let’s look at the data again using our shell skills:



head cleaned-data/gapminder-aus.csv



That looks better!
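The difference between the two calls can also be checked from within R. This sketch writes a tiny made-up data frame to temporary files (the variable names with_defaults and cleaned are chosen for illustration) and reads the raw lines back:

```r
toy <- data.frame(country = c("Australia", "Australia"), year = c(2002, 2007))

with_defaults <- tempfile(fileext = ".csv")
cleaned       <- tempfile(fileext = ".csv")

write.table(toy, file = with_defaults, sep = ",")
write.table(toy, file = cleaned, sep = ",", quote = FALSE, row.names = FALSE)

readLines(with_defaults)  # quoted strings, plus a column of row names
readLines(cleaned)        # plain header and values
```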

+ +

Challenge 2 +


Write a data-cleaning script file that subsets the gapminder data to +include only data points collected since 1990.


Use this script to write out the new subset to a file in the +cleaned-data/ directory.

+ +

R +

+write.table(
+  gapminder[gapminder$year > 1990, ],
+  file = "cleaned-data/gapminder-after1990.csv",
+  sep = ",", quote = FALSE, row.names = FALSE
+)
+ +

Keypoints +

  • Save plots from RStudio using the ‘Export’ button.
  • +
  • Use write.table to save tabular data.
  • +

Content from Splitting and Combining Data Frames with plyr


Last updated on 2023-10-26

+ +




  • How can I do different calculations on different sets of data?
  • +


  • To be able to use the split-apply-combine strategy for data +analysis.
  • +

Previously we looked at how you can use functions to simplify your +code. We defined the calcGDP function, which takes the +gapminder dataset, and multiplies the population and GDP per capita +column. We also defined additional arguments so we could filter by +year and country:


R +

+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year=NULL, country=NULL) {
+  if(!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country,]
+  }
+  gdp <- dat$pop * dat$gdpPercap
+  new <- cbind(dat, gdp=gdp)
+  return(new)
+}

A common task you’ll encounter when working with data is running calculations on different groups within the data. In the above, we were calculating the GDP by multiplying two columns together. But what if we wanted to calculate the mean GDP per continent?


We could run calcGDP and then take the mean of each +continent:


R +

+withGDP <- calcGDP(gapminder)
+mean(withGDP[withGDP$continent == "Africa", "gdp"])


[1] 20904782844

R +

+mean(withGDP[withGDP$continent == "Americas", "gdp"])


[1] 379262350210

R +

+mean(withGDP[withGDP$continent == "Asia", "gdp"])


[1] 227233738153

But this isn’t very nice. Yes, by using a function, you have +reduced a substantial amount of repetition. That is +nice. But there is still repetition. Repeating yourself will cost you +time, both now and later, and potentially introduce some nasty bugs.


We could write a new function that is flexible like +calcGDP, but this also takes a substantial amount of effort +and testing to get right.


The abstract problem we’re encountering here is known as “split-apply-combine”:

Split apply combine

We want to split our data into groups, in this case +continents, apply some calculations on that group, then +optionally combine the results together afterwards.
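The three steps can be written out by hand in base R. The sketch below uses a made-up data frame instead of gapminder, and is meant only to show what each step does before plyr wraps them into a single call:

```r
# Made-up values for two continents
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  gdp       = c(10, 30, 100, 300)
)

pieces <- split(toy$gdp, toy$continent)  # split: one numeric vector per continent
means  <- lapply(pieces, mean)           # apply: the mean of each piece
result <- unlist(means)                  # combine: a named numeric vector
result
```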


The plyr package +


For those of you who have used R before, you might be familiar with +the apply family of functions. While R’s built in functions +do work, we’re going to introduce you to another method for solving the +“split-apply-combine” problem. The plyr package provides a set of +functions that we find more user friendly for solving this problem.


We installed this package in an earlier challenge. Let us load it +now:


R +

+library("plyr")

Plyr has functions for operating on lists, +data.frames and arrays (matrices, or +n-dimensional vectors). Each function performs:

  1. A splitting operation
  2. Apply a function on each split in turn.
  3. Recombine output data as a single data object.

The functions are named based on the data structure they expect as +input, and the data structure you want returned as output: [a]rray, +[l]ist, or [d]ata.frame. The first letter corresponds to the input data +structure, the second letter to the output data structure, and then the +rest of the function is named “ply”.


This gives us 9 core functions **ply. There are an additional three +functions which will only perform the split and apply steps, and not any +combine step. They’re named by their input data type and represent null +output by a _ (see table)


Note here that plyr’s use of “array” is different to R’s: an array in plyr can include a vector or matrix.

Full apply suite

Each of the xxply functions (daply, ddply, llply, laply, …) has the same structure and 4 key features:


R +

+xxply(.data, .variables, .fun)
  • The first letter of the function name gives the input type and the +second gives the output type.
  • +
  • .data - gives the data object to be processed
  • +
  • .variables - identifies the splitting variables
  • +
  • .fun - gives the function to be called on each piece
  • +

Now we can quickly calculate the mean GDP per continent:


R +

+ddply(
+ .data = calcGDP(gapminder),
+ .variables = "continent",
+ .fun = function(x) mean(x$gdp)
+)


  continent           V1
+1    Africa  20904782844
+2  Americas 379262350210
+3      Asia 227233738153
+4    Europe 269442085301
+5   Oceania 188187105354

Let us walk through the previous code:

  • The ddply function feeds in a data.frame +(function starts with d) and returns another +data.frame (2nd letter is a d)
  • +
  • the first argument we gave was the data.frame we wanted to operate +on: in this case the gapminder data. We called calcGDP on +it first so that it would have the additional gdp column +added to it.
  • +
  • The second argument indicated our split criteria: in this case the +“continent” column. Note that we gave the name of the column, not the +values of the column like we had done previously with subsetting. Plyr +takes care of these implementation details for you.
  • +
  • The third argument is the function we want to apply to each grouping +of the data. We had to define our own short function here: each subset +of the data gets stored in x, the first argument of our +function. This is an anonymous function: we haven’t defined it +elsewhere, and it has no name. It only exists in the scope of our call +to ddply.
  • +
+ +

Challenge 1 +


Calculate the average life expectancy per continent. Which has the +longest? Which has the shortest?

+ +

R +

+ddply(
+ .data = gapminder,
+ .variables = "continent",
+ .fun = function(x) mean(x$lifeExp)
+)

Oceania has the longest and Africa the shortest.


What if we want a different type of output data structure?:


R +

+dlply(
+ .data = calcGDP(gapminder),
+ .variables = "continent",
+ .fun = function(x) mean(x$gdp)
+)


+[1] 20904782844
+[1] 379262350210
+[1] 227233738153
+[1] 269442085301
+[1] 188187105354
+[1] "data.frame"
+  continent
+1    Africa
+2  Americas
+3      Asia
+4    Europe
+5   Oceania

We called the same function again, but changed the second letter to +an l, so the output was returned as a list.


We can specify multiple columns to group by:


R +

+ddply(
+ .data = calcGDP(gapminder),
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$gdp)
+)


   continent year           V1
+1     Africa 1952   5992294608
+2     Africa 1957   7359188796
+3     Africa 1962   8784876958
+4     Africa 1967  11443994101
+5     Africa 1972  15072241974
+6     Africa 1977  18694898732
+7     Africa 1982  22040401045
+8     Africa 1987  24107264108
+9     Africa 1992  26256977719
+10    Africa 1997  30023173824
+11    Africa 2002  35303511424
+12    Africa 2007  45778570846
+13  Americas 1952 117738997171
+14  Americas 1957 140817061264
+15  Americas 1962 169153069442
+16  Americas 1967 217867530844
+17  Americas 1972 268159178814
+18  Americas 1977 324085389022
+19  Americas 1982 363314008350
+20  Americas 1987 439447790357
+21  Americas 1992 489899820623
+22  Americas 1997 582693307146
+23  Americas 2002 661248623419
+24  Americas 2007 776723426068
+25      Asia 1952  34095762661
+26      Asia 1957  47267432088
+27      Asia 1962  60136869012
+28      Asia 1967  84648519224
+29      Asia 1972 124385747313
+30      Asia 1977 159802590186
+31      Asia 1982 194429049919
+32      Asia 1987 241784763369
+33      Asia 1992 307100497486
+34      Asia 1997 387597655323
+35      Asia 2002 458042336179
+36      Asia 2007 627513635079
+37    Europe 1952  84971341466
+38    Europe 1957 109989505140
+39    Europe 1962 138984693095
+40    Europe 1967 173366641137
+41    Europe 1972 218691462733
+42    Europe 1977 255367522034
+43    Europe 1982 279484077072
+44    Europe 1987 316507473546
+45    Europe 1992 342703247405
+46    Europe 1997 383606933833
+47    Europe 2002 436448815097
+48    Europe 2007 493183311052
+49   Oceania 1952  54157223944
+50   Oceania 1957  66826828013
+51   Oceania 1962  82336453245
+52   Oceania 1967 105958863585
+53   Oceania 1972 134112109227
+54   Oceania 1977 154707711162
+55   Oceania 1982 176177151380
+56   Oceania 1987 209451563998
+57   Oceania 1992 236319179826
+58   Oceania 1997 289304255183
+59   Oceania 2002 345236880176
+60   Oceania 2007 403657044512

R +

+daply(
+ .data = calcGDP(gapminder),
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$gdp)
+)


+continent          1952         1957         1962         1967         1972
+  Africa     5992294608   7359188796   8784876958  11443994101  15072241974
+  Americas 117738997171 140817061264 169153069442 217867530844 268159178814
+  Asia      34095762661  47267432088  60136869012  84648519224 124385747313
+  Europe    84971341466 109989505140 138984693095 173366641137 218691462733
+  Oceania   54157223944  66826828013  82336453245 105958863585 134112109227
+          year
+continent          1977         1982         1987         1992         1997
+  Africa    18694898732  22040401045  24107264108  26256977719  30023173824
+  Americas 324085389022 363314008350 439447790357 489899820623 582693307146
+  Asia     159802590186 194429049919 241784763369 307100497486 387597655323
+  Europe   255367522034 279484077072 316507473546 342703247405 383606933833
+  Oceania  154707711162 176177151380 209451563998 236319179826 289304255183
+          year
+continent          2002         2007
+  Africa    35303511424  45778570846
+  Americas 661248623419 776723426068
+  Asia     458042336179 627513635079
+  Europe   436448815097 493183311052
+  Oceania  345236880176 403657044512

You can use these functions in place of for loops (and +it is usually faster to do so). To replace a for loop, put the code that +was in the body of the for loop inside an anonymous +function.


R +

+d_ply(
+  .data=gapminder,
+  .variables = "continent",
+  .fun = function(x) {
+    meanGDPperCap <- mean(x$gdpPercap)
+    print(paste(
+      "The mean GDP per capita for", unique(x$continent),
+      "is", format(meanGDPperCap, big.mark=",")
+   ))
+  }
+)


[1] "The mean GDP per capita for Africa is 2,193.755"
+[1] "The mean GDP per capita for Americas is 7,136.11"
+[1] "The mean GDP per capita for Asia is 7,902.15"
+[1] "The mean GDP per capita for Europe is 14,469.48"
+[1] "The mean GDP per capita for Oceania is 18,621.61"
+ +

Tip: printing numbers +


The format function can be used to make numeric values +“pretty” for printing out in messages.

+ +

Challenge 2 +


Calculate the average life expectancy per continent and year. Which +had the longest and shortest in 2007? Which had the greatest change in +between 1952 and 2007?

+ +

R +

+solution <- ddply(
+ .data = gapminder,
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$lifeExp)
+)
+solution_2007 <- solution[solution$year == 2007, ]

Oceania had the longest average life expectancy in 2007 and Africa +the lowest.


R +

+solution_1952_2007 <- cbind(solution[solution$year == 1952, ], solution_2007)
+difference_1952_2007 <- data.frame(continent = solution_1952_2007$continent,
+                                   year_1952 = solution_1952_2007[[3]],
+                                   year_2007 = solution_1952_2007[[6]],
+                                   difference = solution_1952_2007[[6]] - solution_1952_2007[[3]])

Asia had the greatest difference, and Oceania the least.

+ +

Alternate Challenge +


Without running them, which of the following will calculate the +average life expectancy per continent:

  1.

R +

+ddply(
+  .data = gapminder,
+  .variables = gapminder$continent,
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)

  2.

R +

+ddply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = mean(dataGroup$lifeExp)
+)

  3.

R +

+ddply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)

  4.

R +

+adply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)
+ +

Answer 3 will calculate the average life expectancy per +continent.

+ +

Keypoints +

  • Use the plyr package to split data, apply functions to +subsets, and combine the results.
  • +

Content from Data Frame Manipulation with dplyr


Last updated on 2023-10-26

+ +




  • How can I manipulate data frames without repeating myself?
  • +


  • To be able to use the six main data frame manipulation ‘verbs’ with +pipes in dplyr.
  • To understand how group_by() and +summarize() can be combined to summarize datasets.
  • To be able to analyze a subset of data using logical filtering.

Manipulation of data frames means many things to many researchers: we +often select certain observations (rows) or variables (columns), we +often group the data by a certain variable(s), or we even calculate +summary statistics. We can do these operations using the normal base R +operations:


R +

+mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])


[1] 2193.755

R +

+mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])


[1] 7136.11

R +

+mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])


[1] 7902.15

But this isn’t very nice because there is a fair bit of +repetition. Repeating yourself will cost you time, both now and later, +and potentially introduce some nasty bugs.


The dplyr package +


Luckily, the dplyr +package provides a number of very useful functions for manipulating data +frames in a way that will reduce the above repetition, reduce the +probability of making errors, and probably even save you some typing. As +an added bonus, you might even find the dplyr grammar +easier to read.

+ +

Tip: Tidyverse +


The dplyr package belongs to a broader family of opinionated +R packages designed for data science called the “Tidyverse”. These +packages are specifically designed to work harmoniously together. Some +of these packages will be covered in this course, but you can find +more complete information here: https://www.tidyverse.org/.


Here we’re going to cover 5 of the most commonly used functions as +well as using pipes (%>%) to combine them.

  1. select()
  2. filter()
  3. group_by()
  4. summarize()
  5. mutate()

If you have not installed this package earlier, please do +so:


R +

+install.packages("dplyr")

Now let’s load the package:


R +

+library("dplyr")

Using select() +


If, for example, we wanted to move forward with only a few of the +variables in our data frame we could use the select() +function. This will keep only the variables you select.


R +

+year_country_gdp <- select(gapminder, year, country, gdpPercap)

Diagram illustrating use of select function to select two columns of a data frame

If we want to remove only one column from the gapminder +data, for example the continent column, we can put a minus sign in front of its name:


R +

+smaller_gapminder_data <- select(gapminder, -continent)

If we open up year_country_gdp we’ll see that it only +contains the year, country and gdpPercap. Above we used ‘normal’ +grammar, but the strengths of dplyr lie in combining +several functions using pipes. Since the pipes grammar is unlike +anything we’ve seen in R before, let’s repeat what we’ve done above +using pipes.


R +

+year_country_gdp <- gapminder %>% select(year, country, gdpPercap)

To help you understand why we wrote that in that way, let’s walk +through it step by step. First we summon the gapminder data frame and +pass it on, using the pipe symbol %>%, to the next step, +which is the select() function. In this case we don’t +specify which data object we use in the select() function +since it gets that from the previous pipe. Fun Fact: +There is a good chance you have encountered pipes before in the shell. +In R, a pipe symbol is %>% while in the shell it is +| but the concept is the same!
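To make the walk-through above concrete, here is a minimal sketch that checks that the piped call and the ordinary call give the same result. The `toy` data frame is a made-up stand-in for gapminder and is not part of the lesson data.

```r
library(dplyr)

# Hypothetical stand-in for gapminder, for illustration only
toy <- data.frame(
  country   = c("A", "A", "B", "B"),
  year      = c(2002, 2007, 2002, 2007),
  gdpPercap = c(100, 120, 300, 310),
  continent = c("X", "X", "Y", "Y")
)

# The pipe hands `toy` to select() as its first argument
piped  <- toy %>% select(year, country, gdpPercap)
nested <- select(toy, year, country, gdpPercap)

# Both forms should produce the same data frame
identical(piped, nested)
```

The piped form reads left to right, which tends to matter more once several steps are chained.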

+ +

Tip: Renaming data frame columns in dplyr +


In Chapter 4 we covered how you can rename columns with base R by +assigning a value to the output of the names() function. +Like selecting columns with base R, this is a bit cumbersome, but thankfully dplyr has a +rename() function.


Within a pipeline, the syntax is +rename(new_name = old_name). For example, we may want to +rename the gdpPercap column name from our select() +statement above.


R +

+tidy_gdp <- year_country_gdp %>% rename(gdp_per_capita = gdpPercap)
+head(tidy_gdp)


  year     country gdp_per_capita
+1 1952 Afghanistan       779.4453
+2 1957 Afghanistan       820.8530
+3 1962 Afghanistan       853.1007
+4 1967 Afghanistan       836.1971
+5 1972 Afghanistan       739.9811
+6 1977 Afghanistan       786.1134

Using filter() +


If we now want to move forward with the above, but only with European +countries, we can combine select() and +filter():


R +

+year_country_gdp_euro <- gapminder %>%
+    filter(continent == "Europe") %>%
+    select(year, country, gdpPercap)

If we now want to show life expectancy of European countries but only +for a specific year (e.g., 2007), we can do as below.


R +

+europe_lifeExp_2007 <- gapminder %>%
+  filter(continent == "Europe", year == 2007) %>%
+  select(country, lifeExp)
+ +

Challenge 1 +


Write a single command (which can span multiple lines and includes +pipes) that will produce a data frame that has the African values for +lifeExp, country and year, but +not for other continents. How many rows does your data frame have and +why?

+ +

R +

+year_country_lifeExp_Africa <- gapminder %>%
+                           filter(continent == "Africa") %>%
+                           select(year, country, lifeExp)

As with last time, first we pass the gapminder data frame to the +filter() function, then we pass the filtered version of the +gapminder data frame to the select() function. +Note: The order of operations is very important in this +case. If we used ‘select’ first, filter would not be able to find the +variable continent since we would have removed it in the previous +step.
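The note about ordering can be tried out on a small example. The sketch below uses a made-up `toy` data frame (not the lesson data) and wraps the wrong ordering in `tryCatch()` so that the script keeps running after the error.

```r
library(dplyr)

# Hypothetical data, for illustration only
toy <- data.frame(
  country   = c("A", "B", "C"),
  continent = c("X", "Y", "X"),
  lifeExp   = c(50, 60, 70)
)

# filter() first: `continent` is still available
ok <- toy %>%
  filter(continent == "X") %>%
  select(country, lifeExp)

# select() first: `continent` has been dropped, so filter() fails;
# tryCatch() turns the error into a plain message string
wrong_order <- tryCatch(
  toy %>%
    select(country, lifeExp) %>%
    filter(continent == "X"),
  error = function(e) "filter() could not find `continent`"
)
```

Here `ok` keeps the two rows for continent "X", while `wrong_order` ends up holding the message string instead of a data frame.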


Using group_by() +


Now, we were supposed to be reducing the error-prone repetitiveness +of what can be done with base R, but up to now we haven’t done that +since we would have to repeat the above for each continent. Instead of +filter(), which will only pass observations that meet your +criteria (in the above: continent=="Europe"), we can use +group_by(), which will essentially use every unique +criterion that you could have used in filter().


R +

+str(gapminder)


'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...

R +

+str(gapminder %>% group_by(continent))


gropd_df [1,704 × 6] (S3: grouped_df/tbl_df/tbl/data.frame)
+ $ country  : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num [1:1704] 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr [1:1704] "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
+ - attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
+  ..$ continent: chr [1:5] "Africa" "Americas" "Asia" "Europe" ...
+  ..$ .rows    : list<int> [1:5] 
+  .. ..$ : int [1:624] 25 26 27 28 29 30 31 32 33 34 ...
+  .. ..$ : int [1:300] 49 50 51 52 53 54 55 56 57 58 ...
+  .. ..$ : int [1:396] 1 2 3 4 5 6 7 8 9 10 ...
+  .. ..$ : int [1:360] 13 14 15 16 17 18 19 20 21 22 ...
+  .. ..$ : int [1:24] 61 62 63 64 65 66 67 68 69 70 ...
+  .. ..@ ptype: int(0) 
+  ..- attr(*, ".drop")= logi TRUE

You will notice that the structure of the data frame where we used +group_by() (grouped_df) is not the same as the +original gapminder (data.frame). A +grouped_df can be thought of as a list where +each item in the list is a data.frame which +contains only the rows that correspond to a particular value of +continent (at least in the example above).
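The "list of data frames" picture can be inspected directly with a few dplyr helpers. This is a minimal sketch on a made-up `toy` data frame; the helper functions are part of dplyr, but the data and object names are assumptions for illustration.

```r
library(dplyr)

# Hypothetical data, for illustration only
toy <- data.frame(
  continent = c("X", "Y", "X", "Y", "Y"),
  lifeExp   = c(50, 60, 70, 65, 55)
)

grouped <- toy %>% group_by(continent)

n_groups(grouped)              # how many groups there are
group_keys(grouped)            # one row per distinct continent value
group_size(grouped)            # number of rows in each group
pieces <- group_split(grouped) # a list with one data frame per group
```

`group_split()` makes the mental model literal: each element of `pieces` holds only the rows for one value of `continent`.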

Diagram illustrating how the group by function organizes a data frame into groups

Using summarize() +


The above was a bit on the uneventful side but +group_by() is much more exciting in conjunction with +summarize(). This will allow us to create new variable(s) +by using functions that repeat for each of the continent-specific data +frames. That is to say, using the group_by() function, we +split our original data frame into multiple pieces, then we can run +functions (e.g. mean() or sd()) within +summarize().


R +

+gdp_bycontinents <- gapminder %>%
+    group_by(continent) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap))
Diagram illustrating the use of group by and summarize together to create a new variable

R +

continent mean_gdpPercap
+     <fctr>          <dbl>
+1    Africa       2193.755
+2  Americas       7136.110
+3      Asia       7902.150
+4    Europe      14469.476
+5   Oceania      18621.609

That allowed us to calculate the mean gdpPercap for each continent, +but it gets even better.

+ +

Challenge 2 +


Calculate the average life expectancy per country. Which has the +longest average life expectancy and which has the shortest average life +expectancy?

+ +

R +

+lifeExp_bycountry <- gapminder %>%
+   group_by(country) %>%
+   summarize(mean_lifeExp = mean(lifeExp))
+lifeExp_bycountry %>%
+   filter(mean_lifeExp == min(mean_lifeExp) | mean_lifeExp == max(mean_lifeExp))


# A tibble: 2 × 2
+  country      mean_lifeExp
+  <chr>               <dbl>
+1 Iceland              76.5
+2 Sierra Leone         36.8

Another way to do this is to use the dplyr function +arrange(), which arranges the rows in a data frame +according to the order of one or more variables from the data frame. It +has similar syntax to other functions from the dplyr +package. You can use desc() inside arrange() +to sort in descending order.


R +

+lifeExp_bycountry %>%
+   arrange(mean_lifeExp) %>%
+   head(1)


# A tibble: 1 × 2
+  country      mean_lifeExp
+  <chr>               <dbl>
+1 Sierra Leone         36.8

R +

+lifeExp_bycountry %>%
+   arrange(desc(mean_lifeExp)) %>%
+   head(1)


# A tibble: 1 × 2
+  country mean_lifeExp
+  <chr>          <dbl>
+1 Iceland         76.5

Alphabetical order works too


R +

+lifeExp_bycountry %>%
+   arrange(desc(country)) %>%
+   head(1)


# A tibble: 1 × 2
+  country  mean_lifeExp
+  <chr>           <dbl>
+1 Zimbabwe         52.7

The function group_by() allows us to group by multiple +variables. Let’s group by year and +continent.


R +

+gdp_bycontinents_byyear <- gapminder %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
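The message above is informational: after summarizing by two variables, the result is still grouped by the first one. If an ungrouped result is wanted, one option is to pass `.groups = "drop"` to `summarize()`. A minimal sketch on made-up data (the `toy` data frame is an assumption, not the lesson data):

```r
library(dplyr)

# Hypothetical data, for illustration only
toy <- data.frame(
  continent = c("X", "X", "X", "Y", "Y"),
  year      = c(2002, 2002, 2007, 2002, 2007),
  gdpPercap = c(100, 200, 150, 300, 310)
)

by_cont_year <- toy %>%
  group_by(continent, year) %>%
  # .groups = "drop" returns an ungrouped result and silences the message
  summarize(mean_gdpPercap = mean(gdpPercap), .groups = "drop")

# Check whether any grouping is left on the result
is_grouped_df(by_cont_year)
```

Calling `ungroup()` after `summarize()` is another way to reach the same ungrouped result.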

That is already quite powerful, but it gets even better! You’re not +limited to defining 1 new variable in summarize().


R +

+gdp_pop_bycontinents_byyear <- gapminder %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.

count() and n() +


A very common operation is to count the number of observations for +each group. The dplyr package comes with two related +functions that help with this.


For instance, if we wanted to check the number of countries included +in the dataset for the year 2002, we can use the count() +function. It takes the name of one or more columns that contain the +groups we are interested in, and we can optionally sort the results in +descending order by adding sort=TRUE:


R +

+gapminder %>%
+    filter(year == 2002) %>%
+    count(continent, sort = TRUE)


  continent  n
+1    Africa 52
+2      Asia 33
+3    Europe 30
+4  Americas 25
+5   Oceania  2

If we need to use the number of observations in calculations, the +n() function is useful. It will return the total number of +observations in the current group rather than counting the number of +observations in each group within a specific column. For instance, if we +wanted to get the standard error of the life expectancy per +continent:


R +

+gapminder %>%
+    group_by(continent) %>%
+    summarize(se_le = sd(lifeExp)/sqrt(n()))


# A tibble: 5 × 2
+  continent se_le
+  <chr>     <dbl>
+1 Africa    0.366
+2 Americas  0.540
+3 Asia      0.596
+4 Europe    0.286
+5 Oceania   0.775
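To see how the two functions relate, the sketch below computes the same group counts once with `count()` and once with `group_by()` plus `n()`. The `toy` data frame is made up for illustration.

```r
library(dplyr)

# Hypothetical data, for illustration only
toy <- data.frame(
  continent = c("X", "X", "X", "Y", "Y"),
  lifeExp   = c(50, 60, 70, 65, 55)
)

# count() groups, counts and ungroups in one step
with_count <- toy %>% count(continent)

# The same counts written out with group_by() and n()
with_n <- toy %>%
  group_by(continent) %>%
  summarize(n = n())

# Both approaches should give the same counts
identical(with_count$n, with_n$n)
```

`count()` is the shorter choice when only the counts are needed; `n()` is the one to reach for inside a larger `summarize()` call.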

You can also chain together several summary operations; in this case +calculating the minimum, maximum, +mean and se of each continent’s per-country +life-expectancy:


R +

+gapminder %>%
+    group_by(continent) %>%
+    summarize(
+      mean_le = mean(lifeExp),
+      min_le = min(lifeExp),
+      max_le = max(lifeExp),
+      se_le = sd(lifeExp)/sqrt(n()))


# A tibble: 5 × 5
+  continent mean_le min_le max_le se_le
+  <chr>       <dbl>  <dbl>  <dbl> <dbl>
+1 Africa       48.9   23.6   76.4 0.366
+2 Americas     64.7   37.6   80.7 0.540
+3 Asia         60.1   28.8   82.6 0.596
+4 Europe       71.9   43.6   81.8 0.286
+5 Oceania      74.3   69.1   81.2 0.775

Using mutate() +


We can also create new variables prior to (or even after) summarizing +information using mutate().


R +

+gdp_pop_bycontinents_byyear <- gapminder %>%
+    mutate(gdp_billion = gdpPercap*pop/10^9) %>%
+    group_by(continent,year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop),
+              mean_gdp_billion = mean(gdp_billion),
+              sd_gdp_billion = sd(gdp_billion))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.

Connect mutate with logical filtering: ifelse +


When creating new variables, we can combine this with a logical +condition. A simple combination of mutate() and +ifelse() facilitates filtering right where it is needed: in +the moment of creating something new. This easy-to-read statement is a +fast and powerful way of discarding certain data (even though the +overall dimension of the data frame will not change) or for updating +values depending on this given condition.


R +

+## keeping all data but "filtering" on a certain condition
+# calculate GDP only for observations with a life expectancy above 25
+gdp_pop_bycontinents_byyear_above25 <- gapminder %>%
+    mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop),
+              mean_gdp_billion = mean(gdp_billion),
+              sd_gdp_billion = sd(gdp_billion))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
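One thing to keep in mind with this pattern is that `mean()` and `sd()` return `NA` for any group that contains an `NA`, unless `na.rm = TRUE` is supplied. The sketch below shows the difference on a made-up `toy` data frame; the column names are assumptions for illustration.

```r
library(dplyr)

# Hypothetical data, for illustration only
toy <- data.frame(
  continent = c("X", "X", "Y", "Y"),
  lifeExp   = c(20, 50, 60, 70),
  gdp       = c(1, 2, 3, 4)
)

toy_summary <- toy %>%
  # rows that fail the condition get NA instead of a value
  mutate(gdp_kept = ifelse(lifeExp > 25, gdp, NA)) %>%
  group_by(continent) %>%
  summarize(
    mean_default = mean(gdp_kept),               # NA if the group has an NA
    mean_na_rm   = mean(gdp_kept, na.rm = TRUE)  # ignores the NA rows
  )
```

Whether to drop the `NA` values is an analysis decision; the default behaviour at least makes the affected groups visible.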

R +

+## updating only if a certain condition is fulfilled
+# for life expectancies above 40 years, the GDP to be expected in the future is scaled
+gdp_future_bycontinents_byyear_high_lifeExp <- gapminder %>%
+    mutate(gdp_futureExpectation = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              mean_gdpPercap_expected = mean(gdp_futureExpectation))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.

Combining dplyr and ggplot2 +


First install and load ggplot2:


R +

+install.packages("ggplot2")

R +

+library("ggplot2")

In the plotting lesson we looked at how to make a multi-panel figure +by adding a layer of facet panels using ggplot2. Here is +the code we used (with some extra comments):


R +

+# Filter countries located in the Americas
+americas <- gapminder[gapminder$continent == "Americas", ]
+# Make the plot
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))

This code makes the right plot but it also creates an intermediate +variable (americas) that we might not have any other uses +for. Just as we used %>% to pipe data along a chain of +dplyr functions we can use it to pass data to +ggplot(). Because %>% supplies the first +argument of a function we don’t need to specify the data = +argument in the ggplot() function. By combining +dplyr and ggplot2 functions we can make the +same figure without creating any new variables or modifying the +data.


R +

+gapminder %>%
+  # Filter countries located in the Americas
+  filter(continent == "Americas") %>%
+  # Make the plot
+  ggplot(mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))

More examples of using the function mutate() and the +ggplot2 package.


R +

+gapminder %>%
+  # extract first letter of country name into new column
+  mutate(startsWith = substr(country, 1, 1)) %>%
+  # only keep countries starting with A or Z
+  filter(startsWith %in% c("A", "Z")) %>%
+  # plot lifeExp into facets
+  ggplot(aes(x = year, y = lifeExp, colour = continent)) +
+  geom_line() +
+  facet_wrap(vars(country)) +
+  theme_minimal()
+ +

Advanced Challenge +


Calculate the average life expectancy in 2002 of 2 randomly selected +countries for each continent. Then arrange the continent names in +reverse order. Hint: Use the dplyr +functions arrange() and sample_n(), they have +similar syntax to other dplyr functions.

+ +

R +

+lifeExp_2countries_bycontinents <- gapminder %>%
+   filter(year==2002) %>%
+   group_by(continent) %>%
+   sample_n(2) %>%
+   summarize(mean_lifeExp=mean(lifeExp)) %>%
+   arrange(desc(continent))

Other great resources +

+ +
+ +

Keypoints +

  • Use the dplyr package to manipulate data frames.
  • Use select() to choose variables from a data +frame.
  • Use filter() to choose data based on values.
  • Use group_by() and summarize() to work +with subsets of data.
  • Use mutate() to create new variables.

Content from Data Frame Manipulation with tidyr


Last updated on 2023-10-26

+ +




  • How can I change the layout of a data frame?


  • To understand the concepts of ‘longer’ and ‘wider’ data frame +formats and be able to convert between them with +tidyr.

Researchers often want to reshape their data frames from ‘wide’ to +‘longer’ layouts, or vice-versa. The ‘long’ layout or format is +where:

  • each column is a variable
  • each row is an observation

In the purely ‘long’ (or ‘longest’) format, you usually have 1 column +for the observed variable and the other columns are ID variables.


For the ‘wide’ format each row is often a site/subject/patient and +you have multiple observation variables containing the same type of +data. These can be either repeated observations over time, or +observation of multiple variables (or a mix of both). You may find that data +input is simpler, or that some other applications prefer, the ‘wide’ +format. However, many of R’s functions have been designed +assuming you have ‘longer’ formatted data. This tutorial will help you +efficiently transform your data shape regardless of original format.

Diagram illustrating the difference between a wide versus long layout of a data frame

Long and wide data frame layouts mainly affect readability. For +humans, the wide format is often more intuitive since we can often see +more of the data on the screen due to its shape. However, the long +format is more machine readable and is closer to the formatting of +databases. The ID variables in our data frames are similar to the fields +in a database and observed variables are like the database values.


Getting started +


First install the packages if you haven’t already done so (you +probably installed dplyr in the previous lesson):


R +

+install.packages("tidyr")
+install.packages("dplyr")

Load the packages


R +

+library("tidyr")
+library("dplyr")

First, let’s look at the structure of our original gapminder data +frame:


R +

+str(gapminder)


'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...
+ +

Challenge 1 +


Is gapminder a purely long, purely wide, or some intermediate +format?

+ +

The original gapminder data.frame is in an intermediate format. It is +not purely long since it has multiple observation variables +(pop, lifeExp, gdpPercap).


Sometimes, as with the gapminder dataset, we have multiple types of +observed data. It is somewhere in between the purely ‘long’ and ‘wide’ +data formats. We have 3 “ID variables” (continent, +country, year) and 3 “Observation variables” +(pop,lifeExp,gdpPercap). This +intermediate format can be preferred despite not having ALL observations +in 1 column given that all 3 observation variables have different units. +There are few operations that would need us to make this data frame any +longer (i.e. 4 ID variables and 1 Observation variable).


While using many of the functions in R, which are often vector based, +you usually do not want to do mathematical operations on values with +different units. For example, using the purely long format, a single +mean for all of the values of population, life expectancy, and GDP would +not be meaningful since it would return the mean of values with 3 +incompatible units. The solution is that we first manipulate the data +either by grouping (see the lesson on dplyr), or we change +the structure of the data frame. Note: Some plotting +functions in R actually work better in the wide format data.


From wide to long format with pivot_longer() +


Until now, we’ve been using the nicely formatted original gapminder +dataset, but ‘real’ data (i.e. our own research data) will never be so +well organized. Here let’s start with the wide formatted version of the +gapminder dataset.


Download the wide version of the gapminder data from here and save it in your data +folder.


We’ll load the data file and look at it. Note: we don’t want our +continent and country columns to be factors, so we use the +stringsAsFactors argument for read.csv() to disable +that.


R +

+gap_wide <- read.csv("data/gapminder_wide.csv", stringsAsFactors = FALSE)
+str(gap_wide)


'data.frame':	142 obs. of  38 variables:
+ $ continent     : chr  "Africa" "Africa" "Africa" "Africa" ...
+ $ country       : chr  "Algeria" "Angola" "Benin" "Botswana" ...
+ $ gdpPercap_1952: num  2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num  3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num  2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num  3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num  4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num  4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num  5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num  5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num  5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num  4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num  5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num  6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num  43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num  45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num  48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num  51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num  54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num  58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num  61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num  65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num  67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num  69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num  71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num  72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num  9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num  10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num  11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num  12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num  14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num  17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num  20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num  23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num  26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num  29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : int  31287142 10866106 7026113 1630347 12251209 7021078 15929988 4048013 8835739 614382 ...
+ $ pop_2007      : int  33333216 12420476 8078314 1639131 14326203 8390505 17696293 4369038 10238807 710960 ...
Diagram illustrating the wide format of the gapminder data frame

To change this very wide data frame layout back to our nice, +intermediate (or longer) layout, we will use one of the two available +pivot functions from the tidyr package. To +convert from wide to a longer format, we will use the +pivot_longer() function. pivot_longer() makes +datasets longer by increasing the number of rows and decreasing the +number of columns, or ‘lengthening’ your observation variables into a +single variable.

Diagram illustrating how pivot longer reorganizes a data frame from a wide to long format

R +

+gap_long <- gap_wide %>%
+  pivot_longer(
+    cols = c(starts_with('pop'), starts_with('lifeExp'), starts_with('gdpPercap')),
+    names_to = "obstype_year", values_to = "obs_values"
+  )
+str(gap_long)


tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ continent   : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
+ $ country     : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
+ $ obstype_year: chr [1:5112] "pop_1952" "pop_1957" "pop_1962" "pop_1967" ...
+ $ obs_values  : num [1:5112] 9279525 10270856 11000948 12760499 14760787 ...

Here we have used piping syntax which is similar to what we were +doing in the previous lesson with dplyr. In fact, these are compatible +and you can use a mix of tidyr and dplyr functions by piping them +together.


We first provide to pivot_longer() a vector of column +names that will be pivoted into longer format. We could type out all the +observation variables, but as in the select() function (see +dplyr lesson), we can use the starts_with() +helper to select all variables that start with the desired character +string. pivot_longer() also allows the alternative syntax +of using the - symbol to identify which variables are not +to be pivoted (i.e. ID variables).


The next arguments to pivot_longer() are +names_to for naming the column that will contain the new ID +variable (obstype_year) and values_to for +naming the new amalgamated observation variable +(obs_values). We supply these new column names as +strings.

Diagram illustrating the long format of the gapminder data

R +

+gap_long <- gap_wide %>%
+  pivot_longer(
+    cols = c(-continent, -country),
+    names_to = "obstype_year", values_to = "obs_values"
+  )
+str(gap_long)


tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ continent   : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
+ $ country     : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
+ $ obstype_year: chr [1:5112] "gdpPercap_1952" "gdpPercap_1957" "gdpPercap_1962" "gdpPercap_1967" ...
+ $ obs_values  : num [1:5112] 2449 3014 2551 3247 4183 ...

That may seem trivial with this particular data frame, but sometimes +you have 1 ID variable and 40 observation variables with irregular +variable names. The flexibility is a huge time saver!


Now obstype_year actually contains 2 pieces of +information, the observation type +(pop, lifeExp, or gdpPercap) and +the year. We can use the separate() function +to split the character strings into multiple variables:


R +

+gap_long <- gap_long %>% separate(obstype_year, into = c('obs_type', 'year'), sep = "_")
+gap_long$year <- as.integer(gap_long$year)
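As an alternative to pivoting first and separating afterwards, `pivot_longer()` can split the column names while it pivots, using its `names_sep` argument. The sketch below shows this on a made-up `toy_wide` data frame with the same naming pattern as `gap_wide`; the toy data and object names are assumptions for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data following the <metric>_<year> naming pattern
toy_wide <- data.frame(
  country      = c("A", "B"),
  pop_1952     = c(10, 20),
  pop_1957     = c(11, 21),
  lifeExp_1952 = c(40, 50),
  lifeExp_1957 = c(41, 51)
)

toy_long <- toy_wide %>%
  pivot_longer(
    cols = -country,
    # two new columns, filled from the parts of each old column name
    names_to = c("obs_type", "year"),
    names_sep = "_",
    values_to = "obs_values"
  )

# The year part is still text after splitting, so convert it
toy_long$year <- as.integer(toy_long$year)
```

The two-step version in the lesson is easier to follow when learning; the one-step version avoids the intermediate `obstype_year` column.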
+ +

Challenge 2 +


Using gap_long, calculate the mean life expectancy, +population, and gdpPercap for each continent. Hint: use +the group_by() and summarize() functions we +learned in the dplyr lesson

+ +

R +

+gap_long %>% group_by(continent, obs_type) %>%
+   summarize(means=mean(obs_values))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.


# A tibble: 15 × 3
+# Groups:   continent [5]
+   continent obs_type       means
+   <chr>     <chr>          <dbl>
+ 1 Africa    gdpPercap     2194. 
+ 2 Africa    lifeExp         48.9
+ 3 Africa    pop        9916003. 
+ 4 Americas  gdpPercap     7136. 
+ 5 Americas  lifeExp         64.7
+ 6 Americas  pop       24504795. 
+ 7 Asia      gdpPercap     7902. 
+ 8 Asia      lifeExp         60.1
+ 9 Asia      pop       77038722. 
+10 Europe    gdpPercap    14469. 
+11 Europe    lifeExp         71.9
+12 Europe    pop       17169765. 
+13 Oceania   gdpPercap    18622. 
+14 Oceania   lifeExp         74.3
+15 Oceania   pop        8874672. 

From long to intermediate format with pivot_wider() +


It is always good to check work. So, let’s use the second +pivot function, pivot_wider(), to ‘widen’ our +observation variables back out. pivot_wider() is the +opposite of pivot_longer(), making a dataset wider by +increasing the number of columns and decreasing the number of rows. We +can use pivot_wider() to pivot or reshape our +gap_long to the original intermediate format or the widest +format. Let’s start with the intermediate format.


The pivot_wider() function takes names_from +and values_from arguments.


To names_from we supply the column name whose contents +will be pivoted into new output columns in the widened data frame. The +corresponding values will be added from the column named in the +values_from argument.


R +

+gap_normal <- gap_long %>%
+  pivot_wider(names_from = obs_type, values_from = obs_values)
+dim(gap_normal)


[1] 1704    6
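The same reshaping can be followed more easily on a handful of rows. This is a minimal sketch on a made-up `toy_long` data frame, not the lesson data, showing how the values in the `names_from` column become new column names.

```r
library(tidyr)

# Hypothetical long data, for illustration only
toy_long <- data.frame(
  country    = c("A", "A", "B", "B"),
  obs_type   = c("pop", "lifeExp", "pop", "lifeExp"),
  obs_values = c(10, 40, 20, 50)
)

# Each distinct obs_type becomes a column; obs_values fills the cells
toy_wider <- pivot_wider(
  toy_long,
  names_from = obs_type,
  values_from = obs_values
)
```

Four long rows collapse into two wide rows, one per country, with `pop` and `lifeExp` as separate columns.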

R +

+dim(gapminder)


[1] 1704    6

R +

+names(gap_normal)


[1] "continent" "country"   "year"      "gdpPercap" "lifeExp"   "pop"      

R +

+names(gapminder)


[1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"

Now we’ve got an intermediate data frame gap_normal with +the same dimensions as the original gapminder, but the +order of the variables is different. Let’s fix that before checking if +they are all.equal().


R +

+gap_normal <- gap_normal[, names(gapminder)]
+all.equal(gap_normal, gapminder)


[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                
+[3] "Component \"country\": 1704 string mismatches"                                         
+[4] "Component \"pop\": Mean relative difference: 1.634504"                                 
+[5] "Component \"continent\": 1212 string mismatches"                                       
+[6] "Component \"lifeExp\": Mean relative difference: 0.203822"                             
+[7] "Component \"gdpPercap\": Mean relative difference: 1.162302"                           

R +

+head(gap_normal)


# A tibble: 6 × 6
+  country  year      pop continent lifeExp gdpPercap
+  <chr>   <int>    <dbl> <chr>       <dbl>     <dbl>
+1 Algeria  1952  9279525 Africa       43.1     2449.
+2 Algeria  1957 10270856 Africa       45.7     3014.
+3 Algeria  1962 11000948 Africa       48.3     2551.
+4 Algeria  1967 12760499 Africa       51.4     3247.
+5 Algeria  1972 14760787 Africa       54.5     4183.
+6 Algeria  1977 17152804 Africa       58.0     4910.

R +

+head(gapminder)


      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134

We’re almost there: the original was sorted by country, +then year.


R +

+gap_normal <- gap_normal %>% arrange(country, year)
+all.equal(gap_normal, gapminder)


[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                

That’s great! We’ve gone from the longest format back to the +intermediate and we didn’t introduce any errors in our code.
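The two remaining messages concern only the class attribute: the pivoted result is a tibble (three class names) while the original is a plain data.frame (one class name). If a clean comparison of the contents is wanted, one option is to convert the tibble back with `as.data.frame()` first. A minimal sketch on made-up objects (`toy_df` and `toy_tbl` are assumptions for illustration):

```r
library(dplyr)

# Hypothetical data: the same contents as a data.frame and as a tibble
toy_df  <- data.frame(country = c("A", "B"), pop = c(10, 20))
toy_tbl <- as_tibble(toy_df)

# Converting the tibble back compares the contents only
contents_equal <- all.equal(as.data.frame(toy_tbl), toy_df)
contents_equal
```

Applied to the lesson objects, the equivalent call would be `all.equal(as.data.frame(gap_normal), gapminder)`.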


Now let’s convert the long all the way back to the wide. In the wide +format, we will keep country and continent as ID variables and pivot the +observations across the 3 metrics +(pop,lifeExp,gdpPercap) and time +(year). First we need to create appropriate labels for all +our new variables (time*metric combinations) and we also need to unify +our ID variables to simplify the process of defining +gap_wide.


R +

+gap_temp <- gap_long %>% unite(var_ID, continent, country, sep = "_")
+str(gap_temp)


tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ var_ID    : chr [1:5112] "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" ...
+ $ obs_type  : chr [1:5112] "gdpPercap" "gdpPercap" "gdpPercap" "gdpPercap" ...
+ $ year      : int [1:5112] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ obs_values: num [1:5112] 2449 3014 2551 3247 4183 ...

R +

+gap_temp <- gap_long %>%
+    unite(ID_var, continent, country, sep = "_") %>%
+    unite(var_names, obs_type, year, sep = "_")
+str(gap_temp)


tibble [5,112 × 3] (S3: tbl_df/tbl/data.frame)
+ $ ID_var    : chr [1:5112] "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" ...
+ $ var_names : chr [1:5112] "gdpPercap_1952" "gdpPercap_1957" "gdpPercap_1962" "gdpPercap_1967" ...
+ $ obs_values: num [1:5112] 2449 3014 2551 3247 4183 ...

Using unite() we now have a single ID variable which is a combination of continent and country, and we have defined variable names. We’re now ready to pipe in pivot_wider().


R +

+gap_wide_new <- gap_long %>%
+  unite(ID_var, continent, country, sep = "_") %>%
+  unite(var_names, obs_type, year, sep = "_") %>%
+  pivot_wider(names_from = var_names, values_from = obs_values)
+str(gap_wide_new)


tibble [142 × 37] (S3: tbl_df/tbl/data.frame)
+ $ ID_var        : chr [1:142] "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
+ $ gdpPercap_1952: num [1:142] 2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num [1:142] 3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num [1:142] 2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num [1:142] 3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num [1:142] 4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num [1:142] 4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num [1:142] 5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num [1:142] 5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num [1:142] 5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num [1:142] 4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num [1:142] 5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num [1:142] 6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num [1:142] 43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num [1:142] 45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num [1:142] 48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num [1:142] 51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num [1:142] 54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num [1:142] 58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num [1:142] 61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num [1:142] 65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num [1:142] 67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num [1:142] 69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num [1:142] 71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num [1:142] 72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num [1:142] 9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num [1:142] 10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num [1:142] 11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num [1:142] 12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num [1:142] 14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num [1:142] 17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num [1:142] 20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num [1:142] 23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num [1:142] 26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num [1:142] 29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : num [1:142] 31287142 10866106 7026113 1630347 12251209 ...
+ $ pop_2007      : num [1:142] 33333216 12420476 8078314 1639131 14326203 ...
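The three steps can be easier to follow on a data set small enough to read in full. The sketch below assumes the tidyr package is installed and uses made-up values in a hypothetical `toy_long` data frame (2 countries, 2 years, 2 metrics); it writes each step as a separate assignment rather than a pipeline.

```r
library(tidyr)

# Made-up long-format data (not the gapminder values):
# 2 countries x 2 years x 2 metrics = 8 rows
toy_long <- data.frame(
  continent  = rep(c("Africa", "Asia"), each = 4),
  country    = rep(c("Algeria", "Japan"), each = 4),
  year       = rep(c(1952L, 1957L), times = 4),
  obs_type   = rep(rep(c("pop", "lifeExp"), each = 2), times = 2),
  obs_values = c(9.3, 10.3, 43.1, 45.7, 86.5, 91.6, 63.0, 65.5)
)

# Step 1: merge the two ID columns into one
toy_id <- unite(toy_long, ID_var, continent, country, sep = "_")

# Step 2: build one label per metric/year combination
toy_named <- unite(toy_id, var_names, obs_type, year, sep = "_")

# Step 3: spread the labels across columns, one row per ID
toy_wide <- pivot_wider(toy_named, names_from = var_names, values_from = obs_values)

dim(toy_wide)   # 2 rows (countries), 5 columns (ID_var + 4 metric/year columns)
```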
+ +

Challenge 3 +


Take this one step further and create a gap_ludicrously_wide format by pivoting over countries, year and the 3 metrics. Hint: this new data frame should only have 5 rows.

+ +

R +

+gap_ludicrously_wide <- gap_long %>%
+   unite(var_names, obs_type, year, country, sep = "_") %>%
+   pivot_wider(names_from = var_names, values_from = obs_values)

Now we have a great ‘wide’ format data frame, but the ID_var could be more usable. Let’s separate it into 2 variables with separate():


R +

+# Option 1: split the ID column of the wide data frame built above
+gap_wide_betterID <- separate(gap_wide_new, ID_var, c("continent", "country"), sep = "_")
+# Option 2: rebuild the same result from gap_long in a single pipeline
+gap_wide_betterID <- gap_long %>%
+    unite(ID_var, continent, country, sep = "_") %>%
+    unite(var_names, obs_type, year, sep = "_") %>%
+    pivot_wider(names_from = var_names, values_from = obs_values) %>%
+    separate(ID_var, c("continent", "country"), sep = "_")
+str(gap_wide_betterID)


tibble [142 × 38] (S3: tbl_df/tbl/data.frame)
+ $ continent     : chr [1:142] "Africa" "Africa" "Africa" "Africa" ...
+ $ country       : chr [1:142] "Algeria" "Angola" "Benin" "Botswana" ...
+ $ gdpPercap_1952: num [1:142] 2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num [1:142] 3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num [1:142] 2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num [1:142] 3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num [1:142] 4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num [1:142] 4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num [1:142] 5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num [1:142] 5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num [1:142] 5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num [1:142] 4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num [1:142] 5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num [1:142] 6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num [1:142] 43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num [1:142] 45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num [1:142] 48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num [1:142] 51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num [1:142] 54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num [1:142] 58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num [1:142] 61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num [1:142] 65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num [1:142] 67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num [1:142] 69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num [1:142] 71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num [1:142] 72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num [1:142] 9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num [1:142] 10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num [1:142] 11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num [1:142] 12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num [1:142] 14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num [1:142] 17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num [1:142] 20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num [1:142] 23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num [1:142] 26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num [1:142] 29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : num [1:142] 31287142 10866106 7026113 1630347 12251209 ...
+ $ pop_2007      : num [1:142] 33333216 12420476 8078314 1639131 14326203 ...

R +

+all.equal(gap_wide, gap_wide_betterID)


[1] "Attributes: < Component \"class\": Lengths (1, 3) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                
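As a small illustration of what separate() does to the ID column, here is a sketch on a made-up two-row data frame (the name `toy_wide` and its values are hypothetical, and tidyr is assumed to be installed).

```r
library(tidyr)

# Made-up wide data with a combined ID column
toy_wide <- data.frame(
  ID_var   = c("Africa_Algeria", "Asia_Japan"),
  pop_1952 = c(9.3, 86.5)
)

# Split ID_var at the underscore into two new columns
toy_split <- separate(toy_wide, ID_var, c("continent", "country"), sep = "_")

names(toy_split)   # "continent" "country" "pop_1952"
```

One design point worth keeping in mind: splitting on "_" relies on the separator not appearing inside the values themselves. If a name could contain an underscore, a different separator in unite(), or the `extra = "merge"` argument of separate(), may be worth considering.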

There and back again!


Other great resources +

+ +
+ +

Keypoints +

  • Use the tidyr package to change the layout of data frames.
  • Use pivot_longer() to go from wide to longer layout.
  • Use pivot_wider() to go from long to wider layout.

Content from Producing Reports With knitr


Last updated on 2023-10-26

+ +




  • How can I integrate software and reports?


  • Understand the value of writing reproducible reports
  • Learn how to recognise and compile the basic components of an R Markdown file
  • Become familiar with R code chunks, and understand their purpose, structure and options
  • Demonstrate the use of inline chunks for weaving R outputs into text blocks, for example when discussing the results of some calculations
  • Be aware of alternative output formats to which an R Markdown file can be exported

Data analysis reports +


Data analysts tend to write a lot of reports, describing their +analyses and results, for their collaborators or to document their work +for future reference.


Many new users begin by first writing a single R script containing +all of their work, and then share the analysis by emailing the script +and various graphs as attachments. But this can be cumbersome, requiring +a lengthy discussion to explain which attachment was which result.


Writing formal reports with Word or LaTeX can simplify this +process by incorporating both the analysis report and output graphs into +a single document. But tweaking formatting to make figures look correct +and fixing obnoxious page breaks can be tedious and lead to a lengthy +“whack-a-mole” game of fixing new mistakes resulting from a single +formatting change.


Creating a report as a web page (an html file) using R Markdown makes things easier. The report can be one long stream, so tall figures that wouldn’t ordinarily fit on one page can be kept at full size and are easier to read, since the reader can simply keep scrolling. Additionally, the formatting of an R Markdown document is simple and easy to modify, allowing you to spend more time on your analyses instead of writing reports.


Literate programming +


Ideally, such analysis reports are reproducible documents: +If an error is discovered, or if some additional subjects are added to +the data, you can just re-compile the report and get the new or +corrected results rather than having to reconstruct figures, paste them +into a Word document, and hand-edit various detailed results.


The key R package here is knitr. It allows you +to create a document that is a mixture of text and chunks of code. When +the document is processed by knitr, chunks of code will be +executed, and graphs or other results will be inserted into the final +document.


This sort of idea has been called “literate programming”.


knitr allows you to mix basically any type of text with +code from different programming languages, but we recommend that you use +R Markdown, which mixes Markdown with R. Markdown is a light-weight +mark-up language for creating web pages.


Creating an R Markdown file +


Within RStudio, click File → New File → R Markdown and you’ll get a +dialog box like this:

Screenshot of the New R Markdown file dialogue box in RStudio

You can stick with the default (HTML output), but give it a +title.


Basic components of R Markdown +


The initial chunk of text (header) contains instructions for R to +specify what kind of document will be created, and the options chosen. +You can use the header to give your document a title, author, date, and +tell it what type of output you want to produce. In this case, we’re +creating an html document.

---
+title: "Initial R Markdown document"
+author: "Karl Broman"
+date: "April 23, 2015"
+output: html_document
+---

You can delete any of those fields if you don’t want them included. +The double-quotes aren’t strictly necessary in this case. +They’re mostly needed if you want to include a colon in the title.


RStudio creates the document with some example text to get you started. Note that this example text contains chunks that begin with ```{r} and end with ```.


These are chunks of R code that will be executed by +knitr and replaced by their results. More on this +later.
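As a concrete illustration, the template that RStudio generates typically includes a chunk whose body is the single call shown below; in the .Rmd file it sits between the opening and closing chunk delimiters. This is a sketch of a chunk body only, and the exact template text may differ between RStudio versions. The name `chunk_result` is introduced here just so the value can be inspected.

```r
# Body of an example chunk: summarise the built-in `cars` dataset.
# When the document is knitted, this code is run and its printed
# summary table appears in the output document below the code.
chunk_result <- summary(cars)
chunk_result
```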


Markdown +


Markdown is a system for writing web pages by marking up the text +much as you would in an email rather than writing html code. The +marked-up text gets converted to html, replacing the marks with +the proper html code.


For now, let’s delete all of the stuff that’s there and write a bit +of markdown.


You make things bold using two asterisks, like this: **bold**, and you make things italic by using underscores, like this: _italics_.


You can make a bulleted list by writing a list with hyphens or +asterisks with a space between the list and other text, like this:

A list:
+* bold with double-asterisks
+* italics with underscores
+* code-type font with backticks

or like this:

A second list:
+- bold with double-asterisks
+- italics with underscores
+- code-type font with backticks

Each will appear as:

  • bold with double-asterisks
  • italics with underscores
  • code-type font with backticks

You can use whatever method you prefer, but be consistent. +This maintains the readability of your code.


You can make a numbered list by just using numbers. You can even use +the same number over and over if you want:

1. bold with double-asterisks
+1. italics with underscores
+1. code-type font with backticks

This will appear as:

  1. bold with double-asterisks
  2. italics with underscores
  3. code-type font with backticks

You can make section headers of different sizes by initiating a line +with some number of # symbols:

# Title
+## Main section
+### Sub-section
+#### Sub-sub section

You compile the R Markdown document to an html webpage by +clicking the “Knit” button in the upper-left.

+ +

Challenge 1 +


Create a new R Markdown document. Delete all of the R code chunks and +write a bit of Markdown (some sections, some italicized text, and an +itemized list).


Convert the document to a webpage.

+ +

In RStudio, select File > New file > R Markdown…


Delete the placeholder text and add the following:

# Introduction
+## Background on Data
+This report uses the *gapminder* dataset, which has columns that include:
+* country
+* continent
+* year
+* lifeExp
+* pop
+* gdpPercap
+## Background on Methods

Then click the ‘Knit’ button on the toolbar to generate an html +document (webpage).


A bit more Markdown +


You can make a hyperlink like this: +[Carpentries Home Page](https://carpentries.org/).


You can include an image file like this: +![The Carpentries Logo](https://carpentries.org/assets/img/TheCarpentries.svg)


You can do subscripts (e.g., F₂) with F~2~ and superscripts (e.g., F²) with F^2^.


If you know how to write equations in LaTeX, you can use +$ $ and $$ $$ to insert math equations, like +$E = mc^2$ and

$$y = \mu + \sum_{i=1}^p \beta_i x_i + \epsilon$$

You can review Markdown syntax by navigating to the “Markdown Quick +Reference” under the “Help” field in the toolbar at the top of +RStudio.


R code chunks +


The real power of Markdown comes from mixing markdown with chunks of code. This is R Markdown. When the document is processed, the R code will be executed; if it produces figures, the figures will be inserted in the final document.


The main code chunks look like this:

+```{r load_data}
+# R code for this chunk goes here, for example reading in your data
+```

That is, you place a chunk of R code between ```{r chunk_name} and ```. You should give each chunk a unique name: names help you to locate errors and, if any graphs are produced, the file names are based on the name of the code chunk that produced them. You can create code chunks quickly in RStudio using the shortcuts Ctrl+Alt+I on Windows and Linux, or Cmd+Option+I on Mac.

+ +

Challenge 2 +


Add code chunks to:

  • Load the ggplot2 package
  • Read the gapminder data
  • Create a plot
+ +
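+```{r load-ggplot2}
+library(ggplot2)
+```
+```{r read-gapminder-data}
+gapminder <- read.csv("gapminder.csv")  # assumed file name; adjust to your data
+```
+```{r make-plot}
+plot(lifeExp ~ year, data = gapminder)
+```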

How things get compiled +


When you press the “Knit” button, the R Markdown document is +processed by knitr +and a plain Markdown document is produced (as well as, potentially, a +set of figure files): the R code is executed and replaced by both the +input and the output; if figures are produced, links to those figures +are included.


The Markdown and figure documents are then processed by the tool pandoc, which converts the +Markdown file into an html file, with the figures embedded.
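The first of these two stages can be tried from the R console. The sketch below assumes the knitr package is installed (it skips the knitting step otherwise) and uses a made-up, two-line R Markdown source written to a temporary file; the file content and variable names are hypothetical.

```r
# Build a tiny R Markdown source in a temporary file
fence    <- strrep("`", 3)                  # the three-backtick chunk delimiter
rmd_file <- tempfile(fileext = ".Rmd")
md_file  <- sub("\\.Rmd$", ".md", rmd_file)
writeLines(c(
  "# Demo",
  "",
  paste0(fence, "{r add-numbers}"),
  "1 + 3",
  fence
), rmd_file)

if (requireNamespace("knitr", quietly = TRUE)) {
  # Stage 1: knitr runs the chunk and writes a plain Markdown file
  knitr::knit(rmd_file, output = md_file, quiet = TRUE)
  md_lines <- readLines(md_file)
  print(md_lines)   # contains both the code and its printed result
}

# Stage 2 (not run here): pandoc converts the Markdown file to html.
# rmarkdown::render(rmd_file) carries out both stages in one call.
```

Looking at the intermediate Markdown file can help when a report does not render as expected, since it shows exactly what knitr handed on to pandoc.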


Chunk options +


There are a variety of options to affect how the code chunks are +treated. Here are some examples:

  • Use echo=FALSE to avoid having the code itself shown.
  • Use results="hide" to avoid having any results printed.
  • Use eval=FALSE to have the code shown but not evaluated.
  • Use warning=FALSE and message=FALSE to hide any warnings or messages produced.
  • Use fig.height and fig.width to control the size of the figures produced (in inches).

So you might write:

+```{r load_libraries, echo=FALSE, message=FALSE}
+library(ggplot2)
+```

Often there will be particular options that you’ll want to use +repeatedly; for this, you can set global chunk options, like +so:

+```{r global_options, echo=FALSE}
+knitr::opts_chunk$set(fig.path="Figs/", message=FALSE, warning=FALSE,
+                      echo=FALSE, results="hide", fig.width=11)
+```

The fig.path option defines where the figures will be +saved. The / here is really important; without it, the +figures would be saved in the standard place but just with names that +begin with Figs.


If you have multiple R Markdown files in a common directory, you +might want to use fig.path to define separate prefixes for +the figure file names, like fig.path="Figs/cleaning-" and +fig.path="Figs/analysis-".
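A simplified model of why the trailing slash matters: the figure file name is formed by placing fig.path directly in front of the chunk name, with no separator added. The helper `figure_file()` below is hypothetical, written only to illustrate that string-joining behaviour; it is not a knitr function, and the chunk name `make-plot` is borrowed from the challenge above.

```r
# Hypothetical helper mimicking how a figure file name is assembled:
# <fig.path><chunk name>-<figure number>.<extension>
figure_file <- function(fig_path, chunk_name, n = 1, ext = "png") {
  paste0(fig_path, chunk_name, "-", n, ".", ext)
}

with_slash    <- figure_file("Figs/", "make-plot")           # "Figs/make-plot-1.png"
without_slash <- figure_file("Figs", "make-plot")            # "Figsmake-plot-1.png"
with_prefix   <- figure_file("Figs/cleaning-", "make-plot")  # "Figs/cleaning-make-plot-1.png"
```

With the slash, the figure lands inside a Figs directory; without it, "Figs" simply becomes the start of the file name.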

+ +

Challenge 3 +


Use chunk options to control the size of a figure and to hide the +code.

+ +
+```{r echo = FALSE, fig.width = 3}
+plot(lifeExp ~ year, data = gapminder)
+```

You can review all of the R chunk options by navigating +to the “R Markdown Cheat Sheet” under the “Cheatsheets” section of the +“Help” field in the toolbar at the top of RStudio.


Inline R code +


You can make every number in your report reproducible. Use +`r and ` for an in-line code chunk, like so: +`r round(some_value, 2)`. The code will be executed and +replaced with the value of the result.


Don’t let these in-line chunks get split across lines.


Perhaps precede the paragraph with a larger code chunk that does +calculations and defines variables, with include=FALSE for +that larger chunk (which is the same as echo=FALSE and +results="hide").


Rounding can produce differences in output in such situations. You +may want 2.0, but round(2.03, 1) will give +just 2.


The myround +function in the R/broman +package handles this.
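If you would rather stay within base R, one possible approach is to format the number as text with a fixed number of decimal places. The sketch below uses a made-up value `x`; the variable names are for illustration only.

```r
x <- 2.03

rounded   <- round(x, 1)                     # numeric 2; prints without the trailing zero
formatted <- format(round(x, 1), nsmall = 1) # character "2.0"
printed   <- sprintf("%.1f", x)              # character "2.0"

rounded
formatted
printed
```

Both format() and sprintf() return character strings, which is fine for text inserted into a report but means the result should not be reused in further arithmetic.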

+ +

Challenge 4 +


Try out a bit of in-line R code.

+ +

Writing `r 2 + 2` in a sentence inserts the computed value when the document is knitted: Here’s some inline code to determine that 2 + 2 = 4.


Other output options +


You can also convert R Markdown to a PDF or a Word document. Click +the little triangle next to the “Knit” button to get a drop-down menu. +Or you could put pdf_document or word_document +in the initial header of the file.

+ +

Tip: Creating PDF documents +


Creating .pdf documents may require installation of some extra +software. The R package tinytex provides some tools to help +make this process easier for R users. With tinytex +installed, run tinytex::install_tinytex() to install the +required software (you’ll only need to do this once) and then when you +knit to pdf tinytex will automatically detect and install +any additional LaTeX packages that are needed to produce the pdf +document. Visit the tinytex +website for more information.

+ +

Tip: Visual markdown editing in RStudio +


RStudio versions 1.4 and later include a visual markdown editing mode. In visual editing mode, markdown expressions (like **bold words**) are transformed to the formatted appearance (bold words) as you type. This mode also includes a toolbar at the top with basic formatting buttons, similar to what you might see in common word processing programs. You can turn visual editing on and off by pressing the button in the top right corner of your R Markdown document.


Resources +

+ +
+ +

Keypoints +

  • Mix reporting written in R Markdown with software written in R.
  • Specify chunk options to control formatting.
  • Use knitr to convert these documents into PDF and other formats.

Content from Writing Good Software


Last updated on 2023-10-26

+ +




  • How can I write software that other people can use?


  • Describe best practices for writing R and explain the justification for each.

Structure your project folder +


Keep your project folder structured, organized and tidy, by creating +subfolders for your code files, manuals, data, binaries, output plots, +etc. It can be done completely manually, or with the help of RStudio’s +New Project functionality, or a designated package, such as +ProjectTemplate.

+ +

Tip: ProjectTemplate - a possible +solution +


One way to automate the management of projects is to install the +third-party package, ProjectTemplate. This package will set +up an ideal directory structure for project management. This is very +useful as it enables you to have your analysis pipeline/workflow +organised and structured. Together with the default RStudio project +functionality and Git you will be able to keep track of your work as +well as be able to share your work with collaborators.

  1. Install ProjectTemplate.
  2. Load the library.
  3. Initialise the project:

R +

+library(ProjectTemplate)
+create.project("../my_project_2", merge.strategy = "allow.non.conflict")

For more information on ProjectTemplate and its functionality visit +the home page ProjectTemplate


Make code readable +


The most important part of writing code is making it readable and understandable. You want someone else to be able to pick up your code and understand what it does: more often than not this someone will be you, 6 months down the line, who would otherwise be cursing your past self.


Documentation: tell us what and why, not how +


When you first start out, your comments will often describe what a +command does, since you’re still learning yourself and it can help to +clarify concepts and remind you later. However, these comments aren’t +particularly useful later on when you don’t remember what problem your +code is trying to solve. Try to also include comments that tell you +why you’re solving a problem, and what problem that +is. The how can come after that: it’s an implementation detail +you ideally shouldn’t have to worry about.


Keep your code modular +


Our recommendation is that you should separate your functions from +your analysis scripts, and store them in a separate file that you +source when you open the R session in your project. This +approach is nice because it leaves you with an uncluttered analysis +script, and a repository of useful functions that can be loaded into any +analysis script in your project. It also lets you group related +functions together easily.
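A minimal sketch of this layout. To keep the example self-contained, the functions file is written to a temporary location here; in a real project it would be an ordinary file kept alongside your scripts. The function `fahr_to_celsius()` and the file name are made up for illustration.

```r
# Create a small functions file (hypothetical content)
functions_file <- tempfile(fileext = ".R")
writeLines(c(
  "# Convert a temperature from Fahrenheit to Celsius",
  "fahr_to_celsius <- function(temp_f) {",
  "  (temp_f - 32) * 5 / 9",
  "}"
), functions_file)

# At the top of the analysis script: load the function definitions
source(functions_file)

# The analysis script itself stays short and readable
boiling_c <- fahr_to_celsius(212)
boiling_c
```

Because the definitions live in one place, a fix made in the functions file reaches every script that sources it.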


Break down problem into bite size pieces +


When you first start out, problem solving and function writing can be +daunting tasks, and hard to separate from code inexperience. Try to +break down your problem into digestible chunks and worry about the +implementation details later: keep breaking down the problem into +smaller and smaller functions until you reach a point where you can code +a solution, and build back up from there.


Know that your code is doing the right thing +


Make sure to test your functions!
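One lightweight way to do this, sketched below with base R only, is to keep a few checks next to each function and re-run them whenever the code changes. The function `fahr_to_celsius()` is a made-up example; dedicated packages such as testthat offer richer tools for the same purpose.

```r
# Hypothetical function under test
fahr_to_celsius <- function(temp_f) {
  (temp_f - 32) * 5 / 9
}

# Re-runnable checks: stopifnot() raises an error if any claim is false
stopifnot(
  isTRUE(all.equal(fahr_to_celsius(32), 0)),     # freezing point of water
  isTRUE(all.equal(fahr_to_celsius(212), 100)),  # boiling point of water
  length(fahr_to_celsius(c(32, 212))) == 2       # accepts a vector of inputs
)

# Reached only if every check above passed
tests_passed <- TRUE
```

Comparing with all.equal() rather than `==` avoids spurious failures caused by floating-point rounding.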


Don’t repeat yourself +


Functions enable easy reuse within a project. If you see blocks of +similar lines of code through your project, those are usually candidates +for being moved into functions.


If your calculations are performed through a series of functions, then the project becomes more modular and easier to change. This is especially true of functions for which a particular input always gives a particular output.
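A small sketch of moving repeated lines into a function. The data and the function name `rescale01()` are made up for illustration.

```r
heights <- c(150, 160, 170)
weights <- c(50, 60, 70)

# Repeated pattern: the same rescaling written out once per vector
heights_scaled_manual <- (heights - min(heights)) / (max(heights) - min(heights))
weights_scaled_manual <- (weights - min(weights)) / (max(weights) - min(weights))

# The same logic written once; the same input always gives the same output
rescale01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

heights_scaled <- rescale01(heights)
weights_scaled <- rescale01(weights)
```

If the rescaling rule ever needs to change, only the body of the function has to be edited, and a copy-and-paste slip such as mixing `heights` and `weights` in one line can no longer occur.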


Remember to be stylish +


Apply consistent style to your code.

+ +

Keypoints +

  • Keep your project folder structured, organized and tidy.
  • Document what and why, not how.
  • Break programs into short single-purpose functions.
  • Write re-runnable tests.
  • Don’t repeat yourself.
  • Be consistent in naming, indentation, and other aspects of style.
+ + +
+ + +
+ +
+ + + + diff --git a/android-chrome-192x192.png b/android-chrome-192x192.png new file mode 100644 index 000000000..ed3c210ab Binary files /dev/null and b/android-chrome-192x192.png differ diff --git a/android-chrome-512x512.png b/android-chrome-512x512.png new file mode 100644 index 000000000..c88d96c1c Binary files /dev/null and b/android-chrome-512x512.png differ diff --git a/apple-touch-icon.png b/apple-touch-icon.png new file mode 100644 index 000000000..8044feefd Binary files /dev/null and b/apple-touch-icon.png differ diff --git a/assets/fonts/Mulish-Bold.ttf b/assets/fonts/Mulish-Bold.ttf new file mode 100644 index 000000000..1f522d476 Binary files /dev/null and b/assets/fonts/Mulish-Bold.ttf differ diff --git a/assets/fonts/Mulish-Bold.woff b/assets/fonts/Mulish-Bold.woff new file mode 100644 index 000000000..711448ea9 Binary files /dev/null and b/assets/fonts/Mulish-Bold.woff differ diff --git a/assets/fonts/Mulish-ExtraBold.ttf b/assets/fonts/Mulish-ExtraBold.ttf new file mode 100644 index 000000000..62850fff3 Binary files /dev/null and b/assets/fonts/Mulish-ExtraBold.ttf differ diff --git a/assets/fonts/mulish-v5-latin-regular.eot b/assets/fonts/mulish-v5-latin-regular.eot new file mode 100644 index 000000000..423bcb17a Binary files /dev/null and b/assets/fonts/mulish-v5-latin-regular.eot differ diff --git a/assets/fonts/mulish-v5-latin-regular.svg b/assets/fonts/mulish-v5-latin-regular.svg new file mode 100644 index 000000000..70341f98b --- /dev/null +++ b/assets/fonts/mulish-v5-latin-regular.svg @@ -0,0 +1,305 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git 
a/assets/fonts/mulish-v5-latin-regular.ttf b/assets/fonts/mulish-v5-latin-regular.ttf new file mode 100644 index 000000000..541bb406e Binary files /dev/null and b/assets/fonts/mulish-v5-latin-regular.ttf differ diff --git a/assets/fonts/mulish-v5-latin-regular.woff b/assets/fonts/mulish-v5-latin-regular.woff new file mode 100644 index 000000000..700ec13f5 Binary files /dev/null and b/assets/fonts/mulish-v5-latin-regular.woff differ diff --git a/assets/fonts/mulish-v5-latin-regular.woff2 b/assets/fonts/mulish-v5-latin-regular.woff2 new file mode 100644 index 000000000..b244298bf Binary files /dev/null and b/assets/fonts/mulish-v5-latin-regular.woff2 differ diff --git a/assets/fonts/mulish-variablefont_wght.woff b/assets/fonts/mulish-variablefont_wght.woff new file mode 100644 index 000000000..fc425383a Binary files /dev/null and b/assets/fonts/mulish-variablefont_wght.woff differ diff --git a/assets/fonts/mulish-variablefont_wght.woff2 b/assets/fonts/mulish-variablefont_wght.woff2 new file mode 100644 index 000000000..8a233c6f9 Binary files /dev/null and b/assets/fonts/mulish-variablefont_wght.woff2 differ diff --git a/assets/images/carpentries-logo-sm.svg b/assets/images/carpentries-logo-sm.svg new file mode 100644 index 000000000..da70d40ee --- /dev/null +++ b/assets/images/carpentries-logo-sm.svg @@ -0,0 +1,7 @@ + + + + + + + \ No newline at end of file diff --git a/assets/images/carpentries-logo.svg b/assets/images/carpentries-logo.svg new file mode 100644 index 000000000..6cbe66500 --- /dev/null +++ b/assets/images/carpentries-logo.svg @@ -0,0 +1,19 @@ + + + + + + + + + + + + + + + + + + + diff --git a/assets/images/data-logo-sm.svg b/assets/images/data-logo-sm.svg new file mode 100644 index 000000000..cf489be84 --- /dev/null +++ b/assets/images/data-logo-sm.svg @@ -0,0 +1,5 @@ + + + + + diff --git a/assets/images/data-logo.svg b/assets/images/data-logo.svg new file mode 100644 index 000000000..cf489be84 --- /dev/null +++ b/assets/images/data-logo.svg @@ -0,0 
+1,5 @@ + + + + + diff --git a/assets/images/dropdown-arrow.svg b/assets/images/dropdown-arrow.svg new file mode 100644 index 000000000..a12b04b34 --- /dev/null +++ b/assets/images/dropdown-arrow.svg @@ -0,0 +1,12 @@ + + + + +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +



Last updated on 2023-10-26

+ + + +
+ +
+ + +

Please see our other R +lesson for a different presentation of these concepts.

+ + +
+ + +
+ + + diff --git a/docsearch.css b/docsearch.css new file mode 100644 index 000000000..e5f1fe1df --- /dev/null +++ b/docsearch.css @@ -0,0 +1,148 @@ +/* Docsearch -------------------------------------------------------------- */ +/* + Source: https://github.com/algolia/docsearch/ + License: MIT +*/ + +.algolia-autocomplete { + display: block; + -webkit-box-flex: 1; + -ms-flex: 1; + flex: 1 +} + +.algolia-autocomplete .ds-dropdown-menu { + width: 100%; + min-width: none; + max-width: none; + padding: .75rem 0; + background-color: #fff; + background-clip: padding-box; + border: 1px solid rgba(0, 0, 0, .1); + box-shadow: 0 .5rem 1rem rgba(0, 0, 0, .175); +} + +@media (min-width:768px) { + .algolia-autocomplete .ds-dropdown-menu { + width: 175% + } +} + +.algolia-autocomplete .ds-dropdown-menu::before { + display: none +} + +.algolia-autocomplete .ds-dropdown-menu [class^=ds-dataset-] { + padding: 0; + background-color: rgb(255,255,255); + border: 0; + max-height: 80vh; +} + +.algolia-autocomplete .ds-dropdown-menu .ds-suggestions { + margin-top: 0 +} + +.algolia-autocomplete .algolia-docsearch-suggestion { + padding: 0; + overflow: visible +} + +.algolia-autocomplete .algolia-docsearch-suggestion--category-header { + padding: .125rem 1rem; + margin-top: 0; + font-size: 1.3em; + font-weight: 500; + color: #00008B; + border-bottom: 0 +} + +.algolia-autocomplete .algolia-docsearch-suggestion--wrapper { + float: none; + padding-top: 0 +} + +.algolia-autocomplete .algolia-docsearch-suggestion--subcategory-column { + float: none; + width: auto; + padding: 0; + text-align: left +} + +.algolia-autocomplete .algolia-docsearch-suggestion--content { + float: none; + width: auto; + padding: 0 +} + +.algolia-autocomplete .algolia-docsearch-suggestion--content::before { + display: none +} + +.algolia-autocomplete .ds-suggestion:not(:first-child) .algolia-docsearch-suggestion--category-header { + padding-top: .75rem; + margin-top: .75rem; + border-top: 1px solid rgba(0, 0, 0, .1) +} 
+ +.algolia-autocomplete .ds-suggestion .algolia-docsearch-suggestion--subcategory-column { + display: block; + padding: .1rem 1rem; + margin-bottom: 0.1; + font-size: 1.0em; + font-weight: 400 + /* display: none */ +} + +.algolia-autocomplete .algolia-docsearch-suggestion--title { + display: block; + padding: .25rem 1rem; + margin-bottom: 0; + font-size: 0.9em; + font-weight: 400 +} + +.algolia-autocomplete .algolia-docsearch-suggestion--text { + padding: 0 1rem .5rem; + margin-top: -.25rem; + font-size: 0.8em; + font-weight: 400; + line-height: 1.25 +} + +.algolia-autocomplete .algolia-docsearch-footer { + width: 110px; + height: 20px; + z-index: 3; + margin-top: 10.66667px; + float: right; + font-size: 0; + line-height: 0; +} + +.algolia-autocomplete .algolia-docsearch-footer--logo { + background-image: url("data:image/svg+xml;utf8,"); + background-repeat: no-repeat; + background-position: 50%; + background-size: 100%; + overflow: hidden; + text-indent: -9000px; + width: 100%; + height: 100%; + display: block; + transform: translate(-8px); +} + +.algolia-autocomplete .algolia-docsearch-suggestion--highlight { + color: #FF8C00; + background: rgba(232, 189, 54, 0.1) +} + + +.algolia-autocomplete .algolia-docsearch-suggestion--text .algolia-docsearch-suggestion--highlight { + box-shadow: inset 0 -2px 0 0 rgba(105, 105, 105, .5) +} + +.algolia-autocomplete .ds-suggestion.ds-cursor .algolia-docsearch-suggestion--content { + background-color: rgba(192, 192, 192, .15) +} diff --git a/docsearch.js b/docsearch.js new file mode 100644 index 000000000..b35504cd3 --- /dev/null +++ b/docsearch.js @@ -0,0 +1,85 @@ +$(function() { + + // register a handler to move the focus to the search bar + // upon pressing shift + "/" (i.e. 
"?") + $(document).on('keydown', function(e) { + if (e.shiftKey && e.keyCode == 191) { + e.preventDefault(); + $("#search-input").focus(); + } + }); + + $(document).ready(function() { + // do keyword highlighting + /* modified from https://jsfiddle.net/julmot/bL6bb5oo/ */ + var mark = function() { + + var referrer = document.URL ; + var paramKey = "q" ; + + if (referrer.indexOf("?") !== -1) { + var qs = referrer.substr(referrer.indexOf('?') + 1); + var qs_noanchor = qs.split('#')[0]; + var qsa = qs_noanchor.split('&'); + var keyword = ""; + + for (var i = 0; i < qsa.length; i++) { + var currentParam = qsa[i].split('='); + + if (currentParam.length !== 2) { + continue; + } + + if (currentParam[0] == paramKey) { + keyword = decodeURIComponent(currentParam[1].replace(/\+/g, "%20")); + } + } + + if (keyword !== "") { + $(".contents").unmark({ + done: function() { + $(".contents").mark(keyword); + } + }); + } + } + }; + + mark(); + }); +}); + +/* Search term highlighting ------------------------------*/ + +function matchedWords(hit) { + var words = []; + + var hierarchy = hit._highlightResult.hierarchy; + // loop to fetch from lvl0, lvl1, etc. 
+ for (var idx in hierarchy) { + words = words.concat(hierarchy[idx].matchedWords); + } + + var content = hit._highlightResult.content; + if (content) { + words = words.concat(content.matchedWords); + } + + // return unique words + var words_uniq = [...new Set(words)]; + return words_uniq; +} + +function updateHitURL(hit) { + + var words = matchedWords(hit); + var url = ""; + + if (hit.anchor) { + url = hit.url_without_anchor + '?q=' + escape(words.join(" ")) + '#' + hit.anchor; + } else { + url = hit.url + '?q=' + escape(words.join(" ")); + } + + return url; +} diff --git a/favicon-16x16.png b/favicon-16x16.png new file mode 100644 index 000000000..d44f8acb4 Binary files /dev/null and b/favicon-16x16.png differ diff --git a/favicon-32x32.png b/favicon-32x32.png new file mode 100644 index 000000000..63441d4c3 Binary files /dev/null and b/favicon-32x32.png differ diff --git a/favicons/cp/apple-touch-icon-114x114.png b/favicons/cp/apple-touch-icon-114x114.png new file mode 100644 index 000000000..a60b75810 Binary files /dev/null and b/favicons/cp/apple-touch-icon-114x114.png differ diff --git a/favicons/cp/apple-touch-icon-120x120.png b/favicons/cp/apple-touch-icon-120x120.png new file mode 100644 index 000000000..8f20a8f12 Binary files /dev/null and b/favicons/cp/apple-touch-icon-120x120.png differ diff --git a/favicons/cp/apple-touch-icon-144x144.png b/favicons/cp/apple-touch-icon-144x144.png new file mode 100644 index 000000000..4be151b14 Binary files /dev/null and b/favicons/cp/apple-touch-icon-144x144.png differ diff --git a/favicons/cp/apple-touch-icon-152x152.png b/favicons/cp/apple-touch-icon-152x152.png new file mode 100644 index 000000000..7d1d94395 Binary files /dev/null and b/favicons/cp/apple-touch-icon-152x152.png differ diff --git a/favicons/cp/apple-touch-icon-57x57.png b/favicons/cp/apple-touch-icon-57x57.png new file mode 100644 index 000000000..92309cef2 Binary files /dev/null and b/favicons/cp/apple-touch-icon-57x57.png differ diff --git 
a/favicons/cp/apple-touch-icon-60x60.png b/favicons/cp/apple-touch-icon-60x60.png new file mode 100644 index 000000000..de8148e58 Binary files /dev/null and b/favicons/cp/apple-touch-icon-60x60.png differ diff --git a/favicons/cp/apple-touch-icon-72x72.png b/favicons/cp/apple-touch-icon-72x72.png new file mode 100644 index 000000000..81d7e3d83 Binary files /dev/null and b/favicons/cp/apple-touch-icon-72x72.png differ diff --git a/favicons/cp/apple-touch-icon-76x76.png b/favicons/cp/apple-touch-icon-76x76.png new file mode 100644 index 000000000..15bca5c77 Binary files /dev/null and b/favicons/cp/apple-touch-icon-76x76.png differ diff --git a/favicons/cp/favicon-128.png b/favicons/cp/favicon-128.png new file mode 100644 index 000000000..e612cdc15 Binary files /dev/null and b/favicons/cp/favicon-128.png differ diff --git a/favicons/cp/favicon-16x16.png b/favicons/cp/favicon-16x16.png new file mode 100644 index 000000000..65b331112 Binary files /dev/null and b/favicons/cp/favicon-16x16.png differ diff --git a/favicons/cp/favicon-196x196.png b/favicons/cp/favicon-196x196.png new file mode 100644 index 000000000..0da938b27 Binary files /dev/null and b/favicons/cp/favicon-196x196.png differ diff --git a/favicons/cp/favicon-32x32.png b/favicons/cp/favicon-32x32.png new file mode 100644 index 000000000..0c1442e39 Binary files /dev/null and b/favicons/cp/favicon-32x32.png differ diff --git a/favicons/cp/favicon-96x96.png b/favicons/cp/favicon-96x96.png new file mode 100644 index 000000000..bed74ec8d Binary files /dev/null and b/favicons/cp/favicon-96x96.png differ diff --git a/favicons/cp/favicon.ico b/favicons/cp/favicon.ico new file mode 100644 index 000000000..4f2f2f11f Binary files /dev/null and b/favicons/cp/favicon.ico differ diff --git a/favicons/cp/mstile-144x144.png b/favicons/cp/mstile-144x144.png new file mode 100644 index 000000000..4be151b14 Binary files /dev/null and b/favicons/cp/mstile-144x144.png differ diff --git a/favicons/cp/mstile-150x150.png 
b/favicons/cp/mstile-150x150.png new file mode 100644 index 000000000..bf7ad5e79 Binary files /dev/null and b/favicons/cp/mstile-150x150.png differ diff --git a/favicons/cp/mstile-310x150.png b/favicons/cp/mstile-310x150.png new file mode 100644 index 000000000..6ac804843 Binary files /dev/null and b/favicons/cp/mstile-310x150.png differ diff --git a/favicons/cp/mstile-310x310.png b/favicons/cp/mstile-310x310.png new file mode 100644 index 000000000..b77814750 Binary files /dev/null and b/favicons/cp/mstile-310x310.png differ diff --git a/favicons/cp/mstile-70x70.png b/favicons/cp/mstile-70x70.png new file mode 100644 index 000000000..e612cdc15 Binary files /dev/null and b/favicons/cp/mstile-70x70.png differ diff --git a/favicons/dc/apple-touch-icon-114x114.png b/favicons/dc/apple-touch-icon-114x114.png new file mode 100644 index 000000000..edafbda13 Binary files /dev/null and b/favicons/dc/apple-touch-icon-114x114.png differ diff --git a/favicons/dc/apple-touch-icon-120x120.png b/favicons/dc/apple-touch-icon-120x120.png new file mode 100644 index 000000000..ee145ec5c Binary files /dev/null and b/favicons/dc/apple-touch-icon-120x120.png differ diff --git a/favicons/dc/apple-touch-icon-144x144.png b/favicons/dc/apple-touch-icon-144x144.png new file mode 100644 index 000000000..bf5070144 Binary files /dev/null and b/favicons/dc/apple-touch-icon-144x144.png differ diff --git a/favicons/dc/apple-touch-icon-152x152.png b/favicons/dc/apple-touch-icon-152x152.png new file mode 100644 index 000000000..bd596c816 Binary files /dev/null and b/favicons/dc/apple-touch-icon-152x152.png differ diff --git a/favicons/dc/apple-touch-icon-57x57.png b/favicons/dc/apple-touch-icon-57x57.png new file mode 100644 index 000000000..61c152735 Binary files /dev/null and b/favicons/dc/apple-touch-icon-57x57.png differ diff --git a/favicons/dc/apple-touch-icon-60x60.png b/favicons/dc/apple-touch-icon-60x60.png new file mode 100644 index 000000000..9daad3633 Binary files /dev/null and 
b/favicons/dc/apple-touch-icon-60x60.png differ diff --git a/favicons/dc/apple-touch-icon-72x72.png b/favicons/dc/apple-touch-icon-72x72.png new file mode 100644 index 000000000..2069520fc Binary files /dev/null and b/favicons/dc/apple-touch-icon-72x72.png differ diff --git a/favicons/dc/apple-touch-icon-76x76.png b/favicons/dc/apple-touch-icon-76x76.png new file mode 100644 index 000000000..3db01ca7d Binary files /dev/null and b/favicons/dc/apple-touch-icon-76x76.png differ diff --git a/favicons/dc/favicon-128.png b/favicons/dc/favicon-128.png new file mode 100644 index 000000000..9e3de2a49 Binary files /dev/null and b/favicons/dc/favicon-128.png differ diff --git a/favicons/dc/favicon-16x16.png b/favicons/dc/favicon-16x16.png new file mode 100644 index 000000000..4c9f9b8c5 Binary files /dev/null and b/favicons/dc/favicon-16x16.png differ diff --git a/favicons/dc/favicon-196x196.png b/favicons/dc/favicon-196x196.png new file mode 100644 index 000000000..588afc213 Binary files /dev/null and b/favicons/dc/favicon-196x196.png differ diff --git a/favicons/dc/favicon-32x32.png b/favicons/dc/favicon-32x32.png new file mode 100644 index 000000000..9c2ecbfbe Binary files /dev/null and b/favicons/dc/favicon-32x32.png differ diff --git a/favicons/dc/favicon-96x96.png b/favicons/dc/favicon-96x96.png new file mode 100644 index 000000000..ff13fc06e Binary files /dev/null and b/favicons/dc/favicon-96x96.png differ diff --git a/favicons/dc/favicon.ico b/favicons/dc/favicon.ico new file mode 100644 index 000000000..e4715f329 Binary files /dev/null and b/favicons/dc/favicon.ico differ diff --git a/favicons/dc/mstile-144x144.png b/favicons/dc/mstile-144x144.png new file mode 100644 index 000000000..bf5070144 Binary files /dev/null and b/favicons/dc/mstile-144x144.png differ diff --git a/favicons/dc/mstile-150x150.png b/favicons/dc/mstile-150x150.png new file mode 100644 index 000000000..c5844cca3 Binary files /dev/null and b/favicons/dc/mstile-150x150.png differ diff --git 
a/favicons/dc/mstile-310x150.png b/favicons/dc/mstile-310x150.png new file mode 100644 index 000000000..786813af8 Binary files /dev/null and b/favicons/dc/mstile-310x150.png differ diff --git a/favicons/dc/mstile-310x310.png b/favicons/dc/mstile-310x310.png new file mode 100644 index 000000000..9580653c6 Binary files /dev/null and b/favicons/dc/mstile-310x310.png differ diff --git a/favicons/dc/mstile-70x70.png b/favicons/dc/mstile-70x70.png new file mode 100644 index 000000000..9e3de2a49 Binary files /dev/null and b/favicons/dc/mstile-70x70.png differ diff --git a/favicons/lc/apple-touch-icon-114x114.png b/favicons/lc/apple-touch-icon-114x114.png new file mode 100644 index 000000000..6c83127ca Binary files /dev/null and b/favicons/lc/apple-touch-icon-114x114.png differ diff --git a/favicons/lc/apple-touch-icon-120x120.png b/favicons/lc/apple-touch-icon-120x120.png new file mode 100644 index 000000000..8334648f1 Binary files /dev/null and b/favicons/lc/apple-touch-icon-120x120.png differ diff --git a/favicons/lc/apple-touch-icon-144x144.png b/favicons/lc/apple-touch-icon-144x144.png new file mode 100644 index 000000000..5f32151ed Binary files /dev/null and b/favicons/lc/apple-touch-icon-144x144.png differ diff --git a/favicons/lc/apple-touch-icon-152x152.png b/favicons/lc/apple-touch-icon-152x152.png new file mode 100644 index 000000000..4e5c177ce Binary files /dev/null and b/favicons/lc/apple-touch-icon-152x152.png differ diff --git a/favicons/lc/apple-touch-icon-57x57.png b/favicons/lc/apple-touch-icon-57x57.png new file mode 100644 index 000000000..61f9c9c74 Binary files /dev/null and b/favicons/lc/apple-touch-icon-57x57.png differ diff --git a/favicons/lc/apple-touch-icon-60x60.png b/favicons/lc/apple-touch-icon-60x60.png new file mode 100644 index 000000000..ccb5ada1c Binary files /dev/null and b/favicons/lc/apple-touch-icon-60x60.png differ diff --git a/favicons/lc/apple-touch-icon-72x72.png b/favicons/lc/apple-touch-icon-72x72.png new file mode 100644 index 
000000000..517d459af Binary files /dev/null and b/favicons/lc/apple-touch-icon-72x72.png differ diff --git a/favicons/lc/apple-touch-icon-76x76.png b/favicons/lc/apple-touch-icon-76x76.png new file mode 100644 index 000000000..17454b311 Binary files /dev/null and b/favicons/lc/apple-touch-icon-76x76.png differ diff --git a/favicons/lc/favicon-128.png b/favicons/lc/favicon-128.png new file mode 100644 index 000000000..9d781c901 Binary files /dev/null and b/favicons/lc/favicon-128.png differ diff --git a/favicons/lc/favicon-16x16.png b/favicons/lc/favicon-16x16.png new file mode 100644 index 000000000..3c20abcc0 Binary files /dev/null and b/favicons/lc/favicon-16x16.png differ diff --git a/favicons/lc/favicon-196x196.png b/favicons/lc/favicon-196x196.png new file mode 100644 index 000000000..46baaf8f9 Binary files /dev/null and b/favicons/lc/favicon-196x196.png differ diff --git a/favicons/lc/favicon-32x32.png b/favicons/lc/favicon-32x32.png new file mode 100644 index 000000000..ed6701ea1 Binary files /dev/null and b/favicons/lc/favicon-32x32.png differ diff --git a/favicons/lc/favicon-96x96.png b/favicons/lc/favicon-96x96.png new file mode 100644 index 000000000..bc468c73a Binary files /dev/null and b/favicons/lc/favicon-96x96.png differ diff --git a/favicons/lc/favicon.ico b/favicons/lc/favicon.ico new file mode 100644 index 000000000..5c14e8091 Binary files /dev/null and b/favicons/lc/favicon.ico differ diff --git a/favicons/lc/mstile-144x144.png b/favicons/lc/mstile-144x144.png new file mode 100644 index 000000000..5f32151ed Binary files /dev/null and b/favicons/lc/mstile-144x144.png differ diff --git a/favicons/lc/mstile-150x150.png b/favicons/lc/mstile-150x150.png new file mode 100644 index 000000000..924953a84 Binary files /dev/null and b/favicons/lc/mstile-150x150.png differ diff --git a/favicons/lc/mstile-310x150.png b/favicons/lc/mstile-310x150.png new file mode 100644 index 000000000..e4dcda444 Binary files /dev/null and b/favicons/lc/mstile-310x150.png 
differ diff --git a/favicons/lc/mstile-310x310.png b/favicons/lc/mstile-310x310.png new file mode 100644 index 000000000..a12c87632 Binary files /dev/null and b/favicons/lc/mstile-310x310.png differ diff --git a/favicons/lc/mstile-70x70.png b/favicons/lc/mstile-70x70.png new file mode 100644 index 000000000..9d781c901 Binary files /dev/null and b/favicons/lc/mstile-70x70.png differ diff --git a/favicons/swc/apple-touch-icon-114x114.png b/favicons/swc/apple-touch-icon-114x114.png new file mode 100644 index 000000000..e5125f8c4 Binary files /dev/null and b/favicons/swc/apple-touch-icon-114x114.png differ diff --git a/favicons/swc/apple-touch-icon-120x120.png b/favicons/swc/apple-touch-icon-120x120.png new file mode 100644 index 000000000..0f97a0aec Binary files /dev/null and b/favicons/swc/apple-touch-icon-120x120.png differ diff --git a/favicons/swc/apple-touch-icon-144x144.png b/favicons/swc/apple-touch-icon-144x144.png new file mode 100644 index 000000000..7441446cc Binary files /dev/null and b/favicons/swc/apple-touch-icon-144x144.png differ diff --git a/favicons/swc/apple-touch-icon-152x152.png b/favicons/swc/apple-touch-icon-152x152.png new file mode 100644 index 000000000..45cc338e5 Binary files /dev/null and b/favicons/swc/apple-touch-icon-152x152.png differ diff --git a/favicons/swc/apple-touch-icon-57x57.png b/favicons/swc/apple-touch-icon-57x57.png new file mode 100644 index 000000000..e180a4a32 Binary files /dev/null and b/favicons/swc/apple-touch-icon-57x57.png differ diff --git a/favicons/swc/apple-touch-icon-60x60.png b/favicons/swc/apple-touch-icon-60x60.png new file mode 100644 index 000000000..c96fd6ce7 Binary files /dev/null and b/favicons/swc/apple-touch-icon-60x60.png differ diff --git a/favicons/swc/apple-touch-icon-72x72.png b/favicons/swc/apple-touch-icon-72x72.png new file mode 100644 index 000000000..aae014aa7 Binary files /dev/null and b/favicons/swc/apple-touch-icon-72x72.png differ diff --git a/favicons/swc/apple-touch-icon-76x76.png 
b/favicons/swc/apple-touch-icon-76x76.png new file mode 100644 index 000000000..2167f94a7 Binary files /dev/null and b/favicons/swc/apple-touch-icon-76x76.png differ diff --git a/favicons/swc/favicon-128.png b/favicons/swc/favicon-128.png new file mode 100644 index 000000000..f61df620c Binary files /dev/null and b/favicons/swc/favicon-128.png differ diff --git a/favicons/swc/favicon-16x16.png b/favicons/swc/favicon-16x16.png new file mode 100644 index 000000000..2d20a4061 Binary files /dev/null and b/favicons/swc/favicon-16x16.png differ diff --git a/favicons/swc/favicon-196x196.png b/favicons/swc/favicon-196x196.png new file mode 100644 index 000000000..2a20d3a6f Binary files /dev/null and b/favicons/swc/favicon-196x196.png differ diff --git a/favicons/swc/favicon-32x32.png b/favicons/swc/favicon-32x32.png new file mode 100644 index 000000000..f622b73a1 Binary files /dev/null and b/favicons/swc/favicon-32x32.png differ diff --git a/favicons/swc/favicon-96x96.png b/favicons/swc/favicon-96x96.png new file mode 100644 index 000000000..5e57f66a5 Binary files /dev/null and b/favicons/swc/favicon-96x96.png differ diff --git a/favicons/swc/favicon.ico b/favicons/swc/favicon.ico new file mode 100644 index 000000000..f771790f2 Binary files /dev/null and b/favicons/swc/favicon.ico differ diff --git a/favicons/swc/mstile-144x144.png b/favicons/swc/mstile-144x144.png new file mode 100644 index 000000000..7441446cc Binary files /dev/null and b/favicons/swc/mstile-144x144.png differ diff --git a/favicons/swc/mstile-150x150.png b/favicons/swc/mstile-150x150.png new file mode 100644 index 000000000..d1594bcb8 Binary files /dev/null and b/favicons/swc/mstile-150x150.png differ diff --git a/favicons/swc/mstile-310x150.png b/favicons/swc/mstile-310x150.png new file mode 100644 index 000000000..f7d58b2b9 Binary files /dev/null and b/favicons/swc/mstile-310x150.png differ diff --git a/favicons/swc/mstile-310x310.png b/favicons/swc/mstile-310x310.png new file mode 100644 index 
000000000..b632b421c Binary files /dev/null and b/favicons/swc/mstile-310x310.png differ diff --git a/favicons/swc/mstile-70x70.png b/favicons/swc/mstile-70x70.png new file mode 100644 index 000000000..f61df620c Binary files /dev/null and b/favicons/swc/mstile-70x70.png differ diff --git a/fig/01-rstudio-script.png b/fig/01-rstudio-script.png new file mode 100644 index 000000000..babbd2949 Binary files /dev/null and b/fig/01-rstudio-script.png differ diff --git a/fig/01-rstudio.png b/fig/01-rstudio.png new file mode 100644 index 000000000..0840386af Binary files /dev/null and b/fig/01-rstudio.png differ diff --git a/fig/06-rmd-generate-figures.sh b/fig/06-rmd-generate-figures.sh new file mode 100755 index 000000000..4cc231322 --- /dev/null +++ b/fig/06-rmd-generate-figures.sh @@ -0,0 +1,7 @@ +inkscape --export-png=06-rmd-inequality.0.png 06-rmd-inequality.0.svg +# use ImageMagick to grab top and bottom halves +# (surely there's a better way ... too much space at the bottom of the first) +convert 06-rmd-inequality.0.png -crop 100%x50% tmp.png +mv tmp-0.png 06-rmd-inequality.1.png +mv tmp-1.png 06-rmd-inequality.2.png + diff --git a/fig/06-rmd-inequality.0.png b/fig/06-rmd-inequality.0.png new file mode 100644 index 000000000..aa6d3f1e9 Binary files /dev/null and b/fig/06-rmd-inequality.0.png differ diff --git a/fig/06-rmd-inequality.0.svg b/fig/06-rmd-inequality.0.svg new file mode 100644 index 000000000..b8953dcbe --- /dev/null +++ b/fig/06-rmd-inequality.0.svg @@ -0,0 +1,311 @@ + + + + + + + + + + image/svg+xml + + + + + + + c("a", "b", "c") + c("a", "c") + + + + + + + FALSE + TRUE + ?? + ?? + c("a", "b", "c") + c("a", "c") + + + + + + + FALSE + TRUE + c("a"... 
+ FALSE + != + != + + diff --git a/fig/06-rmd-inequality.1.png b/fig/06-rmd-inequality.1.png new file mode 100644 index 000000000..580038505 Binary files /dev/null and b/fig/06-rmd-inequality.1.png differ diff --git a/fig/06-rmd-inequality.2.png b/fig/06-rmd-inequality.2.png new file mode 100644 index 000000000..d3f438dc1 Binary files /dev/null and b/fig/06-rmd-inequality.2.png differ diff --git a/fig/08-plot-ggplot2-rendered-axis-scale-1.png b/fig/08-plot-ggplot2-rendered-axis-scale-1.png new file mode 100644 index 000000000..3349b33e9 Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-axis-scale-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-blank-ggplot-1.png b/fig/08-plot-ggplot2-rendered-blank-ggplot-1.png new file mode 100644 index 000000000..14c48d3bf Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-blank-ggplot-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-ch1-sol-1.png b/fig/08-plot-ggplot2-rendered-ch1-sol-1.png new file mode 100644 index 000000000..6dcaa2a72 Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-ch1-sol-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-ch2-sol-1.png b/fig/08-plot-ggplot2-rendered-ch2-sol-1.png new file mode 100644 index 000000000..4559e25e1 Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-ch2-sol-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-ch3-sol-1.png b/fig/08-plot-ggplot2-rendered-ch3-sol-1.png new file mode 100644 index 000000000..8af5f41fa Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-ch3-sol-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-ch4a-sol-1.png b/fig/08-plot-ggplot2-rendered-ch4a-sol-1.png new file mode 100644 index 000000000..0f755f4c7 Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-ch4a-sol-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-ch4b-sol-1.png b/fig/08-plot-ggplot2-rendered-ch4b-sol-1.png new file mode 100644 index 000000000..e07c93aac Binary files /dev/null and 
b/fig/08-plot-ggplot2-rendered-ch4b-sol-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-ch5-sol-1.png b/fig/08-plot-ggplot2-rendered-ch5-sol-1.png new file mode 100644 index 000000000..a2cd52f11 Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-ch5-sol-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-facet-1.png b/fig/08-plot-ggplot2-rendered-facet-1.png new file mode 100644 index 000000000..bc7c4d02c Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-facet-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-ggplot-with-aes-1.png b/fig/08-plot-ggplot2-rendered-ggplot-with-aes-1.png new file mode 100644 index 000000000..70214d8b0 Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-ggplot-with-aes-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-lifeExp-layer-example-1-1.png b/fig/08-plot-ggplot2-rendered-lifeExp-layer-example-1-1.png new file mode 100644 index 000000000..2238582b0 Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-lifeExp-layer-example-1-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-lifeExp-line-1.png b/fig/08-plot-ggplot2-rendered-lifeExp-line-1.png new file mode 100644 index 000000000..a915f00ba Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-lifeExp-line-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-lifeExp-line-by-1.png b/fig/08-plot-ggplot2-rendered-lifeExp-line-by-1.png new file mode 100644 index 000000000..56157b9b6 Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-lifeExp-line-by-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-lifeExp-line-point-1.png b/fig/08-plot-ggplot2-rendered-lifeExp-line-point-1.png new file mode 100644 index 000000000..c02ce6add Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-lifeExp-line-point-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-lifeExp-vs-gdpPercap-scatter-1.png b/fig/08-plot-ggplot2-rendered-lifeExp-vs-gdpPercap-scatter-1.png new file mode 100644 index 000000000..44db5466d Binary files 
/dev/null and b/fig/08-plot-ggplot2-rendered-lifeExp-vs-gdpPercap-scatter-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-lifeExp-vs-gdpPercap-scatter3-1.png b/fig/08-plot-ggplot2-rendered-lifeExp-vs-gdpPercap-scatter3-1.png new file mode 100644 index 000000000..44db5466d Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-lifeExp-vs-gdpPercap-scatter3-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-lm-fit-1.png b/fig/08-plot-ggplot2-rendered-lm-fit-1.png new file mode 100644 index 000000000..f819c105f Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-lm-fit-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-lm-fit2-1.png b/fig/08-plot-ggplot2-rendered-lm-fit2-1.png new file mode 100644 index 000000000..d93c05d0e Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-lm-fit2-1.png differ diff --git a/fig/08-plot-ggplot2-rendered-theme-1.png b/fig/08-plot-ggplot2-rendered-theme-1.png new file mode 100644 index 000000000..a9bd55f56 Binary files /dev/null and b/fig/08-plot-ggplot2-rendered-theme-1.png differ diff --git a/fig/09-vectorization-rendered-ch2-sol-1.png b/fig/09-vectorization-rendered-ch2-sol-1.png new file mode 100644 index 000000000..99fe38be7 Binary files /dev/null and b/fig/09-vectorization-rendered-ch2-sol-1.png differ diff --git a/fig/09-vectorization-rendered-ch2-sol-2.png b/fig/09-vectorization-rendered-ch2-sol-2.png new file mode 100644 index 000000000..5b630819b Binary files /dev/null and b/fig/09-vectorization-rendered-ch2-sol-2.png differ diff --git a/fig/12-plyr-fig1.png b/fig/12-plyr-fig1.png new file mode 100644 index 000000000..249bab4fa Binary files /dev/null and b/fig/12-plyr-fig1.png differ diff --git a/fig/12-plyr-fig1.tex b/fig/12-plyr-fig1.tex new file mode 100644 index 000000000..ded41a78c --- /dev/null +++ b/fig/12-plyr-fig1.tex @@ -0,0 +1,143 @@ +\documentclass[convert]{standalone} + +\usepackage{tikz} +\usepackage{colortbl} +\renewcommand{\familydefault}{\sfdefault} + +\begin{document} + 
+\begin{tikzpicture} + +% Headings + +\node (INPUT-LABEL) at (0, 5) {Input Data}; +\node (GROUP-LABEL) at (3, 5) {Split}; +\node (SUMMARY-LABEL) at (6, 5) {Apply}; +\node (OUTPUT-LABEL) at (9, 5) {Combine}; + + +% Data Nodes + +\node (INPUT) at (0, 2) { + + \begin{tabular}{| c | r |} + \hline + \rowcolor[gray]{.7} + x & y \\ \hline + a & 2 \\ \hline + a & 4 \\ \hline + b & 0 \\ \hline + b & 5 \\ \hline + c & 5 \\ \hline + c & 10 \\ \hline + \end{tabular} + +}; + +\node (GROUP-A) at (3, 4) { + + \begin{tabular}{| c | r |} + \hline + \rowcolor[gray]{.7} + x & y \\ \hline + a & 2 \\ \hline + a & 4 \\ \hline + \end{tabular} + +}; + +\node (GROUP-B) at (3, 2) { + + \begin{tabular}{| c | r |} + \hline + \rowcolor[gray]{.7} + x & y \\ \hline + b & 0 \\ \hline + b & 5 \\ \hline + \end{tabular} + +}; + +\node (GROUP-C) at (3, 0) { + + \begin{tabular}{| c | r |} + \hline + \rowcolor[gray]{.7} + x & y \\ \hline + c & 5 \\ \hline + c & 10 \\ \hline + \end{tabular} + +}; + +\node (SUMMARY-A) at (6, 4) { + + \begin{tabular}{| c | r |} + \hline + \rowcolor[gray]{.7} + x & y \\ \hline + a & 3.0 \\ \hline + \end{tabular} + +}; + +\node (SUMMARY-B) at (6, 2) { + + \begin{tabular}{| c | r |} + \hline + \rowcolor[gray]{.7} + x & y \\ \hline + b & 2.5 \\ \hline + \end{tabular} + +}; + +\node (SUMMARY-C) at (6, 0) { + + \begin{tabular}{| c | r |} + \hline + \rowcolor[gray]{.7} + x & y \\ \hline + c & 7.5 \\ \hline + \end{tabular} + +}; + +\node (OUPUT) at (9, 2) { + + \begin{tabular}{| c | r |} + \hline + \rowcolor[gray]{.7} + x & y \\ \hline + a & 3.0 \\ \hline + b & 2.5 \\ \hline + c & 7.5 \\ \hline + \end{tabular} + +}; + + +% Arrows + +\draw[->, to path={-> (\tikztotarget)}] + (INPUT) edge (GROUP-A) + (INPUT) edge (GROUP-B) + (INPUT) edge (GROUP-C) + + (GROUP-A) edge (SUMMARY-A) + (GROUP-B) edge (SUMMARY-B) + (GROUP-C) edge (SUMMARY-C) + + (SUMMARY-A) edge (OUPUT) + (SUMMARY-B) edge (OUPUT) + (SUMMARY-C) edge (OUPUT) +; + +\end{tikzpicture} + +\end{document} + 
+%------------------------ +% References +% https://tex.stackexchange.com/questions/251642/draw-arrows-between-nodes-with-tikz +% https://tex.stackexchange.com/questions/11866/compile-a-latex-document-into-a-png-image-thats-as-short-as-possible diff --git a/fig/12-plyr-fig2.png b/fig/12-plyr-fig2.png new file mode 100644 index 000000000..d00d25f5c Binary files /dev/null and b/fig/12-plyr-fig2.png differ diff --git a/fig/12-plyr-fig2.tex b/fig/12-plyr-fig2.tex new file mode 100644 index 000000000..56fdfcd3f --- /dev/null +++ b/fig/12-plyr-fig2.tex @@ -0,0 +1,64 @@ +\documentclass[convert]{standalone} + +\usepackage{array} +\usepackage{multirow} +\usepackage{rotating} +\usepackage{colortbl} +\renewcommand{\familydefault}{\sfdefault} +\renewcommand{\arraystretch}{2.2} + +\begin{document} + +\begin{tabular}{crccccc} + +& +& \multicolumn{4}{c}{Output} +\\ + +& %\cellcolor[gray]{0.7} +& \cellcolor[gray]{0.7}array +& \cellcolor[gray]{0.7}data frame +& \cellcolor[gray]{0.7}list +& \cellcolor[gray]{0.7}nothing +\\ + +& \cellcolor[gray]{0.7}array +& aaply +& adply +& alply +& a\_ply +\\ + +& \cellcolor[gray]{0.7}data frame +& daply +& ddply +& dlply +& d\_ply +\\ + +& \cellcolor[gray]{0.7} list +& laply +& ldply +& llply +& l\_ply +\\ + +& \cellcolor[gray]{0.7}n replicates +& raply +& rdply +& rlply +& r\_ply +\\ + +\multirow{-5}{*}{\rotatebox[origin=c]{90}{Input}} +& \cellcolor[gray]{0.7}function arguments +& maply +& mdply +& mlply +& m\_ply +\\ + +\end{tabular} + + +\end{document} diff --git a/fig/12-plyr-generate-figures.sh b/fig/12-plyr-generate-figures.sh new file mode 100755 index 000000000..9236d34e0 --- /dev/null +++ b/fig/12-plyr-generate-figures.sh @@ -0,0 +1,10 @@ +#! 
+ /bin/bash + +pdflatex -shell-escape 12-plyr-fig1.tex + +rm 12-plyr-fig1.aux 12-plyr-fig1.log 12-plyr-fig1.pdf + +pdflatex -shell-escape 12-plyr-fig2.tex + +rm 12-plyr-fig2.aux 12-plyr-fig2.log 12-plyr-fig2.pdf + diff --git a/fig/13-dplyr-fig1.png b/fig/13-dplyr-fig1.png new file mode 100644 index 000000000..7f3067a3c Binary files /dev/null and b/fig/13-dplyr-fig1.png differ diff --git a/fig/13-dplyr-fig2.png b/fig/13-dplyr-fig2.png new file mode 100644 index 000000000..caa86d462 Binary files /dev/null and b/fig/13-dplyr-fig2.png differ diff --git a/fig/13-dplyr-fig3.png b/fig/13-dplyr-fig3.png new file mode 100644 index 000000000..ae00ce386 Binary files /dev/null and b/fig/13-dplyr-fig3.png differ diff --git a/fig/13-dplyr-generate-figures.R b/fig/13-dplyr-generate-figures.R new file mode 100644 index 000000000..4c3f4b223 --- /dev/null +++ b/fig/13-dplyr-generate-figures.R @@ -0,0 +1,383 @@ +# export figures manually +library(DiagrammeR) +##################################### 13-dplyr-fig1.png ##################################### +grViz('digraph html { + table1 [shape=none, margin=0,label=< + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
>]; + + table2 [shape=none, margin=0, label=< + + + + + + + + + + + + + + + + + + + + +
>]; + + table1:f1:s -> table2:f1:s + table1:f0:n -> table2:f0:n + + subgraph { + rank = same; table1; table2; + } + + labelloc="t"; + fontname="Courier"; + label="select(data.frame, a, c)"; + } + ') + +##################################### 13-dplyr-fig2.png ##################################### +grViz('digraph html { + rankdir=LR; + table1 [shape=none, margin=0,label=< + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
>]; + table2 [shape=none, margin=0,label=< + + + + + + + + + + + + + + + + + + + + + + +
>]; + table3 [shape=none, margin=0,label=< + + + + + + + + + + + + + + + + + + + + + +
>]; + table4 [shape=none, margin=0,label=< + + + + + + + + + + + + + + + + + + + + + +
>]; + + table1:f0 -> table2:f0 + table1:f1 -> table3:f1 + table1:f2 -> table4:f2 + + + subgraph { + rank = same; table2; table3 ;table4; + } + + labelloc="t"; + fontname="Courier"; + label="gapminder %>%\\l\tgroup_by(a)"; + } + ') + +##################################### 13-dplyr-fig3.png ##################################### +grViz('digraph html { + rankdir=LR; + + table1 [shape=none, margin=0,label=< + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
>]; + + table2 [shape=none, margin=0,label=< + + + + + + + + + + + + + + + + + + + + + +
>]; + + table3 [shape=none, margin=0,label=< + + + + + + + + + + + + + + + + + + + + + +
>]; + + table4 [shape=none, margin=0,label=< + + + + + + + + + + + + + + + + + + + + + +
>]; + + table5 [shape=none, margin=0,label=< + + + + + + + + + + + + + + + + + + + + +
>]; + + + table1:f0 -> table2:f0 + table1:f1 -> table3:f1 + table1:f2 -> table4:f2 + table2:f3:n -> table5:f0 + table3:f3 -> table5:f1 + table4:f3 -> table5:f2:w + + subgraph { + table1; table2; table3 ;table4; table5 + } + + subgraph { + rank = same; table2; table3; table4; + } + + labelloc="t"; + fontname="Courier"; + label="gapminder %>%\\l\tgroup_by(a) %>%\\l\tsummarize(mean_b=mean(b))\\l"; + } + ') diff --git a/fig/13-dplyr-rendered-unnamed-chunk-27-1.png b/fig/13-dplyr-rendered-unnamed-chunk-27-1.png new file mode 100644 index 000000000..bc7c4d02c Binary files /dev/null and b/fig/13-dplyr-rendered-unnamed-chunk-27-1.png differ diff --git a/fig/13-dplyr-rendered-unnamed-chunk-28-1.png b/fig/13-dplyr-rendered-unnamed-chunk-28-1.png new file mode 100644 index 000000000..bc7c4d02c Binary files /dev/null and b/fig/13-dplyr-rendered-unnamed-chunk-28-1.png differ diff --git a/fig/13-dplyr-rendered-unnamed-chunk-29-1.png b/fig/13-dplyr-rendered-unnamed-chunk-29-1.png new file mode 100644 index 000000000..75472eaee Binary files /dev/null and b/fig/13-dplyr-rendered-unnamed-chunk-29-1.png differ diff --git a/fig/14-tidyr-fig1.png b/fig/14-tidyr-fig1.png new file mode 100644 index 000000000..4ce006667 Binary files /dev/null and b/fig/14-tidyr-fig1.png differ diff --git a/fig/14-tidyr-fig2.png b/fig/14-tidyr-fig2.png new file mode 100644 index 000000000..7287d0194 Binary files /dev/null and b/fig/14-tidyr-fig2.png differ diff --git a/fig/14-tidyr-fig3.png b/fig/14-tidyr-fig3.png new file mode 100644 index 000000000..4c13aa57d Binary files /dev/null and b/fig/14-tidyr-fig3.png differ diff --git a/fig/14-tidyr-fig3.svg b/fig/14-tidyr-fig3.svg new file mode 100644 index 000000000..6f756ef15 --- /dev/null +++ b/fig/14-tidyr-fig3.svg @@ -0,0 +1,269 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +ID +a1 +a2 +a3 +ID +2 +1 +3 +2 +1 +3 +2 +1 +3 +2 +1 +3 +a1 +a2 +a3 +a1 +a2 +a3 +a1 +a2 +a3 +key +value +wide format +long 
format +pivot_longer(data, cols = c("a1", "a2", "a3"), names_to = "key", values_to = "value") + + + + + + + + + + + + + + + + +ID +a1 +a2 +a3 +2 +1 +3 + + + + +ID +2 +1 +3 + + + + +ID +2 +1 +3 + + + + + + + + + + + + + + + + +ID +a1 +a2 +a3 +2 +1 +3 + + + + +ID +2 +1 +3 + + + + +ID +2 +1 +3 + + + + + + + + + + + + + +ID +2 +1 +3 + + + + +ID +2 +1 +3 + + + + +ID +2 +1 +3 + + + +a1 +a2 +a3 + + + +a1 +a2 +a3 + + + +a1 +a2 +a3 + + + + + + + + +separate byselected columns + + + + + + + + +convert column names to column + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +name columns with keyand value arguments + diff --git a/fig/14-tidyr-fig4.png b/fig/14-tidyr-fig4.png new file mode 100644 index 000000000..fd5d68c64 Binary files /dev/null and b/fig/14-tidyr-fig4.png differ diff --git a/fig/14-tidyr-generate-figures.R b/fig/14-tidyr-generate-figures.R new file mode 100644 index 000000000..9ed954ae4 --- /dev/null +++ b/fig/14-tidyr-generate-figures.R @@ -0,0 +1,385 @@ +# export figures manually +library(DiagrammeR) +##################################### 14-tidyr-fig1.png ##################################### +grViz('digraph html { + + table1 [shape=none, margin=0, label=< + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
>]; + + table2 [shape=none, margin=0,label=< + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
>]; + + subgraph { + rank = same; table1; table2; + } + + + labelloc="t"; + fontname="Courier"; + label="wide vs long"; + } + ') + +##################################### 14-tidyr-fig2.png ##################################### +grViz('digraph html { + table1 [shape=none, margin=0, label=< + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
>]; + + labelloc="t"; + fontname="Courier"; + label="wide format"; + } + ') + +##################################### 14-tidyr-fig3.png ##################################### +grViz('digraph html { + + table1 [shape=none, margin=0, label=< + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
>]; + + labelloc="t"; + fontname="Courier"; + label="long format"; + } + ') diff --git a/fig/15-knitr-markdown-rendered-rmd_to_html_fig-1.png b/fig/15-knitr-markdown-rendered-rmd_to_html_fig-1.png new file mode 100644 index 000000000..976f7bc79 Binary files /dev/null and b/fig/15-knitr-markdown-rendered-rmd_to_html_fig-1.png differ diff --git a/fig/New_R_Markdown.png b/fig/New_R_Markdown.png new file mode 100644 index 000000000..8542fe9bd Binary files /dev/null and b/fig/New_R_Markdown.png differ diff --git a/fig/bad_layout.png b/fig/bad_layout.png new file mode 100644 index 000000000..fcfda0c5a Binary files /dev/null and b/fig/bad_layout.png differ diff --git a/fig/rmd-06-equality.0.svg b/fig/rmd-06-equality.0.svg new file mode 100644 index 000000000..9671b0b3e --- /dev/null +++ b/fig/rmd-06-equality.0.svg @@ -0,0 +1,288 @@ + + + + + + + + + + image/svg+xml + + + + + + + c("a", "a", "a") + c("a", "c") + + + + + + + TRUE + FALSE + ?? + ?? + c("a", "a", "a") + c("a", "c") + + + + + + + TRUE + FALSE + c("a"... + TRUE + + diff --git a/fig/rmd-06-equality.1.png b/fig/rmd-06-equality.1.png new file mode 100644 index 000000000..f4152a338 Binary files /dev/null and b/fig/rmd-06-equality.1.png differ diff --git a/fig/rmd-06-equality.2.png b/fig/rmd-06-equality.2.png new file mode 100644 index 000000000..e33f4cf4f Binary files /dev/null and b/fig/rmd-06-equality.2.png differ diff --git a/fig/software-carpentry-banner.png b/fig/software-carpentry-banner.png new file mode 100644 index 000000000..746a9c53c Binary files /dev/null and b/fig/software-carpentry-banner.png differ diff --git a/fig/visual_mode_icon.png b/fig/visual_mode_icon.png new file mode 100644 index 000000000..d224e3cee Binary files /dev/null and b/fig/visual_mode_icon.png differ diff --git a/images.html b/images.html new file mode 100644 index 000000000..0fd00617b --- /dev/null +++ b/images.html @@ -0,0 +1,647 @@ + + + + + +R for Reproducible Scientific Analysis: All Images + + + + + + + + + + + +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + + +
+ + +

Introduction to R and RStudio


Figure 1

+ +
RStudio layout


Figure 2

+ +
RStudio layout with .R file open

Project Management With RStudio


Figure 1

+ +
Screenshot of file manager demonstrating bad project organisation

Seeking Help


Data Structures


Exploring Data Frames


Subsetting Data


Figure 1

+ +
Inequality testing


Figure 2

+ +
Inequality testing: results of recycling

Control Flow


Creating Publication-Quality Graphics with ggplot2


Figure 1

+ +
Blank plot, before adding any mapping aesthetics to ggplot().


Figure 2

+ +
Plotting area with axes for a scatter plot of life expectancy vs GDP, with no data points visible.


Figure 3

+ +
Scatter plot of life expectancy vs GDP per capita, now showing the data points.


Figure 4

+ +
Binned scatterplot of life expectancy versus year showing how life expectancy has increased over time


Figure 5

+ +
Binned scatterplot of life expectancy vs year with color-coded continents showing value of 'aes' function


Figure 6



Figure 7



Figure 8



Figure 9



Figure 10

+ +
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency.


Figure 11



Figure 12

+ +
Scatterplot of GDP vs life expectancy showing logarithmic x-axis data spread


Figure 13

+ +
Scatter plot of life expectancy vs GDP per capita with a blue trend line summarising the relationship between variables, and gray shaded area indicating 95% confidence intervals for that trend line.


Figure 14

+ +
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The blue trend line is slightly thicker than in the previous figure.


Figure 15

+ +
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency.


Figure 16



Figure 17



Figure 18



Figure 19




Figure 1

+ +
Scatter plot showing populations in the millions against the year for China, India, and Indonesia, countries are not labeled.


Figure 2

+ +
Scatter plot showing populations in the millions against the year for China, India, and Indonesia, countries are not labeled.

Functions Explained


Writing Data


Splitting and Combining Data Frames with plyr


Figure 1

+ +
Split apply combine


Figure 2

+ +
Full apply suite

Data Frame Manipulation with dplyr


Figure 1

+ +

Diagram illustrating use of select function to select two columns of a data frame +If we want to remove one column only from the gapminder +data, for example, removing the continent column.


Figure 2

+ +
Diagram illustrating how the group by function organizes a data frame into groups


Figure 3

+ +
Diagram illustrating the use of group by and summarize together to create a new variable


Figure 4



Figure 5



Figure 6


Data Frame Manipulation with tidyr


Figure 1

+ +
Diagram illustrating the difference between a wide versus long layout of a data frame


Figure 2

+ +
Diagram illustrating the wide format of the gapminder data frame


Figure 3

+ +
Diagram illustrating how pivot longer reorganizes a data frame from a wide to long format


Figure 4

+ +
Diagram illustrating the long format of the gapminder data

Producing Reports With knitr


Figure 1

+ +
Screenshot of the New R Markdown file dialogue box in RStudio


Figure 2



Figure 3


RStudio versions 1.4 and later include visual markdown editing mode. +In visual editing mode, markdown expressions (like +**bold words**) are transformed to the formatted appearance +(bold words) as you type. This mode also includes a +toolbar at the top with basic formatting buttons, similar to what you +might see in common word processing software programs. You can turn +visual editing on and off by pressing the button in the top right corner of your +R Markdown document.


Writing Good Software

+ + +
+ + +
+ + + + + diff --git a/index.html b/index.html new file mode 100644 index 000000000..e07b6bb42 --- /dev/null +++ b/index.html @@ -0,0 +1,464 @@ + +R for Reproducible Scientific Analysis: Summary and Setup +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Summary and Setup

+ + +

an introduction to R for non-programmers using gapminder +data


The goal of this lesson is to teach novice programmers to write modular code and to follow best practices when using R for data analysis. R is widely used in many scientific disciplines for statistical analysis and for its array of third-party packages. We find that many scientists who come to Software Carpentry workshops use R and want to learn more. The emphasis of these materials is to give attendees a strong foundation in the fundamentals of R, and to teach best practices for scientific computing: breaking down analyses into modular units, task automation, and encapsulation.


Note that this workshop will focus on teaching the fundamentals of +the programming language R, and will not teach statistical analysis.


The lesson contains more material than can be taught in a day. The instructor notes page has some suggested lesson plans suitable for a one-day or half-day workshop.


A variety of third party packages are used throughout this workshop. +These are not necessarily the best, nor are they comprehensive, but they +are packages we find useful, and have been chosen primarily for their +usability.

+ +

Prerequisites +


Understand that computers store data and instructions (programs, +scripts etc.) in files. Files are organised in directories (folders). +Know how to access files not in the working directory by specifying the +path.

+ + +

This lesson assumes you have R and RStudio installed on your +computer.

+ + +
+ + + diff --git a/instructor-notes.html b/instructor-notes.html new file mode 100644 index 000000000..ba32de996 --- /dev/null +++ b/instructor-notes.html @@ -0,0 +1,629 @@ + + + + + +R for Reproducible Scientific Analysis: Instructor Notes + + + + + + + + + + + +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + + +

Instructor Notes

+ + +

Timing +


Leave about 30 minutes at the start of each workshop and another 15 minutes at the start of each session for technical difficulties such as WiFi problems and software installation. Allow this time even if you asked students to install in advance, and longer if you did not.


Lesson Plans +


The lesson contains much more material than can be taught in a day. +Instructors will need to pick an appropriate subset of episodes to use +in a standard one day course.


Some suggested paths through the material are:


(suggested by @liz-is)

  • 01 Introduction to R and RStudio
  • +
  • 04 Data Structures
  • +
  • 05 Exploring Data Frames (“Realistic example” section onwards)
  • +
  • 08 Creating Publication-Quality Graphics with ggplot2
  • +
  • 10 Functions Explained
  • +
  • 13 Dataframe Manipulation with dplyr
  • +
  • 15 Producing Reports With knitr
  • +

(suggested by @naupaka)

  • 01 Introduction to R and RStudio
  • +
  • 02 Project Management With RStudio
  • +
  • 03 Seeking Help
  • +
  • 04 Data Structures
  • +
  • 05 Exploring Data Frames
  • +
  • 06 Subsetting Data
  • +
  • 09 Vectorization
  • +
  • 08 Creating Publication-Quality Graphics with ggplot2 OR 13 +Dataframe Manipulation with dplyr
  • +
  • 15 Producing Reports With knitr
  • +

A half day course could consist of (suggested by @karawoo):

  • 01 Introduction to R and RStudio
  • +
  • 04 Data Structures (only creating vectors with +c())
  • +
  • 05 Exploring Data Frames (“Realistic example” section onwards)
  • +
  • 06 Subsetting Data (excluding factor, matrix and list +subsetting)
  • +
  • 08 Creating Publication-Quality Graphics with ggplot2
  • +

Setting up git in RStudio +


There can be difficulties linking git to RStudio depending on the +operating system and the version of the operating system. To make sure +Git is properly installed and configured, the learners should go to the +Options window in the RStudio application.

  • +Mac OS X: +
    • Go to RStudio -> Preferences… -> Git/SVN
    • +
    • Check and see whether there is a path to a file in the “Git +executable” window. If not, the next challenge is figuring out where Git +is located.
    • +
    • In the terminal enter which git and you will get a path +to the git executable. In the “Git executable” window you may have +difficulties finding the directory since OS X hides many of the +operating system files. While the file selection window is open, +pressing “Command-Shift-G” will pop up a text entry box where you will +be able to type or paste in the full path to your git executable: +e.g. /usr/bin/git or whatever else it might be.
    • +
  • +
  • +Windows: +
    • Go to Tools -> Global options… -> Git/SVN
    • +
    • If you use the Software Carpentry Installer, then ‘git.exe’ should +be installed at C:/Program Files/Git/bin/git.exe.
    • +
  • +

To prevent the learners from having to re-enter their password each +time they push a commit to GitHub, this command (which can be run from a +bash prompt) will make it so they only have to enter their password +once:



$ git config --global credential.helper 'cache --timeout=10000000'

Pulling in Data +


The easiest way to get the data used in this lesson during a workshop +is to have attendees download the raw data from gapminder-data and gapminder-data-wide.


Attendees can use the File - Save As dialog in their +browser to save the file.


Overall +


Make sure to emphasize good practices: put code in scripts, and make +sure they’re version controlled. Encourage students to create script +files for challenges.


If you’re working in a cloud environment, get them to upload the +gapminder data after the second lesson.


Make sure to emphasize that matrices are vectors under the hood and data frames are lists under the hood: this will explain a lot of the esoteric behaviour encountered in basic operations.
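One way to make this point concrete for learners is a short demonstration such as the sketch below. The object names `m` and `df` are illustrative only and are not part of the lesson data.

```r
# A matrix is an atomic vector with a "dim" attribute attached.
m <- matrix(1:6, nrow = 2)
length(m)        # number of underlying elements, not rows or columns
m[5]             # plain vector indexing works (values are stored column by column)
attributes(m)    # the only attribute is $dim
dim(m) <- NULL   # dropping the attribute leaves the bare vector behind
m

# A data frame is a list of equal-length column vectors.
df <- data.frame(id = 1:3, label = c("a", "b", "c"))
typeof(df)       # "list"
is.list(df)      # TRUE
length(df)       # one list element per column
df[["label"]]    # list-style extraction returns the column vector
```

Showing that `length()` and `[[` behave on these objects exactly as they do on vectors and lists tends to explain many of the surprises learners meet later.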


Vector recycling and function stacks are probably best explained with +diagrams on a whiteboard.
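If a whiteboard is not available, a minimal sketch like the following can accompany the explanation of recycling. The variable names are hypothetical.

```r
long  <- c(1, 2, 3, 4)
short <- c(10, 20)

# The shorter vector is reused from its start until it matches the longer one,
# so this is evaluated as c(1, 2, 3, 4) + c(10, 20, 10, 20).
long + short

# When the longer length is not a multiple of the shorter one,
# R still recycles but also emits a warning.
c(1, 2, 3) + c(10, 20)
```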


Be sure to actually go through examples of an R help page: help files +can be intimidating at first, but knowing how to read them is +tremendously useful.


Be sure to show the CRAN task views, and look at one of the topics.


There’s a lot of content: move quickly through the earlier lessons. Their extensiveness is mostly for learning by osmosis, so that learners’ memories are triggered later when they encounter a problem or some esoteric behaviour.


Key lessons to take time on:

  • Data subsetting - conceptually difficult for novices
  • +
  • Functions - learners especially struggle with this
  • +
  • Data structures - worth being thorough, but you can go through it +quickly.
  • +

Don’t worry about being correct or knowing the material +back-to-front. Use mistakes as teaching moments: the most vital skill +you can impart is how to debug and recover from unexpected errors.

+ + +
+ + +
+ + + + + diff --git a/instructor/01-rstudio-intro.html b/instructor/01-rstudio-intro.html new file mode 100644 index 000000000..b643c6f02 --- /dev/null +++ b/instructor/01-rstudio-intro.html @@ -0,0 +1,1470 @@ + +R for Reproducible Scientific Analysis: Introduction to R and RStudio +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Introduction to R and RStudio


Last updated on 2023-10-26

+ + + +

Estimated time 55 minutes

+ +
+ +
+ + + +




  • How to find your way around RStudio?
  • +
  • How to interact with R?
  • +
  • How to manage your environment?
  • +
  • How to install packages?
  • +


  • Describe the purpose and use of each pane in the RStudio IDE
  • +
  • Locate buttons and options in the RStudio IDE
  • +
  • Define a variable
  • +
  • Assign data to a variable
  • +
  • Manage a workspace in an interactive R session
  • +
  • Use mathematical and comparison operators
  • +
  • Call functions
  • +
  • Manage packages
  • +

Motivation +


Science is a multi-step process: once you’ve designed an experiment +and collected data, the real fun begins! This lesson will teach you how +to start this process using R and RStudio. We will begin with raw data, +perform exploratory analyses, and learn how to plot results graphically. +This example starts with a dataset from gapminder.org containing population +information for many countries through time. Can you read the data into +R? Can you plot the population for Senegal? Can you calculate the +average income for countries on the continent of Asia? By the end of +these lessons you will be able to do things like plot the populations +for all of these countries in under a minute!


Before Starting The Workshop +


Please ensure you have the latest version of R and RStudio installed +on your machine. This is important, as some packages used in the +workshop may not install correctly (or at all) if R is not up to +date.


Introduction to RStudio +


Welcome to the R portion of the Software Carpentry workshop.


Throughout this lesson, we’re going to teach you some of the +fundamentals of the R language as well as some best practices for +organizing code for scientific projects that will make your life +easier.


We’ll be using RStudio: a free, open-source R Integrated Development +Environment (IDE). It provides a built-in editor, works on all platforms +(including on servers) and provides many advantages such as integration +with version control and project management.


Basic layout


When you first open RStudio, you will be greeted by three panels:

  • The interactive R console/Terminal (entire left)
  • +
  • Environment/History/Connections (tabbed in upper right)
  • +
  • Files/Plots/Packages/Help/Viewer (tabbed in lower right)
  • +
RStudio layout

Once you open files, such as R scripts, an editor panel will also +open in the top left.

RStudio layout with .R file open
+ +

R scripts +


Any commands that you write in the R console can be saved to a file and re-run later. Files containing R code to be run in this way are called R scripts. R scripts have .R at the end of their names to let you know what they are.


Workflow within RStudio +


There are two main ways one can work within RStudio:

  1. Test and play within the interactive R console then copy code into a +.R file to run later.
  2. +
  • This works well when doing small tests and initially starting +off.
  • +
  • It quickly becomes laborious
  • +
  1. Start writing in a .R file and use RStudio’s shortcut keys for the Run command to push the current line, selected lines or modified lines to the interactive R console.
  2. +
  • This is a great way to start; all your code is saved for later
  • +
  • You will be able to run the file you create from within RStudio or +using R’s source() function.
  • +
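As a small sketch of the second workflow, the code below writes a two-line script to a temporary file and runs it with `source()`. The file contents and the names `radius` and `area` are made up for illustration; in practice you would save the script in your project directory.

```r
# Create a throwaway script file (a stand-in for a script saved from the editor).
script_path <- tempfile(fileext = ".R")
writeLines(c("radius <- 2",
             "area <- pi * radius^2"),
           script_path)

# source() reads the file and runs every line, as if typed at the console.
source(script_path)
area
```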
+ +

Tip: Running segments of your code +


RStudio offers you great flexibility in running code from within the +editor window. There are buttons, menu choices, and keyboard shortcuts. +To run the current line, you can

  1. click on the Run button above the editor panel, or
  2. +
  3. select “Run Lines” from the “Code” menu, or
  4. +
  5. hit Ctrl+Return in Windows or Linux or ⌘+Return on OS X. (This shortcut can also be seen by hovering the mouse over the button.) To run a block of code, select it and then Run. If you have modified a line of code within a block of code you have just run, there is no need to reselect the section and Run; you can use the next button along, Re-run the previous region. This will run the previous code block including the modifications you have made.
  6. +

Introduction to R +


Much of your time in R will be spent in the R interactive console. +This is where you will run all of your code, and can be a useful +environment to try out ideas before adding them to an R script file. +This console in RStudio is the same as the one you would get if you +typed in R in your command-line environment.


The first thing you will see in the R interactive session is a bunch +of information, followed by a “>” and a blinking cursor. In many ways +this is similar to the shell environment you learned about during the +shell lessons: it operates on the same idea of a “Read, evaluate, print +loop”: you type in commands, R tries to execute them, and then returns a +result.


Using R as a calculator +


The simplest thing you could do with R is to do arithmetic:


R +

+1 + 100


[1] 101

And R will print out the answer, with a preceding “[1]”. [1] is the +index of the first element of the line being printed in the console. For +more information on indexing vectors, see Episode +6: Subsetting Data.
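To see what the bracketed index means, it can help to print a result that spans more than one line. This is only a sketch; `many_values` is a hypothetical name, and where the output wraps depends on the width of your console.

```r
# A vector with 31 elements will usually not fit on one console line.
many_values <- 100:130

# Each printed line starts with the position of its first element,
# so the first line always begins with [1].
many_values
```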


If you type in an incomplete command, R will wait for you to complete it. If you are familiar with the Unix shell bash, you may recognize this behavior.


R +

> 1 +



Any time you hit return and the R session shows a “+” instead of a +“>”, it means it’s waiting for you to complete the command. If you +want to cancel a command you can hit Esc and RStudio will +give you back the “>” prompt.

+ +

Tip: Canceling commands +


If you’re using R from the command line instead of from within +RStudio, you need to use Ctrl+C instead of +Esc to cancel the command. This applies to Mac users as +well!


Canceling a command isn’t only useful for killing incomplete +commands: you can also use it to tell R to stop running code (for +example if it’s taking much longer than you expect), or to get rid of +the code you’re currently writing.


When using R as a calculator, the order of operations is the same as +you would have learned back in school.


From highest to lowest precedence:

  • Parentheses: (, ) +
  • +
  • Exponents: ^ or ** +
  • +
  • Multiply: * +
  • +
  • Divide: / +
  • +
  • Add: + +
  • +
  • Subtract: - +
  • +

R +

+3 + 5 * 2


[1] 13

Use parentheses to group operations in order to force the order of +evaluation if it differs from the default, or to make clear what you +intend.


R +

+(3 + 5) * 2


[1] 16

This can get unwieldy when not needed, but clarifies your intentions. +Remember that others may later read your code.


R +

+(3 + (5 * (2 ^ 2))) # hard to read
+3 + 5 * 2 ^ 2       # clear, if you remember the rules
+3 + 5 * (2 ^ 2)     # if you forget some rules, this might help

The text after each line of code is called a “comment”. Anything that +follows after the hash (or octothorpe) symbol # is ignored +by R when it executes code.


Really small or large numbers get a scientific notation:


R +



[1] 2e-04

The e is shorthand for “multiplied by 10^XX”, so 2e-4 is shorthand for 2 * 10^(-4).


You can write numbers in scientific notation too:


R +

+5e3  # Note the lack of minus here


[1] 5000
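The following sketch ties the two notations together. The name `small` is illustrative, and `format()` is used here only to show the same number written out in fixed notation.

```r
small <- 2 / 10000
small                                  # R chooses scientific notation for display

# The stored value is unchanged; only the printed form differs.
format(small, scientific = FALSE)      # "0.0002"

# 2e-4 and 2 * 10^(-4) describe the same quantity.
isTRUE(all.equal(2e-4, 2 * 10^(-4)))

# A positive exponent scales the number up instead.
5e3 == 5000
```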

Mathematical functions +


R has many built-in mathematical functions. To call a function, we type its name, followed by opening and closing parentheses. Functions take arguments as inputs: anything we type inside the parentheses of a function is considered an argument. Depending on the function, the number of arguments can vary from none to multiple. For example:


R +

+getwd() #returns an absolute filepath

doesn’t require an argument, whereas for the next set of mathematical +functions we will need to supply the function a value in order to +compute the result.


R +

+sin(1)  # trigonometry functions


[1] 0.841471

R +

+log(1)  # natural logarithm


[1] 0

R +

+log10(10) # base-10 logarithm


[1] 1

R +

+exp(0.5) # e^(1/2)


[1] 1.648721

Don’t worry about trying to remember every function in R. You can +look them up on Google, or if you can remember the start of the +function’s name, use the tab completion in RStudio.


This is one advantage that RStudio has over R on its own: it has auto-completion abilities that allow you to more easily look up functions, their arguments, and the values that they take.


Typing a ? before the name of a command will open the +help page for that command. When using RStudio, this will open the +‘Help’ pane; if using R in the terminal, the help page will open in your +browser. The help page will include a detailed description of the +command and how it works. Scrolling to the bottom of the help page will +usually show a collection of code examples which illustrate command +usage. We’ll go through an example later.
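As a brief sketch of looking something up, the lines below ask for help on `log`, which was used above, and then apply what the help page documents. `args()` is a base R function that prints a function's argument list.

```r
?log               # shorthand for help("log")
help("log")        # the same request written as an ordinary function call

# A quick look at the arguments without opening the full help page.
args(log)

# The help page documents an optional 'base' argument.
log(8, base = 2)
```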


Comparing things +


We can also do comparisons in R:


R +

+1 == 1  # equality (note two equals signs, read as "is equal to")


[1] TRUE

R +

+1 != 2  # inequality (read as "is not equal to")


[1] TRUE

R +

+1 < 2  # less than


[1] TRUE

R +

+1 <= 1  # less than or equal to


[1] TRUE

R +

+1 > 0  # greater than


[1] TRUE

R +

+1 >= -9 # greater than or equal to


[1] TRUE
+ +

Tip: Comparing Numbers +


A word of warning about comparing numbers: you should never use +== to compare two numbers unless they are integers (a data +type which can specifically represent only whole numbers).


Computers may only represent decimal numbers with a certain degree of +precision, so two numbers which look the same when printed out by R, may +actually have different underlying representations and therefore be +different by a small margin of error (called Machine numeric +tolerance).


Instead you should use the all.equal function.


Further reading: http://floating-point-gui.de/
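A minimal sketch of the problem and the suggested remedy follows; the names `a` and `b` are illustrative. Note that `all.equal()` returns either `TRUE` or a description of the difference, so it is wrapped in `isTRUE()` when a plain logical value is needed.

```r
a <- 0.1 + 0.2
b <- 0.3

# Both print as 0.3, but the stored values differ in the last bits.
a == b                    # FALSE

# all.equal() compares within a small numeric tolerance.
isTRUE(all.equal(a, b))   # TRUE
```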


Variables and assignment +


We can store values in variables using the assignment operator +<-, like this:


R +

+x <- 1/40

Notice that assignment does not print a value. Instead, we stored it +for later in something called a variable. +x now contains the value +0.025:


R +

+x


[1] 0.025

More precisely, the stored value is a decimal approximation +of this fraction called a floating point +number.


Look for the Environment tab in the top right panel of +RStudio, and you will see that x and its value have +appeared. Our variable x can be used in place of a number +in any calculation that expects a number:


R +

+log(x)


[1] -3.688879

Notice also that variables can be reassigned:


R +

+x <- 100

x used to contain the value 0.025 and now it has the +value 100.


Assignment values can contain the variable being assigned to:


R +

+x <- x + 1 #notice how RStudio updates its description of x on the top right tab
+y <- x * 2

The right hand side of the assignment can be any valid R expression. +The right hand side is fully evaluated before the assignment +occurs.


Variable names can contain letters, numbers, underscores and periods but no spaces. They must start with a letter or a period followed by a letter (they cannot start with a number or an underscore). Variables beginning with a period are hidden variables. Different people use different conventions for long variable names; these include

  • periods.between.words
  • +
  • underscores_between_words
  • +
  • camelCaseToSeparateWords
  • +

What you use is up to you, but be consistent.
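One way to check a candidate name against these rules is base R's `make.names()`, which returns a syntactically valid version of each string it is given. The sketch below uses made-up names; a name is already valid if `make.names()` leaves it unchanged.

```r
candidates <- c("total_count", "total.count", "totalCount",
                ".hidden_total", "2nd_total", "total count")

# Invalid names are altered: a leading digit gets an "X" prefix
# and a space is replaced by a period.
fixed <- make.names(candidates)
fixed

# TRUE marks the names that were valid as written.
candidates == fixed
```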


It is also possible to use the = operator for +assignment:


R +

+x = 1/40

But this is much less common among R users. The most important thing +is to be consistent with the operator you use. There +are occasionally places where it is less confusing to use +<- than =, and it is the most common symbol +used in the community. So the recommendation is to use +<-.

+ +

Challenge 1 +


Which of the following are valid R variable names?


R +

+ +

The following can be used as R variables:


R +


The following creates a hidden variable:


R +


The following will not be able to be used to create a variable


R +


Vectorization +


One final thing to be aware of is that R is vectorized, meaning that variables and functions can have vectors as values. In contrast to physics and mathematics, a vector in R describes an ordered set of values of the same data type. For example:


R +



[1] 1 2 3 4 5

R +



[1]  2  4  8 16 32

R +

+x <- 1:5
+2^x


[1]  2  4  8 16 32

This is incredibly powerful; we will discuss this further in an +upcoming lesson.
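As a further sketch of the same idea, arithmetic, mathematical functions and comparisons all work element by element on a vector. The name `values` is used here only for illustration.

```r
values <- 1:5

2^values           # one power of two per element
values * 10        # every element is multiplied by 10
values + values    # elements are added pairwise
sqrt(c(1, 4, 9))   # mathematical functions are applied to each element
values > 2         # comparisons return one TRUE or FALSE per element
```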


Managing your environment +


There are a few useful commands you can use to interact with the R +session.


ls will list all of the variables and functions stored +in the global environment (your working R session):


R +

+ls()


[1] "x" "y"
+ +

Tip: hidden objects +


As in the shell, ls will hide any variables or functions starting with a “.” by default. To list all objects, type ls(all.names=TRUE) instead.
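A short sketch of this behaviour, using a hypothetical hidden variable called `.scratch_value`:

```r
.scratch_value <- 42

# The default listing leaves out names that start with a period.
".scratch_value" %in% ls()                   # FALSE

# With all.names = TRUE the hidden name is included.
".scratch_value" %in% ls(all.names = TRUE)   # TRUE
```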


Note here that we didn’t give any arguments to ls, but +we still needed to give the parentheses to tell R to call the +function.


If we type ls by itself, R prints a bunch of code +instead of a listing of objects.


R +

+ls


function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE, 
+    pattern, sorted = TRUE) 
+    if (!missing(name)) {
+        pos <- tryCatch(name, error = function(e) e)
+        if (inherits(pos, "error")) {
+            name <- substitute(name)
+            if (!is.character(name)) 
+                name <- deparse(name)
+            warning(gettextf("%s converted to character string", 
+                sQuote(name)), domain = NA)
+            pos <- name
+        }
+    }
+    all.names <- .Internal(ls(envir, all.names, sorted))
+    if (!missing(pattern)) {
+        if ((ll <- length(grep("[", pattern, fixed = TRUE))) && 
+            ll != length(grep("]", pattern, fixed = TRUE))) {
+            if (pattern == "[") {
+                pattern <- "\\["
+                warning("replaced regular expression pattern '[' by  '\\\\['")
+            }
+            else if (length(grep("[^\\\\]\\[<-", pattern))) {
+                pattern <- sub("\\[<-", "\\\\\\[<-", pattern)
+                warning("replaced '[<-' by '\\\\[<-' in regular expression pattern")
+            }
+        }
+        grep(pattern, all.names, value = TRUE)
+    }
+    else all.names
+<bytecode: 0x557b0600c360>
+<environment: namespace:base>

What’s going on here?


Like everything in R, ls is the name of an object, and +entering the name of an object by itself prints the contents of the +object. The object x that we created earlier contains 1, 2, +3, 4, 5:


R +

+x


[1] 1 2 3 4 5

The object ls contains the R code that makes the +ls function work! We’ll talk more about how functions work +and start writing our own later.


You can use rm to delete objects you no longer need:


R +


If you have lots of things in your environment and want to delete all +of them, you can pass the results of ls to the +rm function:


R +

+rm(list = ls())

In this case we’ve combined the two. Like the order of operations, +anything inside the innermost parentheses is evaluated first, and so +on.


In this case we’ve specified that the results of ls should be used for the list argument in rm. When assigning values to arguments by name, you must use the = operator!


If instead we use <-, there will be unintended side +effects, or you may get an error message:


R +

+rm(list <- ls())


Error in rm(list <- ls()): ... must contain names or character strings
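The sketch below shows both points on a small scale: removing named objects with `rm()`, and the side effect of writing `<-` inside a function call. All variable names here are hypothetical.

```r
tmp_a <- 1
tmp_b <- 2

# The list argument takes a character vector of object names.
rm(list = c("tmp_a", "tmp_b"))
exists("tmp_a")                    # FALSE

# With =, the value is only matched to the argument called x.
mean(x = c(2, 4, 6))

# With <-, the value is first assigned to a new variable in your
# workspace and then passed on to the function.
mean(probe_values <- c(2, 4, 6))
exists("probe_values")             # TRUE
```

Here the second `mean()` call happens to work, which is exactly why the stray variable it leaves behind is easy to miss.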
+ +

Tip: Warnings vs. Errors +


Pay attention when R does something unexpected! Errors, like above, +are thrown when R cannot proceed with a calculation. Warnings on the +other hand usually mean that the function has run, but it probably +hasn’t worked as expected.


In both cases, the message that R prints out usually gives you clues on how to fix the problem.
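
As a rough illustration of the distinction, the sketch below triggers one warning and one error and captures their messages with tryCatch(). The variable names are chosen here for the example and are not part of the lesson.

```r
# A warning: the function still returns a value (here NA), but flags a problem
converted <- suppressWarnings(as.numeric("abc"))
is.na(converted)

# Capture the warning message instead of letting R print it
warning_text <- tryCatch(as.numeric("abc"),
                         warning = function(w) conditionMessage(w))

# An error: R cannot proceed, so the call produces no value at all
error_text <- tryCatch(1 + "a",
                       error = function(e) conditionMessage(e))

warning_text
error_text
```

Reading the captured text is the same as reading the message R would normally print in the console.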


R Packages +


It is possible to add functions to R by writing a package, or by +obtaining a package written by someone else. As of this writing, there +are over 10,000 packages available on CRAN (the comprehensive R archive +network). R and RStudio have functionality for managing packages:

  • You can see what packages are installed by typing installed.packages()
  • You can install packages by typing install.packages("packagename"), where packagename is the package name, in quotes.
  • You can update installed packages by typing update.packages()
  • You can remove a package with remove.packages("packagename")
  • You can make a package available for use with library(packagename)

Packages can also be viewed, loaded, and detached in the Packages tab +of the lower right panel in RStudio. Clicking on this tab will display +all of the installed packages with a checkbox next to them. If the box +next to a package name is checked, the package is loaded and if it is +empty, the package is not loaded. Click an empty box to load that +package and click a checked box to detach that package.


Packages can be installed and updated from the Packages tab with the Install and Update buttons at the top of the tab.
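
The functions listed above can be combined into an "install only if missing" pattern. This is a minimal sketch, not a prescribed workflow: the variable pkg is a placeholder, and "stats" is used only because it ships with every R installation, so the install branch is never reached here.

```r
pkg <- "stats"   # placeholder: replace with the package you need

# requireNamespace() returns TRUE if the package is installed, without attaching it
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)   # only reached when the package is missing
}

# Attach the package; character.only = TRUE lets us pass the name as a string
library(pkg, character.only = TRUE)

# Confirm that it appears among the installed packages
pkg %in% rownames(installed.packages())
```

Checking before installing avoids re-downloading a package every time a script is run.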

+ +

Challenge 2 +


What will be the value of each variable after each statement in the +following program?


R +

+mass <- 47.5
+age <- 122
+mass <- mass * 2.3
+age <- age - 20
+ +

R +

+mass <- 47.5

This will give a value of 47.5 for the variable mass


R +

+age <- 122

This will give a value of 122 for the variable age


R +

+mass <- mass * 2.3

This will multiply the existing value of 47.5 by 2.3 to give a new +value of 109.25 to the variable mass.


R +

+age <- age - 20

This will subtract 20 from the existing value of 122 to give a new +value of 102 to the variable age.

+ +

Challenge 3 +


Run the code from the previous challenge, and write a command to +compare mass to age. Is mass larger than age?

+ +

One way of answering this question in R is to use the > operator to set up the following comparison:


R +

+mass > age


[1] TRUE

This should yield a boolean value of TRUE since 109.25 is greater +than 102.

+ +

Challenge 4 +


Clean up your working environment by deleting the mass and age +variables.

+ +

We can use the rm command to accomplish this task


R +

+rm(age, mass)
+ +

Challenge 5 +


Install the following packages: ggplot2, +plyr, gapminder

+ +

We can use the install.packages() command to install the +required packages.


R +


An alternative solution, which installs multiple packages with a single install.packages() command, is:


R +

+install.packages(c("ggplot2", "plyr", "gapminder"))
+ +

Keypoints +

  • Use RStudio to write and run R programs.
  • R has the usual arithmetic operators and mathematical functions.
  • Use <- to assign values to variables.
  • Use ls() to list the variables in a program.
  • Use rm() to delete objects in a program.
  • Use install.packages() to install packages (libraries).
+ + +
+ + + diff --git a/instructor/02-project-intro.html b/instructor/02-project-intro.html new file mode 100644 index 000000000..7f0b12b7a --- /dev/null +++ b/instructor/02-project-intro.html @@ -0,0 +1,822 @@ + +R for Reproducible Scientific Analysis: Project Management With RStudio +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Project Management With RStudio


Last updated on 2023-10-26

+ + + +

Estimated time 30 minutes

+ +
+ +
+ + + +




  • How can I manage my projects in R?
  • +


  • Create self-contained projects in RStudio
  • +

Introduction +


The scientific process is naturally incremental, and many projects +start life as random notes, some code, then a manuscript, and eventually +everything is a bit mixed together.

+ +

Most people tend to organize their projects like this:

Screenshot of file manager demonstrating bad project organisation

There are many reasons why we should ALWAYS avoid this:

  1. It is really hard to tell which version of your data is the original and which is the modified;
  2. It gets really messy because it mixes files with various extensions together;
  3. It probably takes you a lot of time to actually find things, and relate the correct figures to the exact code that has been used to generate them.

A good project layout will ultimately make your life easier:

  • It will help ensure the integrity of your data;
  • It makes it simpler to share your code with someone else (a lab-mate, collaborator, or supervisor);
  • It allows you to easily upload your code with your manuscript submission;
  • It makes it easier to pick the project back up after a break.

A possible solution +


Fortunately, there are tools and packages which can help you manage +your work effectively.


One of the most powerful and useful aspects of RStudio is its project +management functionality. We’ll be using this today to create a +self-contained, reproducible project.

+ +

Challenge 1: Creating a self-contained +project +


We’re going to create a new project in RStudio:

  1. Click the “File” menu button, then “New Project”.
  2. Click “New Directory”.
  3. Click “New Project”.
  4. Type in the name of the directory to store your project, e.g. “my_project”.
  5. If available, select the checkbox for “Create a git repository.”
  6. Click the “Create Project” button.

The simplest way to open an RStudio project once it has been created +is to click through your file system to get to the directory where it +was saved and double click on the .Rproj file. This will +open RStudio and start your R session in the same directory as the +.Rproj file. All your data, plots and scripts will now be +relative to the project directory. RStudio projects have the added +benefit of allowing you to open multiple projects at the same time each +open to its own project directory. This allows you to keep multiple +projects open without them interfering with each other.

+ +

Challenge 2: Opening an RStudio project +through the file system +

  1. Exit RStudio.
  2. Navigate to the directory where you created a project in Challenge 1.
  3. Double click on the .Rproj file in that directory.

Best practices for project organization +


Although there is no “best” way to lay out a project, there are some +general principles to adhere to that will make project management +easier:


Treat data as read only


This is probably the most important goal of setting up a project. Data is typically time-consuming and/or expensive to collect. Working with it interactively (e.g., in Excel), where it can be modified, means you are never sure where the data came from or how it has been modified since collection. It is therefore a good idea to treat your data as “read-only”.


Data Cleaning


In many cases your data will be “dirty”: it will need significant preprocessing to get into a format R (or any other programming language) will find useful. This task is sometimes called “data munging”. Storing the cleaning scripts in a separate folder, and creating a second “read-only” data folder to hold the “cleaned” data sets, can prevent confusion between the two sets.


Treat generated output as disposable


Anything generated by your scripts should be treated as disposable: +it should all be able to be regenerated from your scripts.


There are lots of different ways to manage this output. Having an output folder with different sub-directories for each separate analysis makes it easier later, since many analyses are exploratory and don’t end up being used in the final project, and some of the analyses get shared between projects.

+ +

Tip: Good Enough Practices for Scientific +Computing +


Good +Enough Practices for Scientific Computing gives the following +recommendations for project organization:

  1. Put each project in its own directory, which is named after the project.
  2. Put text documents associated with the project in the doc directory.
  3. Put raw data and metadata in the data directory, and files generated during cleanup and analysis in a results directory.
  4. Put source for the project’s scripts and programs in the src directory, and programs brought in from elsewhere or compiled locally in the bin directory.
  5. Name all files to reflect their content or function.
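
The layout recommended above can be sketched in R as follows. This is only an illustration under stated assumptions: the project name my_project is hypothetical, and the folders are created inside a temporary directory so that running the sketch does not touch your real files.

```r
# Hypothetical project root, placed in a temporary directory for safety
project_root <- file.path(tempdir(), "my_project")

# Directory names taken from the recommendations above
subdirs <- c("doc", "data", "results", "src", "bin")

# Create each sub-directory (recursive = TRUE also creates the project root)
for (d in subdirs) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

# List what was created, relative to the project root
list.dirs(project_root, full.names = FALSE, recursive = FALSE)
```

In a real project you would replace tempdir() with the folder that holds your .Rproj file.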

Separate function definition and application


One of the more effective ways to work with R is to start by writing +the code you want to run directly in a .R script, and then running the +selected lines (either using the keyboard shortcuts in RStudio or +clicking the “Run” button) in the interactive R console.


When your project is in its early stages, the initial .R script file usually contains many lines of directly executed code. As it matures, reusable chunks get pulled into their own functions. It’s a good idea to keep these in two separate folders: one to store useful functions that you’ll reuse across analyses and projects, and one to store the analysis scripts.
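
A minimal sketch of this separation, assuming a hypothetical file name functions.R and a made-up helper correct_weight(). The function file is written to a temporary directory here only so the example is self-contained; in a project it would live in its own folder and be edited by hand.

```r
# --- function definitions: would normally live in their own file ---
functions_file <- file.path(tempdir(), "functions.R")   # hypothetical path
writeLines(c(
  "# Add a fixed offset to a vector of weights",
  "correct_weight <- function(weight, offset = 2) {",
  "  weight + offset",
  "}"
), functions_file)

# --- analysis script: loads the definitions, then applies them ---
source(functions_file)
corrected <- correct_weight(c(2.1, 5.0, 3.2))
corrected
```

Because the analysis script only calls source(), the same function file can be reused by other scripts without copying code.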


Save the data in the data directory


Now that we have a good directory structure, we will save the data file in the data/ directory.

+ +

Challenge 3 +


Download the gapminder data from here.

  1. Download the file (right mouse click on the link above -> “Save link as” / “Save file as”, or click on the link and after the page loads, press Ctrl+S or choose File -> “Save page as”)
  2. Make sure it’s saved under the name gapminder_data.csv
  3. Save the file in the data/ folder within your project.

We will load and inspect these data later.

+ +

Challenge 4 +


It is useful to get some general idea about the dataset, directly +from the command line, before loading it into R. Understanding the +dataset better will come in handy when making decisions on how to load +it in R. Use the command-line shell to answer the following +questions:

  1. What is the size of the file?
  2. How many rows of data does it contain?
  3. What kinds of values are stored in this file?
+ +

By running these commands in the shell:


SH +

ls -lh data/gapminder_data.csv


-rw-r--r-- 1 runner docker 80K Oct 26 09:54 data/gapminder_data.csv

The file size is 80K.


SH +

wc -l data/gapminder_data.csv


1705 data/gapminder_data.csv

There are 1705 lines. The data looks like:


SH +

head data/gapminder_data.csv
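
The same three checks can also be approximated from within R. The sketch below is an illustration only: it builds a tiny stand-in CSV with toy values in a temporary directory rather than using the gapminder file, and the variable names are chosen for this example.

```r
# Build a small stand-in CSV (toy values, not the gapminder data)
csv_path <- file.path(tempdir(), "toy-data.csv")
writeLines(c("coat,weight", "calico,2.1", "black,5.0"), csv_path)

file.size(csv_path)                      # size in bytes, comparable to `ls -l`
n_lines <- length(readLines(csv_path))   # number of lines, comparable to `wc -l`
n_lines - 1                              # rows of data once the header line is excluded
readLines(csv_path, n = 2)               # first lines, comparable to `head`
```

Note that a line count includes the header line, so the number of data rows is one less than the number of lines when the file has a header.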


+ +

Tip: command line in RStudio +


The Terminal tab in the console pane provides a convenient place within RStudio to interact directly with the command line.


Working directory


Knowing R’s current working directory is important because when you +need to access other files (for example, to import a data file), R will +look for them relative to the current working directory.


Each time you create a new RStudio Project, it will create a new +directory for that project. When you open an existing +.Rproj file, it will open that project and set R’s working +directory to the folder that file is in.

+ +

Challenge 5 +


You can check the current working directory with the +getwd() command, or by using the menus in RStudio.

  1. In the console, type getwd() (“wd” is short for “working directory”) and hit Enter.
  2. In the Files pane, double click on the data folder to open it (or navigate to any other folder you wish). To get the Files pane back to the current working directory, click “More” and then select “Go To Working Directory”.

You can change the working directory with setwd(), or by +using RStudio menus.

  1. In the console, type setwd("data") and hit Enter. Type getwd() and hit Enter to see the new working directory.
  2. In the menus at the top of the RStudio window, click the “Session” menu button, and then select “Set Working Directory” and then “Choose Directory”. Next, in the file browser window that opens, navigate back to the project directory, and click “Open”. Note that a setwd command will automatically appear in the console.
+ +

Tip: File does not exist errors +


When you’re attempting to reference a file in your R code and you’re +getting errors saying the file doesn’t exist, it’s a good idea to check +your working directory. You need to either provide an absolute path to +the file, or you need to make sure the file is saved in the working +directory (or a subfolder of the working directory) and provide a +relative path.
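
A small sketch of how such a check might look, using file.exists() and normalizePath(). A temporary directory stands in for the project directory, and the file name wd-demo.csv is made up for the example.

```r
old_wd <- getwd()
setwd(tempdir())                  # pretend this is the project directory

# Create a data/ sub-folder containing one small file
dir.create("data", showWarnings = FALSE)
writeLines("coat,weight", file.path("data", "wd-demo.csv"))

# A relative path is resolved against the current working directory
found_relative <- file.exists(file.path("data", "wd-demo.csv"))

# Leaving out the sub-folder fails: the file is not directly in the working directory
found_bare <- file.exists("wd-demo.csv")

# An absolute path keeps working after the working directory changes
absolute_path <- normalizePath(file.path("data", "wd-demo.csv"))
setwd(old_wd)                     # restore the original working directory
found_absolute <- file.exists(absolute_path)

c(found_relative, found_bare, found_absolute)
```

Running getwd() and file.exists() before reading a file is a quick way to tell a wrong path apart from a genuinely missing file.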


Version Control


It is important to use version control with projects. Go here +for a good lesson which describes using Git with RStudio.

+ +

Keypoints +

  • Use RStudio to create and manage projects with consistent layout.
  • Treat raw data as read-only.
  • Treat generated output as disposable.
  • Separate function definition and application.
+ + +
+ + + diff --git a/instructor/03-seeking-help.html b/instructor/03-seeking-help.html new file mode 100644 index 000000000..1e66c24ff --- /dev/null +++ b/instructor/03-seeking-help.html @@ -0,0 +1,861 @@ + +R for Reproducible Scientific Analysis: Seeking Help +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Seeking Help


Last updated on 2023-10-26

+ + + +

Estimated time 20 minutes

+ +
+ +
+ + + +




  • How can I get help in R?
  • +


  • To be able to read R help files for functions and special operators.
  • To be able to use CRAN task views to identify packages to solve a problem.
  • To be able to seek help from your peers.

Reading Help Files +


R, and every package, provide help files for functions. The general syntax to search for help on any function, “function_name”, that is in a package loaded into your namespace (your interactive R session) is:


R +


For example take a look at the help file for +write.table(), we will be using a similar function in an +upcoming episode.


R +


This will load up a help page in RStudio (or as plain text in R +itself).
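
The help calls themselves are not shown above. As a minimal sketch, assuming the standard help() interface from the utils package: at the console, ?write.table and help(write.table) are equivalent. The sketch stores the result in a variable (h, a name chosen here) so that it also runs outside an interactive session; printing h, or typing either form directly, displays the page.

```r
# At the console you would normally type one of:
#   ?write.table
#   help(write.table)

# help() returns an object pointing at the matching help file(s);
# printing that object is what displays the page
h <- help("write.table")
length(h) > 0      # TRUE when a help page was found
```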


Each help page is broken down into sections:

  • Description: An extended description of what the function does.
  • Usage: The arguments of the function and their default values (which can be changed).
  • Arguments: An explanation of the data each argument is expecting.
  • Details: Any important details to be aware of.
  • Value: The data the function returns.
  • See Also: Any related functions you might find useful.
  • Examples: Some examples for how to use the function.

Different functions might have different sections, but these are the +main ones you should be aware of.


Notice how related functions might call for the same help file:


R +


This is because these functions have very similar applicability and +often share the same arguments as inputs to the function, so package +authors often choose to document them together in a single help +file.

+ +

Tip: Running Examples +


From within the function help page, you can highlight code in the +Examples and hit Ctrl+Return to run it in RStudio +console. This gives you a quick way to get a feel for how a function +works.

+ +

Tip: Reading Help Files +


One of the most daunting aspects of R is the large number of +functions available. It would be prohibitive, if not impossible to +remember the correct usage for every function you use. Luckily, using +the help files means you don’t have to remember that!


Special Operators +


To seek help on special operators, use quotes or backticks:


R +


Getting Help with Packages +


Many packages come with “vignettes”: tutorials and extended example +documentation. Without any arguments, vignette() will list +all vignettes for all installed packages; +vignette(package="package-name") will list all available +vignettes for package-name, and +vignette("vignette-name") will open the specified +vignette.


If a package doesn’t have any vignettes, you can usually find help by +typing help("package-name").


RStudio also has a set of excellent cheatsheets for +many packages.


When You Remember Part of the Function Name +


If you’re not sure what package a function is in or how it’s +specifically spelled, you can do a fuzzy search:


R +


A fuzzy search is when you search for an approximate string match. +For example, you may remember that the function to set your working +directory includes “set” in its name. You can do a fuzzy search to help +you identify the function:
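
A hedged sketch of this kind of search. The ?? form is shown as a comment because it opens a results page interactively. apropos() is a related base R function that is not part of the lesson text; it is added here as an assumption about what may be useful, since it returns matching object names as a plain character vector.

```r
# At the console, the fuzzy search described above would be typed as:
#   ??set                  # shorthand for help.search("set")

# apropos() returns the names of visible objects that match a pattern
matches <- apropos("setw")
"setwd" %in% matches
```

help.search() looks through help page titles and keywords, while apropos() only matches object names, so the two can return different results.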


R +


When You Have No Idea Where to Begin +


If you don’t know what function or package you need to use, CRAN Task Views is a specially maintained list of packages grouped into fields. This can be a good starting point.


When Your Code Doesn’t Work: Seeking Help from Your Peers +


If you’re having trouble using a function, 9 times out of 10 the question you have has already been answered on Stack Overflow. You can search using the [r] tag. Please make sure to see their page on how to ask a good question.


If you can’t find the answer, there are a few useful functions to +help you ask your peers:


R +


dput() will dump the data you’re working with into a format that can be copied and pasted by others into their own R session.


R +



R version 4.3.1 (2023-06-16)
+Platform: x86_64-pc-linux-gnu (64-bit)
+Running under: Ubuntu 22.04.3 LTS
+Matrix products: default
+BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
+LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
+ [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
+ [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
+time zone: UTC
+tzcode source: system (glibc)
+attached base packages:
+[1] stats     graphics  grDevices utils     datasets  methods   base     
+loaded via a namespace (and not attached):
+[1] compiler_4.3.1    tools_4.3.1       rstudioapi_0.15.0 yaml_2.3.7       
+[5] knitr_1.43        xfun_0.40         renv_1.0.3        evaluate_0.21    

sessionInfo() will print out your current version of R, as well as any packages you have loaded. This can be useful for others to help reproduce and debug your issue.
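
A minimal sketch of how these two helpers might be used together when preparing a question for others. The toy data frame is defined inside the block so the example is self-contained; the names dumped and rebuilt are chosen for this illustration.

```r
# A small toy data frame to share
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

# dput() prints R code that recreates the object; capture it as text
dumped <- capture.output(dput(cats))

# Someone else can paste that text into their session to rebuild the data
rebuilt <- eval(parse(text = dumped))
all.equal(cats, rebuilt)

# sessionInfo() describes the R version and loaded packages
info <- sessionInfo()
info$R.version$version.string
```

Pasting the dput() text and the sessionInfo() output into a question gives others what they need to reproduce the problem.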

+ +

Challenge 1 +


Look at the help page for the c function. What kind of +vector do you expect will be created if you evaluate the following:


R +

+c(1, 2, 3)
+c('d', 'e', 'f')
+c(1, 2, 'f')
+ +

The c() function creates a vector, in which all elements +are of the same type. In the first case, the elements are numeric, in +the second, they are characters, and in the third they are also +characters: the numeric values are “coerced” to be characters.

+ +

Challenge 2 +


Look at the help for the paste function. You will need +to use it later. What’s the difference between the sep and +collapse arguments?

+ +

To look at the help for the paste() function, use:


R +


The difference between sep and collapse is +a little tricky. The paste function accepts any number of +arguments, each of which can be a vector of any length. The +sep argument specifies the string used between concatenated +terms — by default, a space. The result is a vector as long as the +longest argument supplied to paste. In contrast, +collapse specifies that after concatenation the elements +are collapsed together using the given separator, the result +being a single string.


It is important to call the arguments explicitly by typing out the argument name, e.g. sep = ",", so the function understands to use the “,” as a separator and not as a term to concatenate. For example:


R +

+paste(c("a","b"), "c")


[1] "a c" "b c"

R +

+paste(c("a","b"), "c", ",")


[1] "a c ," "b c ,"

R +

+paste(c("a","b"), "c", sep = ",")


[1] "a,c" "b,c"

R +

+paste(c("a","b"), "c", collapse = "|")


[1] "a c|b c"

R +

+paste(c("a","b"), "c", sep = ",", collapse = "|")


[1] "a,c|b,c"

(For more information, scroll to the bottom of the +?paste help page and look at the examples, or try +example('paste').)

+ +

Challenge 3 +


Use help to find a function (and its associated parameters) that you +could use to load data from a tabular file in which columns are +delimited with “\t” (tab) and the decimal point is a “.” (period). This +check for decimal separator is important, especially if you are working +with international colleagues, because different countries have +different conventions for the decimal point (i.e. comma vs period). +Hint: use ??"read table" to look up functions related to +reading in tabular data.

+ +

The standard R function for reading tab-delimited files with a period +decimal separator is read.delim(). You can also do this with +read.table(file, sep="\t") (the period is the +default decimal separator for read.table()), +although you may have to change the comment.char argument +as well if your data file contains hash (#) characters.
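
A minimal sketch comparing the two approaches on a small tab-delimited file. The file and its toy values are created in a temporary directory for the example. Note one assumption made explicit here: read.table() does not treat the first line as a header by default, so header = TRUE is added to match the behaviour of read.delim().

```r
# Write a small tab-delimited file with a header line (toy values)
tsv_path <- file.path(tempdir(), "toy.tsv")
writeLines(c("coat\tweight", "calico\t2.1", "black\t5.0"), tsv_path)

# read.delim() assumes tabs, a header line, and "." as the decimal point
via_delim <- read.delim(tsv_path)

# read.table() needs the separator and the header spelled out
via_table <- read.table(tsv_path, sep = "\t", header = TRUE, dec = ".")

all.equal(via_delim, via_table)
str(via_delim)
```

For files that use a comma as the decimal separator, the dec argument can be changed accordingly.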


Other Resources +

+ +

Keypoints +

  • Use help() to get online help in R.
+ + +
+ + + diff --git a/instructor/04-data-structures-part1.html b/instructor/04-data-structures-part1.html new file mode 100644 index 000000000..6c12f00c1 --- /dev/null +++ b/instructor/04-data-structures-part1.html @@ -0,0 +1,2397 @@ + +R for Reproducible Scientific Analysis: Data Structures +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Data Structures


Last updated on 2023-10-26

+ + + +

Estimated time 55 minutes

+ +
+ +
+ + + +




  • How can I read data in R?
  • What are the basic data types in R?
  • How do I represent categorical information in R?


  • To be able to identify the 5 main data types.
  • To begin exploring data frames, and understand how they are related to vectors and lists.
  • To be able to ask questions from R about the type, class, and structure of an object.
  • To understand the information of the attributes “names”, “class”, and “dim”.

One of R’s most powerful features is its ability to deal with tabular +data - such as you may already have in a spreadsheet or a CSV file. +Let’s start by making a toy dataset in your data/ +directory, called feline-data.csv:


R +

+cats <- data.frame(coat = c("calico", "black", "tabby"),
+                    weight = c(2.1, 5.0, 3.2),
+                    likes_string = c(1, 0, 1))

We can now save cats as a CSV file. It is good practice +to call the argument names explicitly so the function knows what default +values you are changing. Here we are setting +row.names = FALSE. Recall you can use +?write.csv to pull up the help file to check out the +argument names and their default values.


R +

+write.csv(x = cats, file = "data/feline-data.csv", row.names = FALSE)

The contents of the new file, feline-data.csv:


R +

+ +

Tip: Editing Text files in R +


Alternatively, you can create data/feline-data.csv using +a text editor (Nano), or within RStudio with the File -> New +File -> Text File menu item.


We can load this into R via the following:


R +

+cats <- read.csv(file = "data/feline-data.csv")


    coat weight likes_string
+1 calico    2.1            1
+2  black    5.0            0
+3  tabby    3.2            1

The read.table function is used for reading in tabular data stored in a text file where the columns of data are separated by punctuation characters, such as CSV files (csv = comma-separated values). Tabs and commas are the most common punctuation characters used to separate or delimit data points in text files. For convenience R provides two other versions of read.table: read.csv for files where the data are separated with commas, and read.delim for files where the data are separated with tabs. Of these three functions read.csv is the most commonly used. If needed, it is possible to override the default delimiting punctuation marks for both read.csv and read.delim.

+ +

Check your data for factors +


In recent versions of R, the default way that R handles text data has changed. Text data used to be converted automatically into a format called “factors”, but there is a simpler format called “character”. We will hear about factors later, and what to use them for. For now, remember that in most cases they are not needed and only complicate your life, which is why newer R versions read in text as “character”. Check now whether your version of R has automatically created factors, and if so convert them to “character” format:

  1. Check the data types of your input by typing str(cats)
  2. In the output, look at the type codes after the colons: if you see only “num”, “int” and “chr”, you can continue with the lesson and skip this box. If you find “Factor”, continue to step 3.
  3. Prevent R from automatically creating “factor” data. That can be done by the following code: options(stringsAsFactors = FALSE). Then, re-read the cats table for the change to take effect.
  4. You must set this option every time you restart R. So that you do not forget, include it in your analysis script before you read in any data, for example in one of the first lines.
  5. From R version 4.0.0 onward, text data is no longer converted to factors by default, so you can install this or a newer version to avoid the problem. If you are working on an institute or company computer, ask your administrator to do it.
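
As an alternative sketch to setting a global option, the stringsAsFactors argument of read.csv() can be set per call. The example below reads the same toy table twice from an in-memory string (csv_text, a name chosen here) so the difference between the two settings is visible regardless of your R version.

```r
# The toy cats table as CSV text, so no file is needed
csv_text <- "coat,weight,likes_string\ncalico,2.1,1\nblack,5.0,0\ntabby,3.2,1"

# Text columns kept as character (the default from R 4.0.0 onward)
cats_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(cats_chr$coat)    # "character"

# Text columns converted to factors (the default before R 4.0.0)
cats_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(cats_fct$coat)    # "factor"

str(cats_chr)
```

Setting the argument explicitly in each read.csv() call makes a script behave the same way on old and new versions of R.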

We can begin exploring our dataset right away, pulling out columns by +specifying them using the $ operator:


R +



[1] 2.1 5.0 3.2

R +



[1] "calico" "black"  "tabby" 

We can do other operations on the columns:


R +

+## Say we discovered that the scale weighs two kg light:
+cats$weight + 2


[1] 4.1 7.0 5.2

R +

+paste("My cat is", cats$coat)


[1] "My cat is calico" "My cat is black"  "My cat is tabby" 

But what about


R +

+cats$weight + cats$coat


Error in cats$weight + cats$coat: non-numeric argument to binary operator

Understanding what happened here is key to successfully analyzing +data in R.


Data Types


If you guessed that the last command will return an error because +2.1 plus "black" is nonsense, you’re right - +and you already have some intuition for an important concept in +programming called data types. We can ask what type of data +something is:


R +



[1] "double"

There are 5 main types: double, integer, +complex, logical and character. +For historic reasons, double is also called +numeric.


R +



[1] "double"

R +

+typeof(1L) # The L suffix forces the number to be an integer, since by default R uses float numbers


[1] "integer"

R +



[1] "complex"

R +



[1] "logical"

R +



[1] "character"
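
Several of the calls that produced the outputs above are not shown. As a minimal sketch, with literal values chosen here purely for illustration, the following calls yield one result of each of the five main types:

```r
typeof(3.14)       # "double"
typeof(1L)         # "integer"
typeof(1+1i)       # "complex"
typeof(TRUE)       # "logical"
typeof('banana')   # "character"
```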

No matter how complicated our analyses become, all data in R is +interpreted as one of these basic data types. This strictness has some +really important consequences.


A user has added details of another cat. This information is in the +file data/feline-data_v2.csv.


R +


R +

+tabby,2.3 or 2.4,1

Load the new cats data like before, and check what type of data we +find in the weight column:


R +

+cats <- read.csv(file="data/feline-data_v2.csv")


[1] "character"

Oh no, our weights aren’t the double type anymore! If we try to do +the same math we did on them before, we run into trouble:


R +

+cats$weight + 2


Error in cats$weight + 2: non-numeric argument to binary operator

What happened? The cats data we are working with is something called a data frame. Data frames are one of the most common and versatile types of data structures we will work with in R. A given column in a data frame cannot be composed of different data types. In this case, R cannot read everything in the data frame column weight as a double, so the data type of the entire column changes to something that is suitable for everything in the column.


When R reads a csv file, it reads it in as a data frame. +Thus, when we loaded the cats csv file, it is stored as a +data frame. We can recognize data frames by the first row that is +written by the str() function:


R +



'data.frame':	4 obs. of  3 variables:
+ $ coat        : chr  "calico" "black" "tabby" "tabby"
+ $ weight      : chr  "2.1" "5" "3.2" "2.3 or 2.4"
+ $ likes_string: int  1 0 1 1

Data frames are composed of rows and columns, where each +column has the same number of rows. Different columns in a data frame +can be made up of different data types (this is what makes them so +versatile), but everything in a given column needs to be the same type +(e.g., vector, factor, or list).


Let’s explore more about different data structures and how they +behave. For now, let’s remove that extra line from our cats data and +reload it, while we investigate this behavior further:




And back in RStudio:


R +

+cats <- read.csv(file="data/feline-data.csv")

Vectors and Type Coercion


To better understand this behavior, let’s meet another of the data +structures: the vector.


R +

+my_vector <- vector(length = 3)



A vector in R is essentially an ordered list of things, with the +special condition that everything in the vector must be the same +basic data type. If you don’t choose the datatype, it’ll default to +logical; or, you can declare an empty vector of whatever +type you like.


R +

+another_vector <- vector(mode='character', length=3)


[1] "" "" ""

You can check if something is a vector:


R +



 chr [1:3] "" "" ""

The somewhat cryptic output from this command indicates the basic +data type found in this vector - in this case chr, +character; an indication of the number of things in the vector - +actually, the indexes of the vector, in this case [1:3]; +and a few examples of what’s actually in the vector - in this case empty +character strings. If we similarly do


R +



 num [1:3] 2.1 5 3.2

we see that cats$weight is a vector, too - the +columns of data we load into R data.frames are all vectors, and +that’s the root of why R forces everything in a column to be the same +basic data type.

+ +

Discussion 1 +


Why is R so opinionated about what we put in our columns of data? How +does this help us?

+ +

By keeping everything in a column the same, we allow ourselves to +make simple assumptions about our data; if you can interpret one entry +in the column as a number, then you can interpret all of them +as numbers, so we don’t have to check every time. This consistency is +what people mean when they talk about clean data; in the long +run, strict consistency goes a long way to making our lives easier in +R.


Coercion by combining vectors


You can also make vectors with explicit contents with the combine +function:


R +

+combine_vector <- c(2,6,3)


[1] 2 6 3

Given what we’ve learned so far, what do you think the following will +produce?


R +

+quiz_vector <- c(2,6,'3')

This is something called type coercion, and it is the source +of many surprises and the reason why we need to be aware of the basic +data types and how R will interpret them. When R encounters a mix of +types (here double and character) to be combined into a single vector, +it will force them all to be the same type. Consider:


R +

+coercion_vector <- c('a', TRUE)


[1] "a"    "TRUE"

R +

+another_coercion_vector <- c(0, TRUE)


[1] 0 1

The type hierarchy


The coercion rules go: logical -> +integer -> double (“numeric”) +-> complex -> character, where -> can +be read as are transformed into. For example, combining +logical and character transforms the result to +character:


R +

+c('a', TRUE)


[1] "a"    "TRUE"

A quick way to recognize character vectors is by the +quotes that enclose them when they are printed.
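
Each step of the hierarchy can be checked with typeof(). This is a small sketch with values chosen only for illustration; in every case the result takes the type furthest along the chain among the elements combined.

```r
typeof(c(TRUE, 1L))             # "integer": logical is promoted to integer
typeof(c(1L, 2.5))              # "double": integer is promoted to double
typeof(c(2.5, 1+0i))            # "complex": double is promoted to complex
typeof(c(1+0i, "a"))            # "character": complex is promoted to character
typeof(c(TRUE, 1L, 2.5, "a"))   # "character": the type furthest along the chain wins
```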


You can try to force coercion against this flow using the +as. functions:


R +

+character_vector_example <- c('0','2','4')


[1] "0" "2" "4"

R +

+character_coerced_to_double <- as.double(character_vector_example)


[1] 0 2 4

R +

+double_coerced_to_logical <- as.logical(character_coerced_to_double)
+double_coerced_to_logical


[1] FALSE  TRUE  TRUE



As you can see, some surprising things can happen when R forces one +basic data type into another! Nitty-gritty of type coercion aside, the +point is: if your data doesn’t look like what you thought it was going +to look like, type coercion may well be to blame; make sure everything +is the same type in your vectors and your columns of data.frames, or you +will get nasty surprises!
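A few more explicit conversions that tend to surprise people are sketched below. The input values are made up for illustration; the point is that a failed conversion gives NA rather than an error.

```r
# Numbers to logical: only zero becomes FALSE, every other number is TRUE.
numbers_to_logical <- as.logical(c(0, 2, -1))

# Text to logical: only spellings such as "TRUE" or "false" are understood;
# anything else (including "yes" and "0") becomes NA.
text_to_logical <- as.logical(c("TRUE", "yes", "0"))

# Text to number: text that does not look like a number becomes NA.
# R also issues a warning, which is suppressed here to keep the output tidy.
text_to_number <- suppressWarnings(as.double(c("2.1", "three")))

numbers_to_logical
text_to_logical
text_to_number
```

Checking for unexpected NA values with is.na() after a conversion is one way to catch these cases early.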


But coercion can also be very useful! For example, in our +cats data likes_string is numeric, but we know +that the 1s and 0s actually represent TRUE and +FALSE (a common way of representing them). We should use +the logical datatype here, which has two states: +TRUE or FALSE, which is exactly what our data +represents. We can ‘coerce’ this column to be logical by +using the as.logical function:


R +

+cats$likes_string

[1] 1 0 1

R +

+cats$likes_string <- as.logical(cats$likes_string)
+cats$likes_string


[1]  TRUE FALSE  TRUE


+ +

Challenge 1 +


An important part of every data analysis is cleaning the input data. If you know that the input data is all of the same format (e.g. numbers), your analysis is much easier! Clean the cat data set from the chapter about type coercion.


Copy the code template


Create a new script in RStudio and copy and paste the following code. +Then move on to the tasks below, which help you to fill in the gaps +(______).

# Read data
+cats <- read.csv("data/feline-data_v2.csv")
+# 1. Print the data
+# 2. Show an overview of the table with all data types
+# 3. The "weight" column has the incorrect data type __________.
+#    The correct data type is: ____________.
+# 4. Correct the 4th weight data point with the mean of the two given values
+cats$weight[4] <- 2.35
+#    print the data again to see the effect
+# 5. Convert the weight to the right data type
+cats$weight <- ______________(cats$weight)
+#    Calculate the mean to test yourself
+# If you see the correct mean value (and not NA), you did the exercise
+# correctly!

Instructions for the tasks

+ +

Execute the first statement (read.csv(...)). Then print +the data to the console

+ +

Show the content of any variable by typing its name.


Solution to Challenge 1.1


Two correct solutions:

cats
+print(cats)

+ +

2. Overview of the data types +


The data type of your data is as important as the data itself. Use a +function we saw earlier to print out the data types of all columns of +the cats table.

+ +

In the chapter “Data types” we saw two functions that can show data +types. One printed just a single word, the data type name. The other +printed a short form of the data type, and the first few values. We need +the second here.

+ +

Challenge 1 (continued) +


Solution to Challenge 1.2

str(cats)


3. Which data type do we need?


The shown data type is not the right one for this data (weight of a +cat). Which data type do we need?

  • Why did the read.csv() function not choose the correct data type?
  • Fill in the gap in the comment with the correct data type for cat weight!
+ +

Scroll up to the section about the type +hierarchy to review the available data types

+ +
  • Weight is expressed on a continuous scale (real numbers). The R data type for this is “double” (also known as “numeric”).
  • The fourth row has the value “2.3 or 2.4”. That is not a single number but two numbers and an English word. Therefore, the “character” data type is chosen. The whole column is now text, because all values in the same column have to be the same data type.
+ +

4. Correct the problematic value +


The code to assign a new weight value to the problematic fourth row +is given. Think first and then execute it: What will be the data type +after assigning a number like in this example? You can check the data +type after executing to see if you were right.

+ +

Revisit the hierarchy of data types when two different data types are +combined.

+ +

Challenge 1 (continued) +


Solution to challenge 1.4


The data type of the column “weight” is “character”. The assigned +data type is “double”. Combining two data types yields the data type +that is higher in the following hierarchy:

logical < integer < double < complex < character

Therefore, the column is still of type character! We need to manually convert it to “double”.


5. Convert the column “weight” to the correct data type


Cat weights are numbers. But the column does not have this data type yet. Coerce the column to floating point numbers.

+ +

The functions to convert data types start with as.. You +can look for the function further up in the manuscript or use the +RStudio auto-complete function: Type “as.” and then press +the TAB key.

+ +

Challenge 1 (continued) +


Solution to Challenge 1.5


There are two functions that are synonymous for historic reasons:

cats$weight <- as.double(cats$weight)
+cats$weight <- as.numeric(cats$weight)
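The whole cleaning sequence can be sketched without the CSV file. The data frame below is a stand-in built in memory (cats_demo and its values are assumptions for illustration, not the contents of the lesson's file), but it shows the same effect: assigning a number into a character column does not change the column's type, and an explicit conversion is needed.

```r
# Stand-in for the CSV: one bad entry forces the whole weight column to text.
cats_demo <- data.frame(
  coat = c("calico", "black", "tabby", "tabby"),
  weight = c("2.1", "5.0", "3.2", "2.3 or 2.4"),
  stringsAsFactors = FALSE
)

cats_demo$weight[4] <- 2.35           # the number is coerced to the text "2.35"
type_after_fix <- typeof(cats_demo$weight)

cats_demo$weight <- as.numeric(cats_demo$weight)
type_after_conversion <- typeof(cats_demo$weight)

type_after_fix          # "character"
type_after_conversion   # "double"
mean(cats_demo$weight)  # 3.1625
```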

Some basic vector functions


The combine function, c(), will also append things to an +existing vector:


R +

+ab_vector <- c('a', 'b')


[1] "a" "b"

R +

+combine_example <- c(ab_vector, 'SWC')


[1] "a"   "b"   "SWC"

You can also make series of numbers:


R +

+mySeries <- 1:10


 [1]  1  2  3  4  5  6  7  8  9 10

R +

+seq(10)

 [1]  1  2  3  4  5  6  7  8  9 10

R +

+seq(1,10, by=0.1)


 [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4
+[16]  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9
+[31]  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4
+[46]  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
+[61]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4
+[76]  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9
+[91] 10.0

We can ask a few questions about vectors:


R +

+sequence_example <- 20:25
+head(sequence_example, n=2)


[1] 20 21

R +

+tail(sequence_example, n=4)


[1] 22 23 24 25

R +

+length(sequence_example)


[1] 6

R +

+typeof(sequence_example)

[1] "integer"

We can get individual elements of a vector by using the bracket +notation:


R +

+first_element <- sequence_example[1]


[1] 20

To change a single element, use the bracket on the other side of the +arrow:


R +

+sequence_example[1] <- 30


[1] 30 21 22 23 24 25
+ +

Challenge 2 +


Start by making a vector with the numbers 1 through 26. Then, +multiply the vector by 2.

+ +

R +

+x <- 1:26
+x <- x * 2



Another data structure you’ll want in your bag of tricks is the +list. A list is simpler in some ways than the other types, +because you can put anything you want in it. Remember everything in +the vector must be of the same basic data type, but a list can have +different data types:


R +

+list_example <- list(1, "a", TRUE, 1+4i)


+[1] 1
+[1] "a"
+[1] TRUE
+[1] 1+4i

When printing the object structure with str(), we see +the data types of all elements:


R +

+str(list_example)

List of 4
+ $ : num 1
+ $ : chr "a"
+ $ : logi TRUE
+ $ : cplx 1+4i

What is the use of lists? They can organize data of different +types. For example, you can organize different tables that +belong together, similar to spreadsheets in Excel. But there are many +other uses, too.


We will see another example that will maybe surprise you in the next +chapter.


To retrieve one of the elements of a list, use the double +bracket:


R +

+list_example[[2]]

[1] "a"

The elements of a list can also have names, which are given by prepending them to the values, separated by an equals sign:


R +

+another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE )


+[1] "Numbers"
+ [1]  1  2  3  4  5  6  7  8  9 10
+[1] TRUE

This results in a named list. Naming gives our object a new capability: we can now access single elements in an additional way!


R +

+another_list$title

[1] "Numbers"
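The different ways of reaching into a named list can be compared side by side. This is a small sketch using a fresh list (demo_list is an illustrative name) so that it runs on its own.

```r
demo_list <- list(title = "Numbers", numbers = 1:10, data = TRUE)

by_dollar <- demo_list$title         # the element itself
by_double <- demo_list[["title"]]    # the same element, selected by name
by_single <- demo_list["title"]      # a smaller list that still wraps the element

typeof(by_dollar)   # "character"
typeof(by_double)   # "character"
typeof(by_single)   # "list"
```

The single bracket always returns a container of the same kind as the one it was applied to, which is why by_single is still a list.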

Names +


With names, we can give meaning to elements. This is the first time that we have not only the data, but also information that describes it. It is metadata that can be stuck to the object like a label. In R, this is called an attribute. Some attributes enable us to do more with our object, for example, like here, accessing an element by a self-defined name.


Accessing vectors and lists by name


We have already seen how to generate a named list. The way to +generate a named vector is very similar. You have seen this function +before:


R +

+pizza_price <- c( pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50 )

The way to retrieve elements is different, though:


R +

+pizza_price["pizzasubito"]


pizzasubito 
+       5.64 

The approach used for the list does not work:


R +

+pizza_price$pizzafresh

Error in pizza_price$pizzafresh: $ operator is invalid for atomic vectors

It will pay off if you remember this error message, because you will meet it in your own analyses. It means that you have just tried accessing an element as if it were in a list, but it is actually in a vector.
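For named vectors, both bracket forms work, with a small difference in what comes back. The sketch below uses its own copy of the prices (pizza_demo is an illustrative name).

```r
pizza_demo <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)

with_name    <- pizza_demo["pizzafresh"]     # keeps the name next to the value
without_name <- pizza_demo[["pizzafresh"]]   # returns just the value

names(with_name)        # "pizzafresh"
names(without_name)     # NULL
is.atomic(pizza_demo)   # TRUE, which is why `$` is rejected for this object
```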


Accessing and changing names


If you are only interested in the names, use the names() +function:


R +

+names(pizza_price)

[1] "pizzasubito" "pizzafresh"  "callapizza" 

We have seen how to access and change single elements of a vector. +The same is possible for names:


R +

+names(pizza_price)[3]

[1] "callapizza"

R +

+names(pizza_price)[3] <- "call-a-pizza"


 pizzasubito   pizzafresh call-a-pizza 
+        5.64         6.60         4.50 
+ +

Challenge 3 +

  • What is the data type of the names of pizza_price? You +can find out using the str() or typeof() +functions.
  • +
+ +

You get the names of an object by wrapping the object name inside names(...). Similarly, you get the data type of the names by again wrapping the whole code in typeof(...):

typeof(names(pizza_price))

Alternatively, use a new variable if this is easier for you to read:

n <- names(pizza_price)
+typeof(n)
+ +

Challenge 4 +


Instead of just changing some of the names a vector/list already has, you can also set all names of an object by writing code like (replace ALL CAPS text):

names( OBJECT ) <- CHARACTER_VECTOR


Create a vector that gives the number for each letter in the +alphabet!

  1. Generate a vector called letter_no with the sequence of numbers from 1 to 26!
  2. R has a built-in object called LETTERS. It is a 26-character vector, from A to Z. Set the names of the number sequence to these 26 letters.
  3. Test yourself by calling letter_no["B"], which should give you the number 2!
+ +
letter_no <- 1:26   # or seq(1,26)
+names(letter_no) <- LETTERS

Data frames +


We have seen data frames at the very beginning of this lesson; they represent a table of data. We didn’t go much further into detail with our example cat data frame:


R +

+cats

    coat weight likes_string
+1 calico    2.1         TRUE
+2  black    5.0        FALSE
+3  tabby    3.2         TRUE

We can now understand something a bit surprising in our data.frame; +what happens if we run:


R +

+typeof(cats)

[1] "list"

We see that data.frames look like lists ‘under the hood’. Think again +what we heard about what lists can be used for:


Lists organize data of different types


Columns of a data frame are vectors of different types, that are +organized by belonging to the same table.


A data.frame is really a list of vectors. It is a special list in +which all the vectors must have the same length.


How is this “special”-ness written into the object, so that R does +not treat it like any other list, but as a table?


R +

+class(cats)

[1] "data.frame"

A class, just like names, is an attribute attached +to the object. It tells us what this object means for humans.


You might wonder: Why do we need another +what-type-of-object-is-this-function? We already have +typeof()? That function tells us how the object is +constructed in the computer. The class is +the meaning of the object for humans. Consequently, +what typeof() returns is fixed in R (mainly the +five data types), whereas the output of class() is +diverse and extendable by R packages.
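The difference between the two functions can be made visible by removing the class attribute. This is a sketch on a tiny data frame (demo_frame is an illustrative name); unclass() is used here only to show what is underneath.

```r
demo_frame <- data.frame(x = 1:3)

typeof(demo_frame)   # "list": how the object is stored
class(demo_frame)    # "data.frame": what the object means

# Stripping the class attribute leaves the underlying list behind.
plain <- unclass(demo_frame)
class(plain)           # "list"
is.data.frame(plain)   # FALSE
```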


In our cats example, we have a character, a double and a logical variable. As we have seen already, each column of a data.frame is a vector.


R +

+cats$coat


[1] "calico" "black"  "tabby" 

R +

+cats[,1]


[1] "calico" "black"  "tabby" 

R +

+typeof(cats[,1])


[1] "character"

R +

+str(cats[,1])

 chr [1:3] "calico" "black" "tabby"

Each row is an observation of different variables, itself a +data.frame, and thus can be composed of elements of different types.


R +

+cats[1,]


    coat weight likes_string
+1 calico    2.1         TRUE

R +

+typeof(cats[1,])


[1] "list"

R +

+str(cats[1,])

'data.frame':	1 obs. of  3 variables:
+ $ coat        : chr "calico"
+ $ weight      : num 2.1
+ $ likes_string: logi TRUE
+ +

Challenge 5 +


There are several subtly different ways to call variables, +observations and elements from data.frames:

  • cats[1]
  • cats[[1]]
  • cats$coat
  • cats["coat"]
  • cats[1, 1]
  • cats[, 1]
  • cats[1, ]

Try out these examples and explain what is returned by each one.


Hint: Use the function typeof() to examine what +is returned in each case.

+ +

R +

+cats[1]


    coat
+1 calico
+2  black
+3  tabby

We can think of a data frame as a list of vectors. The single bracket [1] returns the first slice of the list, as another list. In this case it is the first column of the data frame.


R +

+cats[[1]]


[1] "calico" "black"  "tabby" 

The double bracket [[1]] returns the contents of the list item. In this case it is the contents of the first column, a vector of type character.


R +

+cats$coat


[1] "calico" "black"  "tabby" 

This example uses the $ character to address items by name. coat is the first column of the data frame, again a vector of type character.


R +

+cats["coat"]


    coat
+1 calico
+2  black
+3  tabby

Here we are using a single bracket ["coat"] replacing the index number with the column name. Like example 1, the returned object is a list.


R +

+cats[1, 1]


[1] "calico"

This example uses a single bracket, but this time we provide row and column coordinates. The returned object is the value in row 1, column 1. The object is a vector of type character.


R +

+cats[, 1]


[1] "calico" "black"  "tabby" 

Like the previous example, we use single brackets and provide row and column coordinates. The row coordinate is not specified, so R interprets this missing value as all the elements in this column and returns them as a vector.


R +

+cats[1, ]


    coat weight likes_string
+1 calico    2.1         TRUE

Again we use the single bracket with row and column coordinates. The column coordinate is not specified. The return value is a data frame with a single row (a list under the hood) containing all the values in the first row.
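The seven forms can be summarised by asking for the class of each result. The sketch below rebuilds a small stand-in for the cats table (cats_demo is an illustrative name) and uses class() rather than typeof(), because class() distinguishes a data frame from a plain vector.

```r
cats_demo <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_string = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

class(cats_demo[1])        # "data.frame": single bracket keeps the container
class(cats_demo[[1]])      # "character":  double bracket extracts the column
class(cats_demo$coat)      # "character"
class(cats_demo["coat"])   # "data.frame"
class(cats_demo[1, 1])     # "character":  a single value
class(cats_demo[, 1])      # "character":  a whole column as a vector
class(cats_demo[1, ])      # "data.frame": a whole row
```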

+ +

Tip: Renaming data frame columns +


Data frames have column names, which can be accessed with the +names() function.


R +

+names(cats)

[1] "coat"         "weight"       "likes_string"

If you want to rename the second column of cats, you can +assign a new name to the second element of names(cats).


R +

+names(cats)[2] <- "weight_kg"


    coat weight_kg likes_string
+1 calico       2.1         TRUE
+2  black       5.0        FALSE
+3  tabby       3.2         TRUE



Last but not least is the matrix. We can declare a matrix full of +zeros:


R +

+matrix_example <- matrix(0, ncol=6, nrow=3)


     [,1] [,2] [,3] [,4] [,5] [,6]
+[1,]    0    0    0    0    0    0
+[2,]    0    0    0    0    0    0
+[3,]    0    0    0    0    0    0

What makes it special is the dim() attribute:


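R +

+dim(matrix_example)


[1] 3 6

And similar to other data structures, we can ask things about our matrix:


R +

+typeof(matrix_example)


[1] "double"

R +

+class(matrix_example)


[1] "matrix" "array" 

R +

+str(matrix_example)


 num [1:3, 1:6] 0 0 0 0 0 0 0 0 0 0 ...

R +

+nrow(matrix_example)


[1] 3

R +

+ncol(matrix_example)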

[1] 6
+ +

Challenge 6 +


What do you think will be the result of +length(matrix_example)? Try it. Were you right? Why / why +not?

+ +

R +

+matrix_example <- matrix(0, ncol=6, nrow=3)
+length(matrix_example)


[1] 18

Because a matrix is a vector with added dimension attributes, +length gives you the total number of elements in the +matrix.
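One way to convince yourself of this is to start from a plain vector and add the dimensions afterwards. A minimal sketch, with plain_vector as an illustrative name:

```r
plain_vector <- 1:18
length_before <- length(plain_vector)

# Adding a dim attribute is all it takes to turn the vector into a matrix.
dim(plain_vector) <- c(3, 6)

is.matrix(plain_vector)                 # TRUE
length_after <- length(plain_vector)    # unchanged: still 18 elements
plain_vector[2, 3]                      # 8, because values fill column by column
```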

+ +

Challenge 7 +


Make another matrix, this time containing the numbers 1:50, with 5 +columns and 10 rows. Did the matrix function fill your +matrix by column, or by row, as its default behaviour? See if you can +figure out how to change this. (hint: read the documentation for +matrix!)

+ +

R +

+x <- matrix(1:50, ncol=5, nrow=10)
+x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row
+ +

Challenge 8 +


Create a list of length two containing a character vector for each of +the sections in this part of the workshop:

  • Data types
  • +
  • Data structures
  • +

Populate each character vector with the names of the data types and +data structures we’ve seen so far.

+ +

R +

+dataTypes <- c('double', 'complex', 'integer', 'character', 'logical')
+dataStructures <- c('data.frame', 'vector', 'list', 'matrix')
+answer <- list(dataTypes, dataStructures)

Note: it’s nice to make a list in big writing on the board or taped +to the wall listing all of these types and structures - leave it up for +the rest of the workshop to remind people of the importance of these +basics.

+ +

Challenge 9 +


Consider the R output of the matrix below:



     [,1] [,2]
+[1,]    4    1
+[2,]    9    5
+[3,]   10    7

What was the correct command used to write this matrix? Examine each +command and try to figure out the correct one before typing them. Think +about what matrices the other commands will produce.

  1. matrix(c(4, 1, 9, 5, 10, 7), nrow = 3)
  2. matrix(c(4, 9, 10, 1, 5, 7), ncol = 2, byrow = TRUE)
  3. matrix(c(4, 9, 10, 1, 5, 7), nrow = 2)
  4. matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)
+ +


R +

+matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)
+ +

Keypoints +

  • Use read.csv to read tabular data in R.
  • +
  • The basic data types in R are double, integer, complex, logical, and +character.
  • +
  • Data structures such as data frames or matrices are built on top of +lists and vectors, with some added attributes.
  • +
+ + +
+ + + diff --git a/instructor/05-data-structures-part2.html b/instructor/05-data-structures-part2.html new file mode 100644 index 000000000..7e77d7ef2 --- /dev/null +++ b/instructor/05-data-structures-part2.html @@ -0,0 +1,1210 @@ + +R for Reproducible Scientific Analysis: Exploring Data Frames +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Exploring Data Frames


Last updated on 2023-10-26

+ + + +

Estimated time 30 minutes

+ +
+ +
+ + + +




  • How can I manipulate a data frame?
  • +


  • Add and remove rows or columns.
  • +
  • Append two data frames.
  • +
  • Display basic properties of data frames including size and class of +the columns, names, and first few rows.
  • +

At this point, you’ve seen it all: in the last lesson, we toured all +the basic data types and data structures in R. Everything you do will be +a manipulation of those tools. But most of the time, the star of the +show is the data frame—the table that we created by loading information +from a csv file. In this lesson, we’ll learn a few more things about +working with data frames.


Adding columns and rows in data frames +


We already learned that the columns of a data frame are vectors, so +that our data are consistent in type throughout the columns. As such, if +we want to add a new column, we can start by making a new vector:


R +

+age <- c(2, 3, 5)
+cats


    coat weight likes_string
+1 calico    2.1            1
+2  black    5.0            0
+3  tabby    3.2            1

We can then add this as a column via:


R +

+cbind(cats, age)


    coat weight likes_string age
+1 calico    2.1            1   2
+2  black    5.0            0   3
+3  tabby    3.2            1   5

Note that if we tried to add a vector of ages with a different number +of entries than the number of rows in the data frame, it would fail:


R +

+age <- c(2, 3, 5, 12)
+cbind(cats, age)


Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 4

R +

+age <- c(2, 3)
+cbind(cats, age)


Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 2

Why didn’t this work? Of course, R wants to see one element in our +new column for every row in the table:


R +

+nrow(cats)


[1] 3

R +

+length(age)

[1] 2

So for it to work we need to have nrow(cats) = +length(age). Let’s overwrite the content of cats with our +new data frame.


R +

+age <- c(2, 3, 5)
+cats <- cbind(cats, age)
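The length requirement can also be checked in code before calling cbind(). The sketch below uses stand-in objects (cats_demo and age_demo are illustrative names) and names the new column explicitly.

```r
cats_demo <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  stringsAsFactors = FALSE
)
age_demo <- c(2, 3, 5)

# Only bind the column if there is exactly one value per row.
if (length(age_demo) == nrow(cats_demo)) {
  cats_demo <- cbind(cats_demo, age = age_demo)
} else {
  stop("age_demo needs exactly one entry per row of cats_demo")
}

names(cats_demo)   # "coat" "weight" "age"
```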

Now how about adding rows? We already know that the rows of a data +frame are lists:


R +

+newRow <- list("tortoiseshell", 3.3, TRUE, 9)
+cats <- rbind(cats, newRow)

Let’s confirm that our new row was added correctly.


R +

+cats

           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
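A new row can also be supplied as a one-row data frame with matching column names, which makes it explicit which value belongs to which column. This is an alternative sketch, not the approach used above; cats_demo and new_row are illustrative names.

```r
cats_demo <- data.frame(
  coat = c("calico", "black"),
  weight = c(2.1, 5.0),
  stringsAsFactors = FALSE
)

# The new row carries its own column names, so the order of values cannot be mixed up.
new_row <- data.frame(coat = "tortoiseshell", weight = 3.3,
                      stringsAsFactors = FALSE)

cats_demo <- rbind(cats_demo, new_row)
nrow(cats_demo)   # 3
str(cats_demo)
```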

Removing rows +


We now know how to add rows and columns to our data frame in R. Now +let’s learn to remove rows.


R +

+cats

           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9

We can ask for a data frame minus the last row:


R +

+cats[-4, ]


    coat weight likes_string age
+1 calico    2.1            1   2
+2  black    5.0            0   3
+3  tabby    3.2            1   5

Notice the comma with nothing after it to indicate that we want to +drop the entire fourth row.


Note: we could also remove several rows at once by putting the row +numbers inside of a vector, for example: +cats[c(-3,-4), ]


Removing columns +


We can also remove columns from our data frame. What if we want to remove the column “age”? We can remove it in two ways: by column number or by column name.


R +

+cats[,-4]

           coat weight likes_string
+1        calico    2.1            1
+2         black    5.0            0
+3         tabby    3.2            1
+4 tortoiseshell    3.3            1

Notice the comma with nothing before it, indicating we want to keep +all of the rows.


Alternatively, we can drop the column by using the column name and the %in% operator. The %in% operator goes through each element of its left argument, in this case the names of cats, and asks, “Does this element occur in the second argument?”


R +

+drop <- names(cats) %in% c("age")
+cats[,!drop]


           coat weight likes_string
+1        calico    2.1            1
+2         black    5.0            0
+3         tabby    3.2            1
+4 tortoiseshell    3.3            1

We will cover subsetting with logical operators like +%in% in more detail in the next episode. See the section Subsetting through other logical +operations
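To see what %in% actually produces, it helps to look at the intermediate logical vector. The sketch below works on a stand-in table (cats_demo is an illustrative name) and also shows setdiff() as one alternative way to select the remaining column names.

```r
cats_demo <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  age = c(2, 3, 5),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per column name: is this name in the set to drop?
drop_flags <- names(cats_demo) %in% c("age")
drop_flags   # FALSE FALSE  TRUE

without_age <- cats_demo[, !drop_flags]

# Alternative: keep every column whose name is not "age".
also_without_age <- cats_demo[, setdiff(names(cats_demo), "age")]

names(without_age)   # "coat" "weight"
```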


Appending to a data frame +


The key to remember when adding data to a data frame is that +columns are vectors and rows are lists. We can also glue two +data frames together with rbind:


R +

+cats <- rbind(cats, cats)


           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+5        calico    2.1            1   2
+6         black    5.0            0   3
+7         tabby    3.2            1   5
+8 tortoiseshell    3.3            1   9

But now the row names are unnecessarily complicated. We can remove +the rownames, and R will automatically re-name them sequentially:


R +

+rownames(cats) <- NULL


           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+5        calico    2.1            1   2
+6         black    5.0            0   3
+7         tabby    3.2            1   5
+8 tortoiseshell    3.3            1   9
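Row names are most likely to need resetting after rows have been removed, because the remaining rows keep their old labels. A small sketch with a made-up table (demo is an illustrative name):

```r
demo <- data.frame(id = c("a", "b", "c", "d"), x = 1:4,
                   stringsAsFactors = FALSE)

demo <- demo[-c(2, 3), ]          # drop the second and third rows
names_before <- rownames(demo)    # the survivors keep their old labels

rownames(demo) <- NULL            # reset to a simple sequence
names_after <- rownames(demo)

names_before   # "1" "4"
names_after    # "1" "2"
```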
+ +

Challenge 1 +


You can create a new data frame right from within R with the +following syntax:


R +

+df <- data.frame(id = c("a", "b", "c"),
+                 x = 1:3,
+                 y = c(TRUE, TRUE, FALSE))

Make a data frame that holds the following information for +yourself:

  • first name
  • +
  • last name
  • +
  • lucky number
  • +

Then use rbind to add an entry for the people sitting +beside you. Finally, use cbind to add a column with each +person’s answer to the question, “Is it time for coffee break?”

+ +

R +

+df <- data.frame(first = c("Grace"),
+                 last = c("Hopper"),
+                 lucky_number = c(0))
+df <- rbind(df, list("Marie", "Curie", 238) )
+df <- cbind(df, coffeetime = c(TRUE,TRUE))

Realistic example +


So far, you have seen the basics of manipulating data frames with our +cat data; now let’s use those skills to digest a more realistic dataset. +Let’s read in the gapminder dataset that we downloaded +previously:


R +

+gapminder <- read.csv("data/gapminder_data.csv")
+ +

Miscellaneous Tips +

  • Another type of file you might encounter is the tab-separated value file (.tsv). To specify a tab as a separator, use sep = "\t" in read.csv(), or use read.delim().

  • +
  • Files can also be downloaded directly from the Internet into a +local folder of your choice onto your computer using the +download.file function. The read.csv function +can then be executed to read the downloaded file from the download +location, for example,

  • +

R +

+download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
+gapminder <- read.csv("data/gapminder_data.csv")
  • Alternatively, you can also read in files directly into R from the +Internet by replacing the file paths with a web address in +read.csv. One should note that in doing this no local copy +of the csv file is first saved onto your computer. For example,
  • +

R +

+gapminder <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv")
  • You can read directly from Excel spreadsheets without converting them to plain text first by using the readxl package.

  • +
  • The argument “stringsAsFactors” can be useful to tell R how to read strings either as factors or as character strings. From R 4.0.0 onward, all strings are read in as characters by default, but in earlier versions of R, strings are read in as factors by default. For more information, see the call-out in the previous episode.

  • +
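The two ways of reading a tab-separated file can be tried without downloading anything. The sketch below writes a tiny made-up file to a temporary location (tsv_path and its contents are assumptions for illustration) and reads it back both ways.

```r
# Write a small tab-separated file to a temporary location.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("coat\tweight", "calico\t2.1", "black\t5.0"), tsv_path)

# read.delim() assumes tabs; read.csv() needs to be told about them.
from_delim <- read.delim(tsv_path)
from_csv   <- read.csv(tsv_path, sep = "\t")

identical(from_delim, from_csv)   # TRUE
str(from_delim)
```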

Let’s investigate gapminder a bit; the first thing we should always +do is check out what the data looks like with str:


R +

+str(gapminder)

'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...

An additional method for examining the structure of gapminder is to use the summary function. This function can be used on various objects in R. For data frames, summary yields a numeric, tabular, or descriptive summary of each column. Numeric or integer columns are described by descriptive statistics (quartiles and mean), and character columns by their length, class, and mode.


R +

+summary(gapminder)

   country               year           pop             continent        
+ Length:1704        Min.   :1952   Min.   :6.001e+04   Length:1704       
+ Class :character   1st Qu.:1966   1st Qu.:2.794e+06   Class :character  
+ Mode  :character   Median :1980   Median :7.024e+06   Mode  :character  
+                    Mean   :1980   Mean   :2.960e+07                     
+                    3rd Qu.:1993   3rd Qu.:1.959e+07                     
+                    Max.   :2007   Max.   :1.319e+09                     
+    lifeExp        gdpPercap       
+ Min.   :23.60   Min.   :   241.2  
+ 1st Qu.:48.20   1st Qu.:  1202.1  
+ Median :60.71   Median :  3531.8  
+ Mean   :59.47   Mean   :  7215.3  
+ 3rd Qu.:70.85   3rd Qu.:  9325.5  
+ Max.   :82.60   Max.   :113523.1  

Along with the str and summary functions, +we can examine individual columns of the data frame with our +typeof function:


R +

+typeof(gapminder$year)


[1] "integer"

R +

+typeof(gapminder$country)


[1] "character"

R +

+str(gapminder$country)

 chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...

We can also interrogate the data frame for information about its +dimensions; remembering that str(gapminder) said there were +1704 observations of 6 variables in gapminder, what do you think the +following will produce, and why?


R +

+length(gapminder)

[1] 6

A fair guess would have been to say that the length of a data frame +would be the number of rows it has (1704), but this is not the case; +remember, a data frame is a list of vectors and factors:


R +

+typeof(gapminder)

[1] "list"

When length gave us 6, it’s because gapminder is built +out of a list of 6 columns. To get the number of rows and columns in our +dataset, try:


R +

+nrow(gapminder)


[1] 1704

R +

+ncol(gapminder)


[1] 6

Or, both at once:


R +

+dim(gapminder)

[1] 1704    6
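The relationship between these functions is easier to see on a table small enough to count by eye. A sketch with a made-up data frame (demo is an illustrative name):

```r
demo <- data.frame(a = 1:4, b = letters[1:4], stringsAsFactors = FALSE)

length(demo)   # 2: the number of columns, because a data frame is a list of columns
ncol(demo)     # 2
nrow(demo)     # 4
dim(demo)      # 4 2: rows first, then columns
```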

We’ll also likely want to know what the titles of all the columns +are, so we can ask for them later:


R +

+colnames(gapminder)

[1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"

At this stage, it’s important to ask ourselves if the structure R is +reporting matches our intuition or expectations; do the basic data types +reported for each column make sense? If not, we need to sort any +problems out now before they turn into bad surprises down the road, +using what we’ve learned about how R interprets data, and the importance +of strict consistency in how we record our data.


Once we’re happy that the data types and structures seem reasonable, +it’s time to start digging into our data proper. Check out the first few +lines:


R +

+head(gapminder)

      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134
+ +

Challenge 2 +


It’s good practice to also check the last few lines of your data and +some in the middle. How would you do this?


Searching for ones specifically in the middle isn’t too hard, but we +could ask for a few lines at random. How would you code this?

+ +

To check the last few lines it’s relatively simple as R already has a +function for this:


R +

+tail(gapminder, n = 15)

What about a few arbitrary rows just in case something is odd in the +middle?


Tip: There are several ways to achieve this.


The solution here presents one form of using nested functions, i.e. a +function passed as an argument to another function. This might sound +like a new concept, but you are already using it! Remember +my_dataframe[rows, cols] will print to screen your data frame with the +number of rows and columns you asked for (although you might have asked +for a range or named columns for example). How would you get the last +row if you don’t know how many rows your data frame has? R has a +function for this. What about getting a (pseudorandom) sample? R also +has a function for this.


R +

+gapminder[sample(nrow(gapminder), 5), ]
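Because sample() is pseudorandom, each run picks different rows. If you want the same rows every time, for example so that a colleague sees what you saw, one option is to fix the seed first. The sketch below uses a made-up table and an arbitrary seed value (both are assumptions for illustration).

```r
demo <- data.frame(id = 1:100)

set.seed(42)                              # any fixed number works
first_draw <- sample(nrow(demo), 5)

set.seed(42)                              # same seed, same draw
second_draw <- sample(nrow(demo), 5)

identical(first_draw, second_draw)        # TRUE
demo[first_draw, , drop = FALSE]          # drop = FALSE keeps the one-column result a data frame
```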

To make sure our analysis is reproducible, we should put the code +into a script file so we can come back to it later.

+ +

Challenge 3 +


Go to file -> new file -> R script, and write an R script to +load in the gapminder dataset. Put it in the scripts/ +directory and add it to version control.


Run the script using the source function, using the file +path as its argument (or by pressing the “source” button in +RStudio).

+ +

The source function can be used to use a script within a +script. Assume you would like to load the same type of file over and +over again and therefore you need to specify the arguments to fit the +needs of your file. Instead of writing the necessary argument again and +again you could just write it once and save it as a script. Then, you +can use source("Your_Script_containing_the_load_function") +in a new script to use the function of that script without writing +everything again. Check out ?source to find out more.


R +

+download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
+gapminder <- read.csv(file = "data/gapminder_data.csv")

To run the script and load the data into the gapminder +variable:


R +

+source(file = "scripts/load-gapminder.R")
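To see what source() does without touching your project files, here is a small sketch that writes a two-line script to a temporary file and then runs it. The file location and the object names demo_data and demo_rows are made up for illustration.

```r
# Write a tiny script to a temporary file, then run it with source().
script_path <- tempfile(fileext = ".R")   # hypothetical script location
writeLines(c(
  "demo_data <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.7))",
  "demo_rows <- nrow(demo_data)"
), script_path)

source(script_path)   # runs every line of the script, in order

demo_rows             # objects created by the script are now available
```

By default source() evaluates the script in the global environment, which is why demo_data and demo_rows can be used afterwards.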
+ +

Challenge 4 +


Read the output of str(gapminder) again; this time, use +what you’ve learned about lists and vectors, as well as the output of +functions like colnames and dim to explain +what everything that str prints out for gapminder means. If +there are any parts you can’t interpret, discuss with your +neighbors!

+ +

The object gapminder is a data frame with columns

  • +country and continent are character +strings.
  • +
  • +year is an integer vector.
  • +
  • +pop, lifeExp, and gdpPercap +are numeric vectors.
  • +
+ +

Keypoints +

  • Use cbind() to add a new column to a data frame.
  • +
  • Use rbind() to add a new row to a data frame.
  • +
  • Remove rows from a data frame.
  • +
  • Use str(), summary(), nrow(), +ncol(), dim(), colnames(), +rownames(), head(), and typeof() +to understand the structure of a data frame.
  • +
  • Read in a csv file using read.csv().
  • +
  • Understand what length() of a data frame +represents.
  • +
+ + +
+ + + diff --git a/instructor/06-data-subsetting.html b/instructor/06-data-subsetting.html new file mode 100644 index 000000000..6496f90b1 --- /dev/null +++ b/instructor/06-data-subsetting.html @@ -0,0 +1,1992 @@ + +R for Reproducible Scientific Analysis: Subsetting Data +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Subsetting Data


Last updated on 2023-10-26

+ + + +

Estimated time 50 minutes

+ +
+ +
+ + + +




  • How can I work with subsets of data in R?
  • +


  • To be able to subset vectors, factors, matrices, lists, and data +frames
  • +
  • To be able to extract individual and multiple elements: by index, by +name, using comparison operations
  • +
  • To be able to skip and remove elements from various data +structures.
  • +

R has many powerful subset operators. Mastering them will allow you +to easily perform complex operations on any kind of dataset.


There are six different ways we can subset any kind of object, and +three different subsetting operators for the different data +structures.


Let’s start with the workhorse of R: a simple numeric vector.


R +

+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')
+x


  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 
+ +

Atomic vectors +


In R, simple vectors containing character strings, numbers, or +logical values are called atomic vectors because they can’t be +further simplified.


So now that we’ve created a dummy vector to play with, how do we get +at its contents?


Accessing elements using their indices +


To extract elements of a vector we can give their corresponding +index, starting from one:


R +

+x[1]


  a 
+5.4 

R +

+x[4]


  d 
+4.8 

It may look different, but the square brackets operator is a +function. For vectors (and matrices), it means “get me the nth +element”.


We can ask for multiple elements at once:


R +

+x[c(1, 3)]


  a   c 
+5.4 7.1 

Or slices of the vector:


R +

+x[1:4]


  a   b   c   d 
+5.4 6.2 7.1 4.8 

The : operator creates a sequence of numbers from the left element to the right.


R +

+1:4


[1] 1 2 3 4

R +

+c(1, 2, 3, 4)


[1] 1 2 3 4

We can ask for the same element multiple times:


R +

+x[c(1, 1, 3)]


  a   a   c 
+5.4 5.4 7.1 

If we ask for an index beyond the length of the vector, R will return +a missing value:


R +

+x[6]


<NA> 
+  NA 

This is a vector of length one containing an NA, whose +name is also NA.


If we ask for the 0th element, we get an empty vector:


R +

+x[0]


named numeric(0)
+ +

Vector numbering in R starts at 1 +


In many programming languages (C and Python, for example), the first +element of a vector has an index of 0. In R, the first element is 1.


Skipping and removing elements +


If we use a negative number as the index of a vector, R will return +every element except for the one specified:


R +

+x[-2]


  a   c   d   e 
+5.4 7.1 4.8 7.5 

We can skip multiple elements:


R +

+x[c(-1, -5)]  # or x[-c(1,5)]


  b   c   d 
+6.2 7.1 4.8 
+ +

Tip: Order of operations +


A common trip up for novices occurs when trying to skip slices of a +vector. It’s natural to try to negate a sequence like so:


R +

+x[-1:3]

This gives a somewhat cryptic error:



Error in x[-1:3]: only 0's may be mixed with negative subscripts

But remember the order of operations. : is really a +function. It takes its first argument as -1, and its second as 3, so +generates the sequence of numbers: c(-1, 0, 1, 2, 3).


The correct solution is to wrap that function call in brackets, so +that the - operator applies to the result:


R +

+x[-(1:3)]


  d   e 
+4.8 7.5 
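The difference between the two forms is easiest to see by printing the index vectors themselves. This is only an illustrative sketch; the name dropped is made up.

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

-1:3        # unary minus is applied before ':', so this is (-1):3
-(1:3)      # parentheses build 1:3 first, then negate every element

dropped <- x[-(1:3)]   # drops the first three elements, keeping d and e
dropped
```

The first index vector mixes a negative value with positive ones, which is what triggers the error; the second contains only negative values.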

To remove elements from a vector, we need to assign the result back +into the variable:


R +

+x <- x[-4]
+x


  a   b   c   e 
+5.4 6.2 7.1 7.5 
+ +

Challenge 1 +


Given the following code:


R +

+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')
+print(x)


  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 

Come up with at least 2 different commands that will produce the +following output:



  b   c   d 
+6.2 7.1 4.8 

After you find 2 different commands, compare notes with your +neighbour. Did you have different strategies?

+ +

R +

+x[2:4]


  b   c   d 
+6.2 7.1 4.8 

R +

+x[-c(1,5)]


  b   c   d 
+6.2 7.1 4.8 

R +

+x[c(2,3,4)]


  b   c   d 
+6.2 7.1 4.8 

Subsetting by name +


We can extract elements by using their name, instead of extracting by +index:


R +

+x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we can name a vector 'on the fly'
+x[c("a", "c")]


  a   c 
+5.4 7.1 

This is usually a much more reliable way to subset objects: the +position of various elements can often change when chaining together +subsetting operations, but the names will always remain the same!
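A short sketch of why names are more robust than positions: after removing an element, every later position shifts, but each name still points at the same value. The name y is made up for illustration.

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

y <- x[-1]    # drop the first element; everything after it shifts left

x[3]          # position 3 of the original vector is element "c"
y[3]          # position 3 now points at element "d"
y["c"]        # the name still finds the same value as before
```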


Subsetting through other logical operations +


We can also use any logical vector to subset:


R +

+x[c(FALSE, FALSE, TRUE, FALSE, TRUE)]


  c   e 
+7.1 7.5 

Since comparison operators (e.g. >, +<, ==) evaluate to logical vectors, we can +also use them to succinctly subset vectors: the following statement +gives the same result as the previous one.


R +

+x[x > 7]


  c   e 
+7.1 7.5 

Breaking it down, this statement first evaluates x>7, +generating a logical vector +c(FALSE, FALSE, TRUE, FALSE, TRUE), and then selects the +elements of x corresponding to the TRUE +values.
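The two steps can also be carried out separately, which is a handy way to debug a subsetting expression. This sketch stores the logical vector under the made-up name mask and then inspects it with which() and sum().

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

mask <- x > 7        # step 1: a logical vector, one value per element
mask

x[mask]              # step 2: keep the elements where mask is TRUE

which(mask)          # positions (with names) of the TRUE values
sum(mask)            # TRUE counts as 1, so this counts the matches
```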


We can use == to mimic the previous method of indexing +by name (remember you have to use == rather than += for comparisons):


R +

+x[names(x) == "a"]


  a 
+5.4 
+ +

Tip: Combining logical conditions +


We often want to combine multiple logical criteria. For example, we +might want to find all the countries that are located in Asia +or Europe and have life expectancies +within a certain range. Several operations for combining logical vectors +exist in R:

  • +&, the “logical AND” operator: returns +TRUE if both the left and right are TRUE.
  • +
  • +|, the “logical OR” operator: returns +TRUE, if either the left or right (or both) are +TRUE.
  • +

You may sometimes see && and || instead of & and |. These two-character operators work on single values only: in R versions before 4.3.0 they looked at the first element of each vector and ignored the remaining elements, and from R 4.3.0 onward they raise an error when given a vector longer than one. In general you should not use the two-character operators in data analysis; save them for programming, i.e. deciding whether to execute a statement.

  • +!, the “logical NOT” operator: converts TRUE to FALSE and FALSE to TRUE. It can negate a single logical condition (e.g. !TRUE becomes FALSE), or a whole vector of conditions (e.g. !c(TRUE, FALSE) becomes c(FALSE, TRUE)).
  • +

Additionally, you can compare the elements within a single vector +using the all function (which returns TRUE if +every element of the vector is TRUE) and the +any function (which returns TRUE if one or +more elements of the vector are TRUE).
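As a rough illustration of these operators on the example vector used in this episode (the names in_range, extreme, and not_low are made up):

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

in_range <- x > 5 & x < 7      # element-wise AND
extreme  <- x < 5 | x > 7      # element-wise OR
not_low  <- !(x < 5)           # element-wise NOT

x[in_range]                    # elements a and b
x[extreme]                     # elements c, d, and e
x[not_low]                     # everything except d

any(x > 7)                     # at least one element exceeds 7
all(x > 5)                     # not every element exceeds 5 (d is 4.8)
```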

+ +

Challenge 2 +


Given the following code:


R +

+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')
+print(x)


  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 

Write a subsetting command to return the values in x that are greater +than 4 and less than 7.

+ +

R +

+x_subset <- x[x<7 & x>4]
+print(x_subset)


  a   b   d 
+5.4 6.2 4.8 
+ +

Tip: Non-unique names +


You should be aware that it is possible for multiple elements in a +vector to have the same name. (For a data frame, columns can have the +same name — although R tries to avoid this — but row names must be +unique.) Consider these examples:


R +

+x <- 1:3
+x


[1] 1 2 3

R +

+names(x) <- c('a', 'a', 'a')
+x


a a a 
+1 2 3 

R +

+x['a']  # only returns first value


a 
+1 

R +

+x[names(x) == 'a']  # returns all three values


a a a 
+1 2 3 
+ +

Tip: Getting help for operators +


Remember you can search for help on operators by wrapping them in +quotes: help("%in%") or ?"%in%".


Skipping named elements +


Skipping or removing named elements is a little harder. If we try to +skip one named element by negating the string, R complains (slightly +obscurely) that it doesn’t know how to take the negative of a +string:


R +

+x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we start again by naming a vector 'on the fly'
+x[-"a"]


Error in -"a": invalid argument to unary operator

However, we can use the != (not-equals) operator to +construct a logical vector that will do what we want:


R +

+x[names(x) != "a"]


  b   c   d   e 
+6.2 7.1 4.8 7.5 

Skipping multiple named indices is a little bit harder still. Suppose +we want to drop the "a" and "c" elements, so +we try this:


R +

+x[names(x) != c("a", "c")]


Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length


  b   c   d   e 
+6.2 7.1 4.8 7.5 

R did something, but it gave us a warning that we ought to +pay attention to - and it apparently gave us the wrong answer +(the "c" element is still included in the vector)!


So what does != actually do in this case? That’s an +excellent question.




Let’s take a look at the comparison component of this code:


R +

+names(x) != c("a", "c")


Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length


[1] FALSE  TRUE  TRUE  TRUE  TRUE

Why does R give TRUE as the third element of this +vector, when names(x)[3] != "c" is obviously false? When +you use !=, R tries to compare each element of the left +argument with the corresponding element of its right argument. What +happens when you compare vectors of different lengths?

[Figure: Inequality testing]

When one vector is shorter than the other, it gets +recycled:

[Figure: Inequality testing: results of recycling]

In this case R repeats c("a", "c") as many times as necessary to match names(x), i.e. we get c("a","c","a","c","a"). Since the recycled "a" doesn’t match the third element of names(x), the value of != is TRUE. Because in this case the longer vector length (5) isn’t a multiple of the shorter vector length (2), R printed a warning message. If we had been unlucky and names(x) had contained six elements, R would silently have done the wrong thing (i.e., not what we intended it to do). This recycling rule can introduce subtle, hard-to-find bugs!
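The silent case is easy to reproduce. The sketch below uses a made-up six-element vector x6, so the length of names(x6) is a multiple of two and no warning appears, even though the result is not what was intended.

```r
x6 <- c(a = 1, b = 2, c = 3, d = 4, e = 5, f = 6)   # hypothetical six-element vector

# c("a", "c") is recycled to a, c, a, c, a, c: length 6 is a multiple of 2,
# so R gives no warning.
wrong <- names(x6) != c("a", "c")
x6[wrong]            # "c" is still present

# %in% tests each name against the whole set instead.
right <- !(names(x6) %in% c("a", "c"))
x6[right]            # both "a" and "c" are gone
```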


The way to get R to do what we really want (match each element of the left argument with all of the elements of the right argument) is to use the %in% operator. The %in% operator goes through each element of its left argument, in this case the names of x, and asks, “Does this element occur in the second argument?”. Here, since we want to exclude values, we also need a ! operator to change “in” to “not in”:


R +

+x[! names(x) %in% c("a","c") ]


  b   d   e 
+6.2 4.8 7.5 
+ +

Challenge 3 +


Selecting elements of a vector that match any of a list of components +is a very common data analysis task. For example, the gapminder data set +contains country and continent variables, but +no information between these two scales. Suppose we want to pull out +information from southeast Asia: how do we set up an operation to +produce a logical vector that is TRUE for all of the +countries in southeast Asia and FALSE otherwise?


Suppose you have these data:


R +

+seAsia <- c("Myanmar","Thailand","Cambodia","Vietnam","Laos")
+## read in the gapminder data that we downloaded in episode 2
+gapminder <- read.csv("data/gapminder_data.csv", header=TRUE)
+## extract the `country` column from a data frame (we'll see this later);
+## convert from a factor to a character;
+## and get just the non-repeated elements
+countries <- unique(as.character(gapminder$country))

There’s a wrong way (using only ==), which will give you +a warning; a clunky way (using the logical operators == and +|); and an elegant way (using %in%). See +whether you can come up with all three and explain how they (don’t) +work.

+ +
  • The wrong way to do this problem is countries==seAsia. This gives a warning ("In countries == seAsia : longer object length is not a multiple of shorter object length") and the wrong answer (a vector of all FALSE values), because none of the recycled values of seAsia happen to line up correctly with matching values in countries.
  • +
  • The clunky (but technically correct) way to do this +problem is
  • +

R +

+ (countries=="Myanmar" | countries=="Thailand" |
+ countries=="Cambodia" | countries == "Vietnam" | countries=="Laos")

(or countries==seAsia[1] | countries==seAsia[2] | ...). +This gives the correct values, but hopefully you can see how awkward it +is (what if we wanted to select countries from a much longer list?).

  • The best way to do this problem is +countries %in% seAsia, which is both correct and easy to +type (and read).
  • +

Handling special values +


At some point you will encounter functions in R that cannot handle +missing, infinite, or undefined data.


There are a number of special functions you can use to filter out +this data:

  • +is.na will return all positions in a vector, matrix, or +data.frame containing NA (or NaN)
  • +
  • likewise, is.nan, and is.infinite will do +the same for NaN and Inf.
  • +
  • +is.finite will return all positions in a vector, +matrix, or data.frame that do not contain NA, +NaN or Inf.
  • +
  • +na.omit will filter out all missing values from a +vector
  • +
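The functions listed above can be compared side by side on a small made-up vector v that contains each kind of special value. This is only a sketch of how they differ.

```r
v <- c(1.5, NA, Inf, NaN, -2, -Inf)   # made-up values covering each special case

is.na(v)         # TRUE for NA and NaN
is.nan(v)        # TRUE only for NaN
is.infinite(v)   # TRUE for Inf and -Inf
is.finite(v)     # TRUE only for ordinary numbers

v[is.finite(v)]              # keep the ordinary numbers
as.vector(na.omit(v))        # drops NA and NaN, but keeps Inf and -Inf
mean(v[is.finite(v)])        # a summary computed on the cleaned values
```

Note that na.omit() alone does not remove infinite values, so is.finite() is the stricter filter.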

Factor subsetting +


Now that we’ve explored the different ways to subset vectors, how do +we subset the other data structures?


Factor subsetting works the same way as vector subsetting.


R +

+f <- factor(c("a", "a", "b", "c", "c", "d"))
+f[f == "a"]


[1] a a
+Levels: a b c d

R +

+f[f %in% c("b", "c")]


[1] b c c
+Levels: a b c d

R +

+f[1:3]


[1] a a b
+Levels: a b c d

Skipping elements will not remove the level even if no more of that +category exists in the factor:


R +

+f[-3]


[1] a a c c d
+Levels: a b c d
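If the unused level gets in the way (for example in tables or plots), base R's droplevels() removes it. A minimal sketch, with g and h as made-up names:

```r
f <- factor(c("a", "a", "b", "c", "c", "d"))

g <- f[-3]            # the only "b" is gone, but the level is kept
levels(g)
table(g)              # "b" shows up with a count of zero

h <- droplevels(g)    # remove levels that no longer occur
levels(h)
```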

Matrix subsetting +


Matrices are also subsetted using the [ function. In +this case it takes two arguments: the first applying to the rows, the +second to its columns:


R +

+set.seed(1)
+m <- matrix(rnorm(6*4), ncol=4, nrow=6)
+m[3:4, c(3,1)]


            [,1]       [,2]
+[1,]  1.12493092 -0.8356286
+[2,] -0.04493361  1.5952808

You can leave the first or second arguments blank to retrieve all the +rows or columns respectively:


R +

+m[, c(3,4)]


            [,1]        [,2]
+[1,] -0.62124058  0.82122120
+[2,] -2.21469989  0.59390132
+[3,]  1.12493092  0.91897737
+[4,] -0.04493361  0.78213630
+[5,] -0.01619026  0.07456498
+[6,]  0.94383621 -1.98935170

If we only access one row or column, R will automatically convert the +result to a vector:


R +

+m[3,]


[1] -0.8356286  0.5757814  1.1249309  0.9189774

If you want to keep the output as a matrix, you need to specify a +third argument; drop = FALSE:


R +

+m[3, , drop=FALSE]


           [,1]      [,2]     [,3]      [,4]
+[1,] -0.8356286 0.5757814 1.124931 0.9189774

Unlike vectors, if we try to access a row or column outside of the +matrix, R will throw an error:


R +

+m[, c(3,6)]


Error in m[, c(3, 6)]: subscript out of bounds
+ +

Tip: Higher dimensional arrays +


When dealing with multi-dimensional arrays, each argument to [ corresponds to a dimension. For example, for a 3D array, the first three arguments correspond to the rows, columns, and depth dimension.
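As a small sketch of subsetting a 3D array (the array arr and its dimensions are made up for illustration):

```r
arr <- array(1:24, dim = c(2, 3, 4))   # 2 rows, 3 columns, 4 layers

arr[1, 2, 3]      # row 1, column 2, layer 3
arr[, , 1]        # leaving arguments blank keeps those dimensions whole
dim(arr[, 2, ])   # one column from every layer: a 2 x 4 matrix
```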


Because matrices are vectors, we can also subset using only one +argument:


R +

+m[5]


[1] 0.3295078

This usually isn’t useful, and is often confusing to read. However, it is useful to note that matrices are laid out in column-major format by default. That is, the elements of the vector are arranged column-wise:


R +

+matrix(1:6, nrow=2, ncol=3)


     [,1] [,2] [,3]
+[1,]    1    3    5
+[2,]    2    4    6

If you wish to populate the matrix by row, use +byrow=TRUE:


R +

+matrix(1:6, nrow=2, ncol=3, byrow=TRUE)


     [,1] [,2] [,3]
+[1,]    1    2    3
+[2,]    4    5    6

Matrices can also be subsetted using their rownames and column names +instead of their row and column indices.
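A brief sketch of name-based matrix subsetting, assuming a small matrix m_named whose row and column names are invented for this example:

```r
m_named <- matrix(1:6, nrow = 2, ncol = 3,
                  dimnames = list(c("r1", "r2"), c("x", "y", "z")))  # made-up names

m_named["r2", "y"]                # a single cell
m_named["r1", c("x", "z")]        # one row, two columns: a named vector
m_named[, "z", drop = FALSE]      # keep the matrix shape for a single column
```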

+ +

Challenge 4 +


Given the following code:


R +

+m <- matrix(1:18, nrow=3, ncol=6)
+print(m)


     [,1] [,2] [,3] [,4] [,5] [,6]
+[1,]    1    4    7   10   13   16
+[2,]    2    5    8   11   14   17
+[3,]    3    6    9   12   15   18
  1. Which of the following commands will extract the values 11 and 14?

A. m[2,4,2,5]


B. m[2:5]


C. m[4:5,2]


D. m[2,c(4,5)]

+ +

D


List subsetting +


Now we’ll introduce some new subsetting operators. There are three +functions used to subset lists. We’ve already seen these when learning +about atomic vectors and matrices: [, [[, and +$.


Using [ will always return a list. If you want to +subset a list, but not extract an element, then you +will likely use [.


R +

+xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))
+xlist[1]


$a
+[1] "Software Carpentry"

This returns a list with one element.


We can subset elements of a list exactly the same way as atomic vectors using [. Comparison operations, however, won’t work because they are not recursive: they will try to condition on the data structures in each element of the list, not the individual elements within those data structures.


R +

+xlist[1:2]


$a
+[1] "Software Carpentry"
+
+$b
+ [1]  1  2  3  4  5  6  7  8  9 10

To extract individual elements of a list, you need to use the +double-square bracket function: [[.


R +

+xlist[[1]]


[1] "Software Carpentry"

Notice that now the result is a vector, not a list.
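One way to check the difference between [ and [[ is to ask R for the class and length of each result. This sketch reuses the xlist object defined above.

```r
xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))

class(xlist[1])       # "list": a shorter list holding element a
class(xlist[[1]])     # "character": the element itself
length(xlist[2])      # 1: a list with one element
length(xlist[[2]])    # 10: the integer vector stored in b

# Comparisons apply to an extracted vector, not to the list as a whole.
xlist[["b"]][xlist[["b"]] > 7]
```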


You can’t extract more than one element at once:


R +

+xlist[[1:2]]


Error in xlist[[1:2]]: subscript out of bounds

Nor use it to skip elements:


R +

+xlist[[-1]]


Error in xlist[[-1]]: invalid negative subscript in get1index <real>

But you can use names to both subset and extract elements:


R +

+xlist[["a"]]


[1] "Software Carpentry"

The $ function is a shorthand way for extracting +elements by name:


R +

+xlist$data


                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
+Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
+Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
+Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
+Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
+Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
+Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
+ +

Challenge 5 +


Given the following list:


R +

+xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))

Using your knowledge of both list and vector subsetting, extract the +number 2 from xlist. Hint: the number 2 is contained within the “b” item +in the list.

+ +

R +

+xlist$b[2]


[1] 2

R +

+xlist[[2]][2]


[1] 2

R +

+xlist[["b"]][2]


[1] 2
+ +

Challenge 6 +


Given a linear model:


R +

+mod <- aov(pop ~ lifeExp, data=gapminder)

Extract the residual degrees of freedom (hint: +attributes() will help you)

+ +

R +

+attributes(mod) ## `df.residual` is one of the names of `mod`

R +

+mod$df.residual

Data frames +


Remember that data frames are lists under the hood, so similar rules apply. However, they are also two-dimensional objects:


[ with one argument will act the same way as for lists, +where each list element corresponds to a column. The resulting object +will be a data frame:


R +

+head(gapminder[3])


       pop
+1  8425333
+2  9240934
+3 10267083
+4 11537966
+5 13079460
+6 14880372

Similarly, [[ will act to extract a single +column:


R +

+head(gapminder[["lifeExp"]])


[1] 28.801 30.332 31.997 34.020 36.088 38.438

And $ provides a convenient shorthand to extract columns +by name:


R +

+head(gapminder$year)


[1] 1952 1957 1962 1967 1972 1977

With two arguments, [ behaves the same way as for +matrices:


R +

+gapminder[1:3,]


      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007

If we subset a single row, the result will be a data frame (because +the elements are mixed types):


R +

+gapminder[3,]


      country year      pop continent lifeExp gdpPercap
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007

But for a single column the result will be a vector (this can be +changed with the third argument, drop = FALSE).
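The effect of drop = FALSE is easiest to see by checking the class of each result. The data frame df below is a small made-up stand-in for gapminder.

```r
df <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.7))   # small made-up data frame

class(df[, "score"])                  # "numeric": a single column is simplified
class(df[, "score", drop = FALSE])    # "data.frame": simplification is switched off
class(df["score"])                    # "data.frame": list-style subsetting keeps the frame
class(df[2, ])                        # "data.frame": a single row is not simplified
```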

+ +

Challenge 7 +


Fix each of the following common data frame subsetting errors:

  1. Extract observations collected for the year 1957

R +

+gapminder[gapminder$year = 1957,]

  2. Extract all columns except 1 through to 4

R +

+gapminder[,-1:4]

  3. Extract the rows where the life expectancy is longer than 80 years

R +

+gapminder[gapminder$lifeExp > 80]

  4. Extract the first row, and the fourth and fifth columns (continent and lifeExp).

R +

+gapminder[1, 4, 5]

  5. Advanced: extract rows that contain information for the years 2002 and 2007

R +

+gapminder[gapminder$year == 2002 | 2007,]
+ +

Fix each of the following common data frame subsetting errors:

  1. Extract observations collected for the year 1957

R +

+# gapminder[gapminder$year = 1957,]
+gapminder[gapminder$year == 1957,]

  2. Extract all columns except 1 through to 4

R +

+# gapminder[,-1:4]
+gapminder[,-(1:4)]

  3. Extract the rows where the life expectancy is longer than 80 years

R +

+# gapminder[gapminder$lifeExp > 80]
+gapminder[gapminder$lifeExp > 80,]

  4. Extract the first row, and the fourth and fifth columns (continent and lifeExp).

R +

+# gapminder[1, 4, 5]
+gapminder[1, c(4, 5)]

  5. Advanced: extract rows that contain information for the years 2002 and 2007

R +

+# gapminder[gapminder$year == 2002 | 2007,]
+gapminder[gapminder$year == 2002 | gapminder$year == 2007,]
+gapminder[gapminder$year %in% c(2002, 2007),]
+ +

Challenge 8 +

  1. Why does gapminder[1:20] return an error? How does it differ from gapminder[1:20, ]?

  2. Create a new data.frame called gapminder_small that only contains rows 1 through 9 and 19 through 23. You can do this in one or two steps.

+ +
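  1. With a single argument, [ treats a data frame as a list of columns, so gapminder[1:20] asks for columns 1 through 20. gapminder has only six columns, hence the error. gapminder[1:20, ] subsets on two dimensions and gives the first 20 rows and all columns.

  2. One way to build gapminder_small in a single step: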

R +

+gapminder_small <- gapminder[c(1:9, 19:23),]
+ +

Keypoints +

  • Indexing in R starts at 1, not 0.
  • +
  • Access individual values by location using [].
  • +
  • Access slices of data using [low:high].
  • +
  • Access arbitrary sets of data using [c(...)].
  • +
  • Use logical operations and logical vectors to access subsets of +data.
  • +
+ + +
+ + + diff --git a/instructor/07-control-flow.html b/instructor/07-control-flow.html new file mode 100644 index 000000000..626f3d683 --- /dev/null +++ b/instructor/07-control-flow.html @@ -0,0 +1,1248 @@ + +R for Reproducible Scientific Analysis: Control Flow +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Control Flow


Last updated on 2023-10-26

+ + + +

Estimated time 65 minutes

+ +
+ +
+ + + +




  • How can I make data-dependent choices in R?
  • +
  • How can I repeat operations in R?
  • +


  • Write conditional statements with if...else statements +and ifelse().
  • +
  • Write and understand for() loops.
  • +

Often when we’re coding we want to control the flow of our actions. +This can be done by setting actions to occur only if a condition or a +set of conditions are met. Alternatively, we can also set an action to +occur a particular number of times.


There are several ways you can control flow in R. For conditional +statements, the most commonly used approaches are the constructs:


R +

# if
+if (condition is true) {
+  perform action
+}
+
+# if ... else
+if (condition is true) {
+  perform action
+} else {  # that is, if the condition is false,
+  perform alternative action
+}

Say, for example, that we want R to print a message if a variable +x has a particular value:


R +

+x <- 8
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+}
+
+x


[1] 8

The print statement does not appear in the console because x is not greater than or equal to 10. To print a different message for numbers less than 10, we can add an else statement.


R +

+x <- 8
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else {
+  print("x is less than 10")
+}


[1] "x is less than 10"

You can also test multiple conditions by using +else if.


R +

+x <- 8
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else if (x > 5) {
+  print("x is greater than 5, but less than 10")
+} else {
+  print("x is less than or equal to 5")
+}


[1] "x is greater than 5, but less than 10"

Important: when R evaluates the condition inside +if() statements, it is looking for a logical element, i.e., +TRUE or FALSE. This can cause some headaches +for beginners. For example:


R +

+x  <-  4 == 3
+if (x) {
+  "4 equals 3"
+} else {
+  "4 does not equal 3"
+}


[1] "4 does not equal 3"

As we can see, the “not equal” message was printed because the vector x is FALSE:


R +

+x <- 4 == 3
+x


[1] FALSE
+ +

Challenge 1 +


Use an if() statement to print a suitable message +reporting whether there are any records from 2002 in the +gapminder dataset. Now do the same for 2012.

+ +

We will first see a solution to Challenge 1 which does not use the any() function. We first use the logical vector gapminder$year == 2002 to select the rows of gapminder that were recorded in 2002:


R +

+gapminder[(gapminder$year == 2002),]

Then, we count the number of rows of the data.frame gapminder that correspond to the year 2002:


R +

+rows2002_number <- nrow(gapminder[(gapminder$year == 2002),])

The presence of any record for the year 2002 is equivalent to the +request that rows2002_number is one or more:


R +

+rows2002_number >= 1

Putting all together, we obtain:


R +

+if(nrow(gapminder[(gapminder$year == 2002),]) >= 1){
+   print("Record(s) for the year 2002 found.")
+}

All this can be done more quickly with any(). The +logical condition can be expressed as:


R +

+if(any(gapminder$year == 2002)){
+   print("Record(s) for the year 2002 found.")
+}

Did anyone get an error message like this?



Error in if (gapminder$year == 2012) {: the condition has length > 1

The if() function only accepts conditions of length 1. In R 4.2.0 and later, passing it a longer vector is an error; earlier versions issued a warning and evaluated only the first element of the vector. Either way, to use the if() function you need to make sure your input is singular (of length 1).
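A sketch of two ways to reduce a vector comparison to a single TRUE or FALSE before handing it to if(). The vector years is a made-up stand-in for gapminder$year, and the message variables are hypothetical names.

```r
years <- c(1952, 1957, 2002, 2007)   # made-up stand-in for gapminder$year

length(years == 2002)                # 4: one TRUE/FALSE per element, too many for if()

if (any(years == 2002)) {            # any() reduces the vector to a single value
  msg_2002 <- "Record(s) for the year 2002 found."
} else {
  msg_2002 <- "No records for the year 2002."
}

if (2012 %in% years) {               # a single value on the left of %in% gives length 1
  msg_2012 <- "Record(s) for the year 2012 found."
} else {
  msg_2012 <- "No records for the year 2012."
}

msg_2002
msg_2012
```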

+ +

Tip: Built in ifelse() +function +


R accepts both if() and +else if() statements structured as outlined above, but also +statements using R’s built-in ifelse() +function. This function accepts both singular and vector inputs and is +structured as follows:


R +

# ifelse function
+ifelse(condition is true, perform action, perform alternative action)

where the first argument is the condition or a set of conditions to be met, the second argument is the value returned where the condition is TRUE, and the third argument is the value returned where the condition is FALSE.


R +

+y <- -3
+ifelse(y < 0, "y is a negative number", "y is either positive or zero")


[1] "y is a negative number"
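The example above uses a single value, but ifelse() is most useful on whole vectors. A short sketch with made-up values (sign_label and clipped are hypothetical names):

```r
y <- c(-3, 0, 4, -1.5)   # made-up values

# The condition is evaluated for every element, and the result has the same length.
sign_label <- ifelse(y < 0, "negative", "positive or zero")
sign_label

# if() would need a single TRUE or FALSE; ifelse() handles the whole vector at once.
clipped <- ifelse(y < 0, 0, y)   # replace negative values with 0
clipped
```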
+ +

Tip: any() and +all() +


The any() function will return TRUE if at +least one TRUE value is found within a vector, otherwise it +will return FALSE. This can be used in a similar way to the +%in% operator. The function all(), as the name +suggests, will only return TRUE if all values in the vector +are TRUE.


Repeating operations +


If you want to iterate over a set of values, when the order of +iteration is important, and perform the same operation on each, a +for() loop will do the job. We saw for() loops +in the shell +lessons earlier. This is the most flexible of looping operations, +but therefore also the hardest to use correctly. In general, the advice +of many R users would be to learn about for() +loops, but to avoid using for() loops unless the order of +iteration is important: i.e. the calculation at each iteration depends +on the results of previous iterations. If the order of iteration is not +important, then you should learn about vectorized alternatives, such as +the purrr package, as they pay off in computational +efficiency.
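To make the distinction concrete, here is a sketch of one calculation where each step depends on the previous one, and one where it does not. The input values are made up; cumsum() is base R.

```r
values <- c(3, 1, 4, 1, 5)   # made-up input

# Order matters here: each step uses the result of the previous one.
running <- numeric(length(values))   # result vector created up front
running[1] <- values[1]
for (i in 2:length(values)) {
  running[i] <- running[i - 1] + values[i]
}
running

cumsum(values)     # base R already provides this particular calculation

# Order does not matter here, so no loop is needed.
values * 2
```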


The basic structure of a for() loop is:


R +

for (iterator in set of values) {
+  do a thing
+}

For example:


R +

+for (i in 1:10) {
+  print(i)
+}


[1] 1
+[1] 2
+[1] 3
+[1] 4
+[1] 5
+[1] 6
+[1] 7
+[1] 8
+[1] 9
+[1] 10

The 1:10 bit creates a vector on the fly; you can +iterate over any other vector as well.


We can use a for() loop nested within another +for() loop to iterate over two things at once.


R +

+for (i in 1:5) {
+  for (j in c('a', 'b', 'c', 'd', 'e')) {
+    print(paste(i,j))
+  }
+}


[1] "1 a"
+[1] "1 b"
+[1] "1 c"
+[1] "1 d"
+[1] "1 e"
+[1] "2 a"
+[1] "2 b"
+[1] "2 c"
+[1] "2 d"
+[1] "2 e"
+[1] "3 a"
+[1] "3 b"
+[1] "3 c"
+[1] "3 d"
+[1] "3 e"
+[1] "4 a"
+[1] "4 b"
+[1] "4 c"
+[1] "4 d"
+[1] "4 e"
+[1] "5 a"
+[1] "5 b"
+[1] "5 c"
+[1] "5 d"
+[1] "5 e"

We notice in the output that when the first index (i) is +set to 1, the second index (j) iterates through its full +set of indices. Once the indices of j have been iterated +through, then i is incremented. This process continues +until the last index has been used for each for() loop.


Rather than printing the results, we could write the loop output to a +new object.


R +

+output_vector <- c()
+for (i in 1:5) {
+  for (j in c('a', 'b', 'c', 'd', 'e')) {
+    temp_output <- paste(i, j)
+    output_vector <- c(output_vector, temp_output)
+  }
+}
+output_vector


 [1] "1 a" "1 b" "1 c" "1 d" "1 e" "2 a" "2 b" "2 c" "2 d" "2 e" "3 a" "3 b"
+[13] "3 c" "3 d" "3 e" "4 a" "4 b" "4 c" "4 d" "4 e" "5 a" "5 b" "5 c" "5 d"
+[25] "5 e"

This approach can be useful, but ‘growing your results’ (building the +result object incrementally) is computationally inefficient, so avoid it +when you are iterating through a lot of values.

+ +

Tip: don’t grow your results +


One of the biggest things that trips up novices and experienced R users alike is building a results object (vector, list, matrix, data frame) as your for loop progresses. Computers are very bad at handling this, so your calculations can very quickly slow to a crawl. It’s much better to define an empty results object of appropriate dimensions beforehand, rather than initializing an empty object without dimensions. So if you know the end result will be stored in a matrix, create an empty matrix with 5 rows and 5 columns, then at each iteration store the results in the appropriate location.


A better way is to define your (empty) output object before filling +in the values. For this example, it looks more involved, but is still +more efficient.


R +

+output_matrix <- matrix(nrow = 5, ncol = 5)
+j_vector <- c('a', 'b', 'c', 'd', 'e')
+for (i in 1:5) {
+  for (j in 1:5) {
+    temp_j_value <- j_vector[j]
+    temp_output <- paste(i, temp_j_value)
+    output_matrix[i, j] <- temp_output
+  }
+}
+output_vector2 <- as.vector(output_matrix)
+output_vector2


 [1] "1 a" "2 a" "3 a" "4 a" "5 a" "1 b" "2 b" "3 b" "4 b" "5 b" "1 c" "2 c"
+[13] "3 c" "4 c" "5 c" "1 d" "2 d" "3 d" "4 d" "5 d" "1 e" "2 e" "3 e" "4 e"
+[25] "5 e"
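When the result is naturally a vector rather than a matrix, the same preallocation idea still applies. The following is a minimal sketch, not part of the original lesson: the names `n_i`, `n_j`, `k`, and `output_prealloc` are illustrative assumptions. It allocates a character vector of the final length once, then computes each slot's position from the two loop indices.

```r
j_vector <- c('a', 'b', 'c', 'd', 'e')
n_i <- 5                  # illustrative name: number of outer iterations
n_j <- length(j_vector)   # illustrative name: number of inner iterations

# Allocate the full-length result once, before the loops start
output_prealloc <- vector("character", n_i * n_j)

for (i in seq_len(n_i)) {
  for (j in seq_len(n_j)) {
    # Position of the (i, j) pair when results are stored row by row
    k <- (i - 1) * n_j + j
    output_prealloc[k] <- paste(i, j_vector[j])
  }
}
output_prealloc
```

Because the position is computed row by row, the elements come out in the same order as in the "grown" `output_vector`, with no reallocation during the loop.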
+ +

Tip: While loops +


Sometimes you will find yourself needing to repeat an operation as +long as a certain condition is met. You can do this with a +while() loop.


R +

while(this condition is true){
+  do a thing
+}

R will interpret a condition being met as “TRUE”.


As an example, here’s a while loop that generates random numbers from +a uniform distribution (the runif() function) between 0 and +1 until it gets one that’s less than 0.1.


R +

+z <- 1
+while(z > 0.1){
+  z <- runif(1)
+  cat(z, "\n")
+}

while() loops will not always be appropriate. You have +to be particularly careful that you don’t end up stuck in an infinite +loop because your condition is always met and hence the while statement +never terminates.
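One common way to guard against an infinite loop is to add an iteration limit to the condition. The sketch below is an assumption-laden illustration rather than part of the lesson: `max_iter` and `n_iter` are made-up names, and the seed is fixed only so that repeated runs behave the same way.

```r
set.seed(42)        # assumption: fixed seed so the run is repeatable
max_iter <- 1000    # illustrative safety limit
n_iter <- 0
z <- 1

# Stop when the condition is met OR when the safety limit is reached
while (z > 0.1 && n_iter < max_iter) {
  z <- runif(1)
  n_iter <- n_iter + 1
}

if (z > 0.1) {
  cat("Stopped after", max_iter, "iterations without meeting the condition\n")
} else {
  cat("Found", z, "after", n_iter, "iterations\n")
}
```

Checking afterwards which of the two conditions ended the loop makes it clear whether the result can be trusted.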

+ +

Challenge 2 +


Compare the objects output_vector and +output_vector2. Are they the same? If not, why not? How +would you change the last block of code to make +output_vector2 the same as output_vector?

+ +

We can check whether the two vectors are identical using the +all() function:


R +

+all(output_vector == output_vector2)

However, all the elements of output_vector can be found +in output_vector2:


R +

+all(output_vector %in% output_vector2)

and vice versa:


R +

+all(output_vector2 %in% output_vector)

Therefore, the elements in output_vector and output_vector2 are just sorted in a different order. This is because as.vector() outputs the elements of an input matrix column by column. Taking a look at output_matrix, we can see that we want its elements row by row. The solution is to transpose output_matrix. We can do so either by calling the transpose function t() or by inputting the elements in the right order. The first solution requires changing the original assignment (first block below) to the transposed version (second block below):


R +

+output_vector2 <- as.vector(output_matrix)



R +

+output_vector2 <- as.vector(t(output_matrix))

The second solution requires changing the assignment inside the loop from the first form below to the second:


R +

+output_matrix[i, j] <- temp_output



R +

+output_matrix[j, i] <- temp_output
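The column-by-column behaviour of as.vector() is easier to see on a smaller matrix. This is a minimal sketch with a made-up 2 x 3 matrix `m` (an illustrative object, not one from the lesson):

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
m

as.vector(m)       # column by column: 1 2 3 4 5 6
as.vector(t(m))    # row by row:       1 3 5 2 4 6
```

Transposing first swaps rows and columns, so reading the transposed matrix column by column is the same as reading the original row by row.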
+ +

Challenge 3 +


Write a script that loops through the gapminder data by +continent and prints out whether the mean life expectancy is smaller or +larger than 50 years.

+ +

Step 1: We want to make sure we can extract all the +unique values of the continent vector


R +

+gapminder <- read.csv("data/gapminder_data.csv")
+unique(gapminder$continent)

Step 2: We also need to loop over each of these +continents and calculate the average life expectancy for each +subset of data. We can do that as follows:

  1. Loop over each of the unique values of ‘continent’
  2. For each value of continent, create a temporary variable storing that subset
  3. Return the calculated life expectancy to the user by printing the output:

R +

+for (iContinent in unique(gapminder$continent)) {
+  tmp <- gapminder[gapminder$continent == iContinent, ]
+  cat(iContinent, mean(tmp$lifeExp, na.rm = TRUE), "\n")
+  rm(tmp)
+}

Step 3: The exercise only wants the output printed +if the average life expectancy is less than 50 or greater than 50. So we +need to add an if() condition before printing, which +evaluates whether the calculated average life expectancy is above or +below a threshold, and prints an output conditional on the result. We +need to amend (3) from above:


3a. If the calculated life expectancy is less than some threshold (50 +years), return the continent and a statement that life expectancy is +less than threshold, otherwise return the continent and a statement that +life expectancy is greater than threshold:


R +

+thresholdValue <- 50
+for (iContinent in unique(gapminder$continent)) {
+   tmp <- mean(gapminder[gapminder$continent == iContinent, "lifeExp"])
+   if (tmp < thresholdValue){
+       cat("Average Life Expectancy in", iContinent, "is less than", thresholdValue, "\n")
+   } else {
+       cat("Average Life Expectancy in", iContinent, "is greater than", thresholdValue, "\n")
+   } # end if else condition
+   rm(tmp)
+} # end for loop
+ +

Challenge 4 +


Modify the script from Challenge 3 to loop over each country. This +time print out whether the life expectancy is smaller than 50, between +50 and 70, or greater than 70.

+ +

We modify our solution to Challenge 3 by now adding two thresholds, +lowerThreshold and upperThreshold and +extending our if-else statements:


R +

+ lowerThreshold <- 50
+ upperThreshold <- 70
+for (iCountry in unique(gapminder$country)) {
+    tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"])
+    if(tmp < lowerThreshold) {
+        cat("Average Life Expectancy in", iCountry, "is less than", lowerThreshold, "\n")
+    } else if(tmp < upperThreshold) {  # tmp >= lowerThreshold is already known here
+        cat("Average Life Expectancy in", iCountry, "is between", lowerThreshold, "and", upperThreshold, "\n")
+    } else {
+        cat("Average Life Expectancy in", iCountry, "is greater than", upperThreshold, "\n")
+    }
+    rm(tmp)
+}
+ +

Challenge 5 - Advanced +


Write a script that loops over each country in the +gapminder dataset, tests whether the country starts with a +‘B’, and graphs life expectancy against time as a line graph if the mean +life expectancy is under 50 years.

+ +

We will use the grep() command that was introduced in the Unix Shell lesson to find countries that start with “B.” Let’s understand how to do this first. Following from the Unix shell section we may be tempted to try the following:


R +

+grep("^B", unique(gapminder$country))

But when we evaluate this command it returns the indices of the +factor variable country that start with “B.” To get the +values, we must add the value=TRUE option to the +grep() command:


R +

+grep("^B", unique(gapminder$country), value = TRUE)
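The difference between indices and values is easy to check on a small character vector. In this sketch, `countries` is a made-up vector used only for illustration; `grepl()` and `startsWith()` are base R functions shown as alternatives, not something the lesson itself uses.

```r
countries <- c("Belgium", "Albania", "Brazil", "Cuba")  # illustrative vector

grep("^B", countries)                  # positions of the matches: 1 3
grep("^B", countries, value = TRUE)    # the matching strings themselves
grepl("^B", countries)                 # one TRUE/FALSE per element
countries[startsWith(countries, "B")]  # same values without a regular expression
```

The logical vector from `grepl()` is handy when the result is needed for subsetting a data frame rather than for listing names.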

We will now store these countries in a variable called +candidateCountries, and then loop over each entry in the variable. +Inside the loop, we evaluate the average life expectancy for each +country, and if the average life expectancy is less than 50 we use +base-plot to plot the evolution of average life expectancy using +with() and subset():


R +

+thresholdValue <- 50
+candidateCountries <- grep("^B", unique(gapminder$country), value = TRUE)
+for (iCountry in candidateCountries) {
+    tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"])
+    if (tmp < thresholdValue) {
+        cat("Average Life Expectancy in", iCountry, "is less than", thresholdValue, "plotting life expectancy graph... \n")
+        with(subset(gapminder, country == iCountry),
+                plot(year, lifeExp,
+                     type = "o",
+                     main = paste("Life Expectancy in", iCountry, "over time"),
+                     ylab = "Life Expectancy",
+                     xlab = "Year"
+                     ) # end plot
+             ) # end with
+    } # end if
+    rm(tmp)
+} # end for loop
+ +

Keypoints +

  • Use if and else to make choices.
  • Use for to repeat operations.
+ + +
+ + + diff --git a/instructor/08-plot-ggplot2.html b/instructor/08-plot-ggplot2.html new file mode 100644 index 000000000..d82021e2e --- /dev/null +++ b/instructor/08-plot-ggplot2.html @@ -0,0 +1,1106 @@ + +R for Reproducible Scientific Analysis: Creating Publication-Quality Graphics with ggplot2 +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Creating Publication-Quality Graphics with ggplot2


Last updated on 2023-10-26

+ + + +

Estimated time 80 minutes

+ +
+ +
+ + + +




  • How can I create publication-quality graphics in R?
  • +


  • To be able to use ggplot2 to generate publication-quality graphics.
  • To apply geometry, aesthetic, and statistics layers to a ggplot plot.
  • To manipulate the aesthetics of a plot using different colors, shapes, and lines.
  • To improve data visualization through transforming scales and paneling by group.
  • To save a plot created with ggplot to disk.

Plotting our data is one of the best ways to quickly explore it and +the various relationships between variables.


There are three main plotting systems in R, the base plotting +system, the lattice +package, and the ggplot2 +package.


Today we’ll be learning about the ggplot2 package, because it is the +most effective for creating publication-quality graphics.


ggplot2 is built on the grammar of graphics, the idea that any plot +can be built from the same set of components: a data +set, mapping aesthetics, and graphical +layers:

  • Data sets are the data that you, the user, provide.
  • Mapping aesthetics are what connect the data to the graphics. They tell ggplot2 how to use your data to affect how the graph looks, such as changing what is plotted on the X or Y axis, or the size or color of different data points.
  • Layers are the actual graphical output from ggplot2. Layers determine what kinds of plot are shown (scatterplot, histogram, etc.), the coordinate system used (rectangular, polar, others), and other important aspects of the plot. The idea of layers of graphics may be familiar to you if you have used image editing programs like Photoshop, Illustrator, or Inkscape.

Let’s start off building an example using the gapminder data from +earlier. The most basic function is ggplot, which lets R +know that we’re creating a new plot. Any of the arguments we give the +ggplot function are the global options for the +plot: they apply to all layers on the plot.


R +

+ggplot(data = gapminder)
Blank plot, before adding any mapping aesthetics to ggplot().

Here we called ggplot and told it what data we want to +show on our figure. This is not enough information for +ggplot to actually draw anything. It only creates a blank +slate for other elements to be added to.


Now we’re going to add in the mapping aesthetics +using the aes function. aes tells +ggplot how variables in the data map to +aesthetic properties of the figure, such as which columns of +the data should be used for the x and +y locations.


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
Plotting area with axes for a scatter plot of life expectancy vs GDP, with no data points visible.

Here we told ggplot we want to plot the “gdpPercap” column of the gapminder data frame on the x-axis, and the “lifeExp” column on the y-axis. Notice that we didn’t need to explicitly pass aes these columns (e.g. x = gapminder[, "gdpPercap"]); this is because ggplot is smart enough to know to look in the data for that column!


The final part of making our plot is to tell ggplot how +we want to visually represent the data. We do this by adding a new +layer to the plot using one of the +geom functions.


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point()
Scatter plot of life expectancy vs GDP per capita, now showing the data points.

Here we used geom_point, which tells ggplot +we want to visually represent the relationship between +x and y as a scatterplot of +points.

+ +

Challenge 1 +


Modify the example so that the figure shows how life expectancy has +changed over time:


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point()

Hint: the gapminder dataset has a column called “year”, which should +appear on the x-axis.

+ +

Here is one possible solution:


R +

+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) + geom_point()
Binned scatterplot of life expectancy versus year showing how life expectancy has increased over time
+ +

Challenge 2 +


In the previous examples and challenge we’ve used the +aes function to tell the scatterplot geom +about the x and y locations of each +point. Another aesthetic property we can modify is the point +color. Modify the code from the previous challenge to +color the points by the “continent” column. What trends +do you see in the data? Are they what you expected?

+ +

The solution presented below adds color=continent to the +call of the aes function. The general trend seems to +indicate an increased life expectancy over the years. On continents with +stronger economies we find a longer life expectancy.


R +

+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_point()
Binned scatterplot of life expectancy vs year with color-coded continents showing value of 'aes' function

Layers +


Using a scatterplot probably isn’t the best for visualizing change +over time. Instead, let’s tell ggplot to visualize the data +as a line plot:


R +

+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, color=continent)) +
+  geom_line()

Instead of adding a geom_point layer, we’ve added a +geom_line layer.


However, the result doesn’t look quite as we might have expected: it +seems to be jumping around a lot in each continent. Let’s try to +separate the data by country, plotting one line for each country:


R +

+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
+  geom_line()

We’ve added the group aesthetic, which +tells ggplot to draw a line for each country.


But what if we want to visualize both lines and points on the plot? +We can add another layer to the plot:


R +

+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
+  geom_line() + geom_point()

It’s important to note that each layer is drawn on top of the +previous layer. In this example, the points have been drawn on top +of the lines. Here’s a demonstration:


R +

+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
+  geom_line(mapping = aes(color=continent)) + geom_point()

In this example, the aesthetic mapping of +color has been moved from the global plot options in +ggplot to the geom_line layer so it no longer +applies to the points. Now we can clearly see that the points are drawn +on top of the lines.

+ +

Tip: Setting an aesthetic to a value instead +of a mapping +


So far, we’ve seen how to use an aesthetic (such as +color) as a mapping to a variable in the data. +For example, when we use +geom_line(mapping = aes(color=continent)), ggplot will give +a different color to each continent. But what if we want to change the +color of all lines to blue? You may think that +geom_line(mapping = aes(color="blue")) should work, but it +doesn’t. Since we don’t want to create a mapping to a specific variable, +we can move the color specification outside of the aes() +function, like this: geom_line(color="blue").
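The difference between setting and mapping can be tried on a tiny data set. The sketch below assumes ggplot2 is installed; `toy` is a made-up data frame standing in for gapminder, and `p_set` and `p_map` are illustrative names.

```r
library(ggplot2)

# Small made-up data set standing in for gapminder (hypothetical values)
toy <- data.frame(
  year    = rep(c(1990, 2000, 2010), times = 2),
  lifeExp = c(60, 63, 66, 70, 72, 75),
  country = rep(c("A", "B"), each = 3)
)

# Setting: color is a fixed value, so every line is blue and no legend appears
p_set <- ggplot(toy, aes(x = year, y = lifeExp, group = country)) +
  geom_line(color = "blue")

# Mapping: "blue" is treated as a one-level categorical variable,
# so ggplot2 picks a color from its default palette and adds a legend
p_map <- ggplot(toy, aes(x = year, y = lifeExp, group = country)) +
  geom_line(mapping = aes(color = "blue"))
```

Printing `p_set` and `p_map` side by side shows why the second form "doesn't work": the string inside aes() is read as data, not as a color name.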

+ +

Challenge 3 +


Switch the order of the point and line layers from the previous +example. What happened?

+ +

The lines now get drawn over the points!


R +

+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
+ geom_point() + geom_line(mapping = aes(color=continent))
Plot of life expectancy vs year in which the lines, colored by continent, are drawn on top of the black points.

Transformations and statistics +


ggplot2 also makes it easy to overlay statistical models over the +data. To demonstrate we’ll go back to our first example:


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point()

Currently it’s hard to see the relationship between the points due to some strong outliers in GDP per capita. We can change the scale of units on the x axis using the scale functions. These control the mapping between the data values and visual values of an aesthetic. We can also modify the transparency of the points using the alpha argument, which is especially helpful when you have a large amount of data which is very clustered.


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10()
Scatterplot of GDP vs life expectancy showing logarithmic x-axis data spread

The scale_x_log10 function applied a transformation to +the coordinate system of the plot, so that each multiple of 10 is evenly +spaced from left to right. For example, a GDP per capita of 1,000 is the +same horizontal distance away from a value of 10,000 as the 10,000 value +is from 100,000. This helps to visualize the spread of the data along +the x-axis.
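The even spacing can be checked numerically with base R. This is a small illustrative sketch; `gdp` is a made-up vector holding the three values mentioned above.

```r
gdp <- c(1000, 10000, 100000)

log10(gdp)        # 3 4 5
diff(log10(gdp))  # 1 1: equal steps on the log scale
diff(gdp)         # 9000 90000: very unequal steps on the linear scale
```

On the linear scale the second gap is ten times the first, which is why the untransformed plot squeezes most countries into the left edge.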

+ +

Tip Reminder: Setting an aesthetic to a value +instead of a mapping +


Notice that we used geom_point(alpha = 0.5). As the +previous tip mentioned, using a setting outside of the +aes() function will cause this value to be used for all +points, which is what we want in this case. But just like any other +aesthetic setting, alpha can also be mapped to a variable in +the data. For example, we can give a different transparency to each +continent with +geom_point(mapping = aes(alpha = continent)).


We can fit a simple relationship to the data by adding another layer, +geom_smooth:


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm")


`geom_smooth()` using formula = 'y ~ x'
Scatter plot of life expectancy vs GDP per capita with a blue trend line summarising the relationship between variables, and gray shaded area indicating 95% confidence intervals for that trend line.

We can make the line thicker by setting the +size aesthetic in the geom_smooth +layer:


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm", size=1.5)


Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
+ℹ Please use `linewidth` instead.
+This warning is displayed once every 8 hours.
+Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
+generated.


`geom_smooth()` using formula = 'y ~ x'
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The blue trend line is slightly thicker than in the previous figure.

There are two ways an aesthetic can be specified. Here we set the size aesthetic by passing it as an argument to geom_smooth. Previously in the lesson we’ve used the aes function to define a mapping between data variables and their visual representation. As the warning above indicates, ggplot2 3.4.0 and later expect linewidth rather than size for lines.
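For readers on ggplot2 3.4.0 or later, the same plot can be written with `linewidth`, which avoids the deprecation warning. This is a sketch under stated assumptions: `toy` is a made-up data frame standing in for gapminder, and `formula = y ~ x` is passed explicitly only to silence the informational message.

```r
library(ggplot2)

# Made-up data standing in for gapminder (hypothetical values)
toy <- data.frame(
  gdpPercap = c(500, 1000, 5000, 10000, 30000),
  lifeExp   = c(45, 52, 63, 70, 78)
)

# linewidth replaces size for lines in ggplot2 3.4.0 and later
p <- ggplot(data = toy, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  geom_smooth(method = "lm", formula = y ~ x, linewidth = 1.5)
```

The `size` argument still controls point size in `geom_point`; only line thickness moved to `linewidth`.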

+ +

Challenge 4a +


Modify the color and size of the points on the point layer in the +previous example.


Hint: do not use the aes function.

+ +

Here is a possible solution. Notice that the color argument is supplied outside of the aes() function. This means that it applies to all data points on the graph and is not related to a specific variable.


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+ geom_point(size=3, color="orange") + scale_x_log10() +
+ geom_smooth(method="lm", size=1.5)


`geom_smooth()` using formula = 'y ~ x'
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency.
+ +

Challenge 4b +


Modify your solution to Challenge 4a so that the points are now a +different shape and are colored by continent with new trendlines. Hint: +The color argument can be used inside the aesthetic.

+ +

Here is a possible solution. Notice that supplying the color argument inside the aes() function enables you to connect it to a certain variable. The shape argument, as you can see, modifies all data points the same way (it is outside the aes() call), while the color argument, which is placed inside the aes() call, modifies a point’s color based on its continent value.


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
+ geom_point(size=3, shape=17) + scale_x_log10() +
+ geom_smooth(method="lm", size=1.5)


`geom_smooth()` using formula = 'y ~ x'

Multi-panel figures +


Earlier we visualized the change in life expectancy over time across +all countries in one plot. Alternatively, we can split this out over +multiple panels by adding a layer of facet panels.

+ +

Tip +


We start by making a subset of data including only countries located +in the Americas. This includes 25 countries, which will begin to clutter +the figure. Note that we apply a “theme” definition to rotate the x-axis +labels to maintain readability. Nearly everything in ggplot2 is +customizable.


R +

+americas <- gapminder[gapminder$continent == "Americas",]
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))

The facet_wrap layer took a “formula” as its argument, denoted by the tilde (~). This tells R to draw a panel for each unique value in the country column of the americas data frame.


Modifying text +


To clean this figure up for a publication we need to change some of +the text elements. The x-axis is too cluttered, and the y axis should +read “Life expectancy”, rather than the column name in the data +frame.


We can do this by adding a couple of different layers. The +theme layer controls the axis text, and overall text +size. Labels for the axes, plot title and any legend can be set using +the labs function. Legend titles are set using the same +names we used in the aes specification. Thus below the +color legend title is set using color = "Continent", while +the title of a fill legend would be set using +fill = "MyTitle".


R +

+ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_line() + facet_wrap( ~ country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Exporting the plot +


The ggsave() function allows you to export a plot +created with ggplot. You can specify the dimension and resolution of +your plot by adjusting the appropriate arguments (width, +height and dpi) to create high quality +graphics for publication. In order to save the plot from above, we first +assign it to a variable lifeExp_plot, then tell +ggsave to save that plot in png format to a +directory called results. (Make sure you have a +results/ folder in your working directory.)


R +

+lifeExp_plot <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_line() + facet_wrap( ~ country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+ggsave(filename = "results/lifeExp.png", plot = lifeExp_plot, width = 12, height = 10, dpi = 300, units = "cm")

There are two nice things about ggsave. First, it +defaults to the last plot, so if you omit the plot argument +it will automatically save the last plot you created with +ggplot. Secondly, it tries to determine the format you want +to save your plot in from the file extension you provide for the +filename (for example .png or .pdf). If you +need to, you can specify the format explicitly in the +device argument.
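When the file extension is missing or ambiguous, the format can be passed explicitly through `device`. The sketch below assumes ggplot2 is installed; `toy`, `p`, and `out_file` are illustrative names, and a temporary file is used so the example does not depend on a results/ folder existing.

```r
library(ggplot2)

toy <- data.frame(x = 1:3, y = c(2, 4, 8))   # made-up data
p <- ggplot(toy, aes(x = x, y = y)) + geom_line()

# Temporary file so the sketch does not depend on a results/ folder
out_file <- tempfile(fileext = ".pdf")

# device = "pdf" states the format explicitly instead of relying on the extension
ggsave(filename = out_file, plot = p, device = "pdf",
       width = 12, height = 10, units = "cm")

file.exists(out_file)
```

Passing `plot = p` explicitly, rather than relying on the last plot, makes a script safer to re-run after other plots have been drawn.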


This is a taste of what you can do with ggplot2. RStudio provides a +really useful cheat +sheet of the different layers available, and more extensive +documentation is available on the ggplot2 website. All +RStudio cheat sheets can be found here. Finally, +if you have no idea how to change something, a quick Google search will +usually send you to a relevant question and answer on Stack Overflow +with reusable code to modify!

+ +

Challenge 5 +


Generate boxplots to compare life expectancy between the different +continents during the available years.



  • Rename y axis as Life Expectancy.
  • +
  • Remove x axis labels.
  • +
+ +

Here is a possible solution. xlab() and ylab() set labels for the x and y axes, respectively. The axis title, text and ticks are attributes of the theme and must be modified within a theme() call.


R +

+ggplot(data = gapminder, mapping = aes(x = continent, y = lifeExp, fill = continent)) +
+ geom_boxplot() + facet_wrap(~year) +
+ ylab("Life Expectancy") +
+ theme(axis.title.x=element_blank(),
+       axis.text.x = element_blank(),
+       axis.ticks.x = element_blank())
+ +

Keypoints +

  • Use ggplot2 to create plots.
  • Think about graphics in layers: aesthetics, geometry, statistics, scale transformation, and grouping.
+ + +
+ + + diff --git a/instructor/09-vectorization.html b/instructor/09-vectorization.html new file mode 100644 index 000000000..d8750ac6f --- /dev/null +++ b/instructor/09-vectorization.html @@ -0,0 +1,1021 @@ + +R for Reproducible Scientific Analysis: Vectorization +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +



Last updated on 2023-10-26

+ + + +

Estimated time 25 minutes

+ +
+ +
+ + + +




  • How can I operate on all the elements of a vector at once?
  • +


  • To understand vectorized operations in R.
  • +

Most of R’s functions are vectorized, meaning that the function will +operate on all elements of a vector without needing to loop through and +act on each element one at a time. This makes writing code more concise, +easy to read, and less error prone.


R +

+x <- 1:4
+x * 2


[1] 2 4 6 8

The multiplication happened to each element of the vector.


We can also add two vectors together:


R +

+y <- 6:9
+x + y


[1]  7  9 11 13

Each element of x was added to its corresponding element +of y:


R +

x:  1  2  3  4
+    +  +  +  +
+y:  6  7  8  9
+    7  9 11 13

Here is how we would add two vectors together using a for loop:


R +

+output_vector <- c()
+for (i in 1:4) {
+  output_vector[i] <- x[i] + y[i]
+}
+output_vector


[1]  7  9 11 13

Compare this to the output using vectorised operations.


R +

+sum_xy <- x + y
+sum_xy


[1]  7  9 11 13
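The two approaches can be compared directly with `identical()`. This is a minimal sketch; `loop_sum` and `vec_sum` are illustrative names, and the loop version preallocates its result instead of growing it.

```r
x <- 1:4
y <- 6:9

# Loop version, with the result vector allocated up front
loop_sum <- integer(length(x))
for (i in seq_along(x)) {
  loop_sum[i] <- x[i] + y[i]
}

# Vectorized version: one expression, no index bookkeeping
vec_sum <- x + y

identical(loop_sum, vec_sum)
```

Both give the same values; the vectorized form is simply shorter and leaves less room for indexing mistakes.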
+ +

Challenge 1 +


Let’s try this on the pop column of the +gapminder dataset.


Make a new column in the gapminder data frame that +contains population in units of millions of people. Check the head or +tail of the data frame to make sure it worked.

+ +



R +

+gapminder$pop_millions <- gapminder$pop / 1e6
+head(gapminder)


      country year      pop continent lifeExp gdpPercap pop_millions
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453     8.425333
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530     9.240934
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007    10.267083
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971    11.537966
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811    13.079460
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134    14.880372
+ +

Challenge 2 +


On a single graph, plot population, in millions, against year, for +all countries. Do not worry about identifying which country is +which.


Repeat the exercise, graphing only for China, India, and Indonesia. +Again, do not worry about which is which.

+ +

Refresh your plotting skills by plotting population in millions +against year.


R +

+ggplot(gapminder, aes(x = year, y = pop_millions)) +
+ geom_point()
Scatter plot showing populations in the millions against the year for all countries, countries are not labeled.

R +

+countryset <- c("China","India","Indonesia")
+ggplot(gapminder[gapminder$country %in% countryset,],
+       aes(x = year, y = pop_millions)) +
+  geom_point()
Scatter plot showing populations in the millions against the year for China, India, and Indonesia, countries are not labeled.

Comparison operators, logical operators, and many functions are also +vectorized:


Comparison operators


R +

+x > 2



Logical operators


R +

+a <- x > 3  # or, for clarity, a <- (x > 3)
+a


+ +

Tip: some useful functions for logical +vectors +


any() will return TRUE if any +element of a vector is TRUE.
all() will return TRUE if all +elements of a vector are TRUE.
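A short sketch of these helpers on the vector used above. `sum()` and `which()` are additional base R functions included here as closely related tools, not something the tip itself mentions.

```r
x <- 1:4

any(x > 3)    # TRUE: the last element is greater than 3
all(x > 3)    # FALSE: the first three elements are not
all(x > 0)    # TRUE: every element is positive
sum(x > 2)    # 2: each TRUE counts as 1 when summed
which(x > 2)  # 3 4: positions of the TRUE elements
```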


Most functions also operate element-wise on vectors:




R +

+x <- 1:4
+log(x)


[1] 0.0000000 0.6931472 1.0986123 1.3862944

Vectorized operations work element-wise on matrices:


R +

+m <- matrix(1:12, nrow=3, ncol=4)
+m * -1


     [,1] [,2] [,3] [,4]
+[1,]   -1   -4   -7  -10
+[2,]   -2   -5   -8  -11
+[3,]   -3   -6   -9  -12
+ +

Tip: element-wise vs. matrix +multiplication +


Very important: the operator * gives you element-wise +multiplication! To do matrix multiplication, we need to use the +%*% operator:


R +

+m %*% matrix(1, nrow=4, ncol=1)


     [,1]
+[1,]   22
+[2,]   26
+[3,]   30

R +

+matrix(1:4, nrow=1) %*% matrix(1:4, ncol=1)


     [,1]
+[1,]   30

For more on matrix algebra, see the Quick-R reference guide.
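Matrix multiplication also has a shape requirement that element-wise multiplication does not: the number of columns on the left must equal the number of rows on the right. A minimal sketch using the same `m` as above (`result` is an illustrative name):

```r
m <- matrix(1:12, nrow = 3, ncol = 4)

dim(m * m)         # 3 4: element-wise, same shape as m
dim(m %*% t(m))    # 3 3: (3 x 4) times (4 x 3)
dim(t(m) %*% m)    # 4 4: (4 x 3) times (3 x 4)

# m %*% m fails: the 4 columns of m do not match its 3 rows
result <- tryCatch(m %*% m, error = function(e) "non-conformable")
result
```

Checking `dim()` on both operands before using `%*%` is a quick way to catch this kind of error.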

+ +

Challenge 3 +


Given the following matrix:


R +

+m <- matrix(1:12, nrow=3, ncol=4)


     [,1] [,2] [,3] [,4]
+[1,]    1    4    7   10
+[2,]    2    5    8   11
+[3,]    3    6    9   12

Write down what you think will happen when you run:

  1. m ^ -1
  2. m * c(1, 0, -1)
  3. m > c(0, 20)
  4. m * c(1, 0, -1, 2)

Did you get the output you expected? If not, ask a helper!

+ +


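  1. m ^ -1


          [,1]      [,2]      [,3]       [,4]
+[1,] 1.0000000 0.2500000 0.1428571 0.10000000
+[2,] 0.5000000 0.2000000 0.1250000 0.09090909
+[3,] 0.3333333 0.1666667 0.1111111 0.08333333

  2. m * c(1, 0, -1)


     [,1] [,2] [,3] [,4]
+[1,]    1    4    7   10
+[2,]    0    0    0    0
+[3,]   -3   -6   -9  -12

  3. m > c(0, 20)


      [,1]  [,2]  [,3]  [,4]
+[1,]  TRUE FALSE  TRUE FALSE
+[2,] FALSE  TRUE FALSE  TRUE
+[3,]  TRUE FALSE  TRUE FALSE

  4. m * c(1, 0, -1, 2)


     [,1] [,2] [,3] [,4]
+[1,]    1    8   -7    0
+[2,]    0    5   16  -11
+[3,]   -3    0    9   24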
+ +

Challenge 4 +


We’re interested in looking at the sum of the following sequence of +fractions:


R +

+ x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2)

This would be tedious to type out, and impossible for high values of +n. Use vectorisation to compute x when n=100. What is the sum when +n=10,000?

+ +

We’re interested in looking at the sum of the following sequence of +fractions:


R +

+ x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2)

This would be tedious to type out, and impossible for high values of +n. Can you use vectorisation to compute x, when n=100? How about when +n=10,000?


R +

+sum(1/(1:100)^2)


[1] 1.634984

R +

+sum(1/(1:1e04)^2)


[1] 1.644834

R +

+n <- 10000
+sum(1/(1:n)^2)


[1] 1.644834

We can also obtain the same results using a function:


R +

+inverse_sum_of_squares <- function(n) {
+  sum(1/(1:n)^2)
+}
+inverse_sum_of_squares(100)


[1] 1.634984

R +

+inverse_sum_of_squares(10000)


[1] 1.644834

R +

+n <- 10000
+inverse_sum_of_squares(n)


[1] 1.644834
+ +

Tip: Operations on vectors of unequal +length +


Operations can also be performed on vectors of unequal length, through a process known as recycling. This process automatically repeats the smaller vector until it matches the length of the larger vector. R will provide a warning if the length of the larger vector is not a multiple of the length of the smaller vector.


R +

+x <- c(1, 2, 3)
+y <- c(1, 2, 3, 4, 5, 6, 7)
+x + y


Warning in x + y: longer object length is not a multiple of shorter object length


[1] 2 4 6 5 7 9 8

Vector x was recycled to match the length of vector +y


R +

x:  1  2  3  1  2  3  1
+    +  +  +  +  +  +  +
+y:  1  2  3  4  5  6  7
+    2  4  6  5  7  9  8
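
For contrast, here is a minimal sketch of recycling when the longer length is an exact multiple of the shorter one. The vectors are made up for illustration; in this case R recycles silently and gives no warning.

```r
# Recycling when the longer length is an exact multiple of the shorter one.
x <- c(1, 2)
y <- c(1, 2, 3, 4, 5, 6)

# x is silently repeated three times: 1 2 1 2 1 2
recycled_sum <- x + y
print(recycled_sum)
# [1] 2 4 4 6 6 8
```

Because no warning appears here, it is worth checking vector lengths yourself when you do not intend to recycle.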
+ +

Keypoints +

  • Use vectorized operations instead of loops.
  • +
+ + + +
+ + +
+ + + diff --git a/instructor/10-functions.html b/instructor/10-functions.html new file mode 100644 index 000000000..c723aee66 --- /dev/null +++ b/instructor/10-functions.html @@ -0,0 +1,1222 @@ + +R for Reproducible Scientific Analysis: Functions Explained +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Functions Explained


Last updated on 2023-10-26

+ + + +

Estimated time 60 minutes

+ +
+ +
+ + + +




  • How can I write a new function in R?
  • +


  • Define a function that takes arguments.
  • +
  • Return a value from a function.
  • +
  • Check argument conditions with stopifnot() in +functions.
  • +
  • Test a function.
  • +
  • Set default values for function arguments.
  • +
  • Explain why we should divide programs into small, single-purpose +functions.
  • +

If we only had one data set to analyze, it would probably be faster +to load the file into a spreadsheet and use that to plot simple +statistics. However, the gapminder data is updated periodically, and we +may want to pull in that new information later and re-run our analysis +again. We may also obtain similar data from a different source in the +future.


In this lesson, we’ll learn how to write a function so that we can +repeat several operations with a single command.

+ +

What is a function? +


Functions gather a sequence of operations into a whole, preserving it +for ongoing use. Functions provide:

  • a name we can remember and invoke it by
  • +
  • relief from the need to remember the individual operations
  • +
  • a defined set of inputs and expected outputs
  • +
  • rich connections to the larger programming environment
  • +

As the basic building block of most programming languages, +user-defined functions constitute “programming” as much as any single +abstraction can. If you have written a function, you are a computer +programmer.


Defining a function +


Let’s open a new R script file in the functions/ +directory and call it functions-lesson.R.


The general structure of a function is:


R +

+my_function <- function(parameters) {
+  # perform action
+  # return value
+}

Let’s define a function fahr_to_kelvin() that converts +temperatures from Fahrenheit to Kelvin:


R +

+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}

We define fahr_to_kelvin() by assigning it to the output of function. The list of argument names is contained within parentheses. Next, the body of the function (the statements that are executed when it runs) is contained within curly braces ({}). The statements in the body are indented by two spaces. This makes the code easier to read but does not affect how the code operates.


It is useful to think of creating functions like writing a cookbook. First you define the “ingredients” that your function needs. In this case, we only need one ingredient to use our function: “temp”. After we list our ingredients, we then say what we will do with them. In this case, we take our ingredient and apply a set of mathematical operators to it.


When we call the function, the values we pass to it as arguments are +assigned to those variables so that we can use them inside the function. +Inside the function, we use a return statement to send a +result back to whoever asked for it.

+ +

Tip +


In R, the return statement is not required: a function automatically returns the value of the last expression evaluated in its body. For clarity, though, we will explicitly define the return statement.
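
As a minimal sketch of a function with no return statement, the version below performs the same conversion. The name fahr_to_kelvin_implicit is hypothetical and is used only for this illustration.

```r
# Same conversion as fahr_to_kelvin(), written without return().
# fahr_to_kelvin_implicit is a hypothetical name used only for this sketch.
fahr_to_kelvin_implicit <- function(temp) {
  # The value of this last expression becomes the function's result.
  ((temp - 32) * (5 / 9)) + 273.15
}

freezing <- fahr_to_kelvin_implicit(32)
print(freezing)
# [1] 273.15
```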


Let’s try running our function. Calling our own function is no +different from calling any other function:


R +

+# freezing point of water
+fahr_to_kelvin(32)


[1] 273.15

R +

+# boiling point of water
+fahr_to_kelvin(212)


[1] 373.15
+ +

Challenge 1 +


Write a function called kelvin_to_celsius() that takes a +temperature in Kelvin and returns that temperature in Celsius.


Hint: To convert from Kelvin to Celsius you subtract 273.15

+ +



R +

+kelvin_to_celsius <- function(temp) {
+  celsius <- temp - 273.15
+  return(celsius)
+}

Combining functions +


The real power of functions comes from mixing, matching and combining +them into ever-larger chunks to get the effect we want.


Let’s define two functions that will convert temperature from +Fahrenheit to Kelvin, and Kelvin to Celsius:


R +

+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+kelvin_to_celsius <- function(temp) {
+  celsius <- temp - 273.15
+  return(celsius)
+}
+ +

Challenge 2 +


Define the function to convert directly from Fahrenheit to Celsius, +by reusing the two functions above (or using your own functions if you +prefer).

+ +



R +

+fahr_to_celsius <- function(temp) {
+  temp_k <- fahr_to_kelvin(temp)
+  result <- kelvin_to_celsius(temp_k)
+  return(result)
+}

Interlude: Defensive Programming +


Now that we’ve begun to appreciate how writing functions provides an +efficient way to make R code re-usable and modular, we should note that +it is important to ensure that functions only work in their intended +use-cases. Checking function parameters is related to the concept of +defensive programming. Defensive programming encourages us to +frequently check conditions and throw an error if something is wrong. +These checks are referred to as assertion statements because we want to +assert some condition is TRUE before proceeding. They make +it easier to debug because they give us a better idea of where the +errors originate.


Checking conditions with stopifnot() +


Let’s start by re-examining fahr_to_kelvin(), our +function for converting temperatures from Fahrenheit to Kelvin. It was +defined like so:


R +

+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}

For this function to work as intended, the argument temp +must be a numeric value; otherwise, the mathematical +procedure for converting between the two temperature scales will not +work. To create an error, we can use the function stop(). +For example, since the argument temp must be a +numeric vector, we could check for this condition with an +if statement and throw an error if the condition was +violated. We could augment our function above like so:


R +

+fahr_to_kelvin <- function(temp) {
+  if (!is.numeric(temp)) {
+    stop("temp must be a numeric vector.")
+  }
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}

If we had multiple conditions or arguments to check, it would take many lines of code to check all of them. Luckily R provides the convenience function stopifnot(). We can list as many requirements as we like, each of which should evaluate to TRUE; stopifnot() throws an error if it finds one that is FALSE. Listing these conditions also serves a secondary purpose as extra documentation for the function.


Let’s try out defensive programming with stopifnot() by +adding assertions to check the input to our function +fahr_to_kelvin().


We want to assert the following: temp is a numeric +vector. We may do that like so:


R +

+fahr_to_kelvin <- function(temp) {
+  stopifnot(is.numeric(temp))
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}

It still works when given proper input.


R +

+# freezing point of water
+fahr_to_kelvin(temp = 32)


[1] 273.15

But fails instantly if given improper input.


R +

+# temp is a factor instead of numeric
+fahr_to_kelvin(temp = as.factor(32))


Error in fahr_to_kelvin(temp = as.factor(32)): is.numeric(temp) is not TRUE
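
stopifnot() accepts several conditions in one call. The sketch below is only an illustration: the name fahr_to_kelvin_checked and the two extra checks (no missing values, nothing below absolute zero) are assumptions and not part of the lesson's function.

```r
# fahr_to_kelvin_checked is a hypothetical variant used only for illustration.
fahr_to_kelvin_checked <- function(temp) {
  stopifnot(
    is.numeric(temp),   # temp must be a numeric vector
    !anyNA(temp),       # no missing values
    temp >= -459.67     # nothing below absolute zero in Fahrenheit
  )
  ((temp - 32) * (5 / 9)) + 273.15
}

valid <- fahr_to_kelvin_checked(c(32, 212))
print(valid)

# Capture the error instead of halting, so the message can be inspected.
msg <- tryCatch(
  fahr_to_kelvin_checked(-500),
  error = function(e) conditionMessage(e)
)
print(msg)
```

The error message names the first condition that failed, which is what makes the list of conditions useful as documentation.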
+ +

Challenge 3 +


Use defensive programming to ensure that our +fahr_to_celsius() function throws an error immediately if +the argument temp is specified inappropriately.

+ +

Extend our previous definition of the function by adding in an +explicit call to stopifnot(). Since +fahr_to_celsius() is a composition of two other functions, +checking inside here makes adding checks to the two component functions +redundant.


R +

+fahr_to_celsius <- function(temp) {
+  stopifnot(is.numeric(temp))
+  temp_k <- fahr_to_kelvin(temp)
+  result <- kelvin_to_celsius(temp_k)
+  return(result)
+}

More on combining functions +


Now, we’re going to define a function that calculates the Gross +Domestic Product of a nation from the data available in our dataset:


R +

+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat) {
+  gdp <- dat$pop * dat$gdpPercap
+  return(gdp)
+}

We define calcGDP() by assigning it to the output of function. The list of argument names is contained within parentheses. Next, the body of the function (the statements executed when you call the function) is contained within curly braces ({}).


We’ve indented the statements in the body by two spaces. This makes +the code easier to read but does not affect how it operates.


When we call the function, the values we pass to it are assigned to +the arguments, which become variables inside the body of the +function.


Inside the function, we use the return() function to +send back the result. This return() function is optional: R +will automatically return the results of whatever command is executed on +the last line of the function.


R +

+calcGDP(head(gapminder))


[1]  6567086330  7585448670  8758855797  9648014150  9678553274 11697659231

That’s not very informative. Let’s add some more arguments so we can +extract that per year and country.


R +

+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year=NULL, country=NULL) {
+  if(!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country,]
+  }
+  gdp <- dat$pop * dat$gdpPercap
+  new <- cbind(dat, gdp=gdp)
+  return(new)
+}

If you’ve been writing these functions down into a separate R script (a good idea!), you can load the functions into your R session by using the source() function:


R +

+source("functions/functions-lesson.R")

Ok, so there’s a lot going on in this function now. In plain English, +the function now subsets the provided data by year if the year argument +isn’t empty, then subsets the result by country if the country argument +isn’t empty. Then it calculates the GDP for whatever subset emerges from +the previous two steps. The function then adds the GDP as a new column +to the subsetted data and returns this as the final result. You can see +that the output is much more informative than a vector of numbers.


Let’s take a look at what happens when we specify the year:


R +

+head(calcGDP(gapminder, year=2007))


       country year      pop continent lifeExp  gdpPercap          gdp
+12 Afghanistan 2007 31889923      Asia  43.828   974.5803  31079291949
+24     Albania 2007  3600523    Europe  76.423  5937.0295  21376411360
+36     Algeria 2007 33333216    Africa  72.301  6223.3675 207444851958
+48      Angola 2007 12420476    Africa  42.731  4797.2313  59583895818
+60   Argentina 2007 40301927  Americas  75.320 12779.3796 515033625357
+72   Australia 2007 20434176   Oceania  81.235 34435.3674 703658358894

Or for a specific country:


R +

+calcGDP(gapminder, country="Australia")


     country year      pop continent lifeExp gdpPercap          gdp
+61 Australia 1952  8691212   Oceania  69.120  10039.60  87256254102
+62 Australia 1957  9712569   Oceania  70.330  10949.65 106349227169
+63 Australia 1962 10794968   Oceania  70.930  12217.23 131884573002
+64 Australia 1967 11872264   Oceania  71.100  14526.12 172457986742
+65 Australia 1972 13177000   Oceania  71.930  16788.63 221223770658
+66 Australia 1977 14074100   Oceania  73.490  18334.20 258037329175
+67 Australia 1982 15184200   Oceania  74.740  19477.01 295742804309
+68 Australia 1987 16257249   Oceania  76.320  21888.89 355853119294
+69 Australia 1992 17481977   Oceania  77.560  23424.77 409511234952
+70 Australia 1997 18565243   Oceania  78.830  26997.94 501223252921
+71 Australia 2002 19546792   Oceania  80.370  30687.75 599847158654
+72 Australia 2007 20434176   Oceania  81.235  34435.37 703658358894

Or both:


R +

+calcGDP(gapminder, year=2007, country="Australia")


     country year      pop continent lifeExp gdpPercap          gdp
+72 Australia 2007 20434176   Oceania  81.235  34435.37 703658358894

Let’s walk through the body of the function:


R +

calcGDP <- function(dat, year=NULL, country=NULL) {

Here we’ve added two arguments, year, and +country. We’ve set default arguments for both as +NULL using the = operator in the function +definition. This means that those arguments will take on those values +unless the user specifies otherwise.
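
A minimal sketch of how a default argument behaves, using a hypothetical function raise_to_power() that is not part of the lesson:

```r
# raise_to_power is a hypothetical function used only to illustrate defaults.
raise_to_power <- function(x, exponent = 2) {
  return(x^exponent)
}

squared <- raise_to_power(3)                # exponent falls back to 2
cubed   <- raise_to_power(3, exponent = 3)  # default overridden by the caller
print(c(squared, cubed))
# [1]  9 27
```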


R +

+  if(!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country,]
+  }

Here, we check whether each additional argument is set to NULL, and whenever it is not NULL we overwrite the dataset stored in dat with a subset given by the non-NULL argument.


Building these conditionals into the function makes it more flexible +for later. Now, we can use it to calculate the GDP for:

  • The whole dataset;
  • +
  • A single year;
  • +
  • A single country;
  • +
  • A single combination of year and country.
  • +

Because we subset with %in% rather than ==, we can also give multiple years or countries to those arguments.
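
To see why %in% is the safer choice when several values are supplied, here is a small sketch with made-up vectors. With ==, the shorter vector is recycled and compared position by position, which is rarely what we want.

```r
years <- c(1952, 1957, 1962, 1967)
wanted <- c(1967, 1952)

# %in% asks, for each element of years, "is it anywhere in wanted?"
in_result <- years %in% wanted
print(in_result)
# [1]  TRUE FALSE FALSE  TRUE

# == compares element by element, recycling wanted as 1967 1952 1967 1952,
# so no position happens to match here.
eq_result <- years == wanted
print(eq_result)
# [1] FALSE FALSE FALSE FALSE
```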

+ +

Tip: Pass by value +


Functions in R almost always make copies of the data to operate on +inside of a function body. When we modify dat inside the +function we are modifying the copy of the gapminder dataset stored in +dat, not the original variable we gave as the first +argument.


This is called “pass-by-value” and it makes writing code much safer: +you can always be sure that whatever changes you make within the body of +the function, stay inside the body of the function.
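
A short sketch of this behaviour, using a hypothetical function add_flag() and a made-up data frame:

```r
# add_flag is a hypothetical function; it modifies its argument internally.
add_flag <- function(dat) {
  dat$flag <- TRUE   # changes only the function's own copy
  return(dat)
}

original <- data.frame(pop = c(10, 20))
modified <- add_flag(original)

print(names(original))   # the original has no flag column
print(names(modified))   # the returned copy does
```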

+ +

Tip: Function scope +


Another important concept is scoping: any variables (or functions!) +you create or modify inside the body of a function only exist for the +lifetime of the function’s execution. When we call +calcGDP(), the variables dat, gdp +and new only exist inside the body of the function. Even if +we have variables of the same name in our interactive R session, they +are not modified in any way when executing a function.
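
The following sketch illustrates scoping with a hypothetical function compute_total(). The variable name gdp is reused on purpose to show that the session variable is left alone.

```r
gdp <- "defined in the interactive session"

# compute_total is a hypothetical function that uses its own local gdp.
compute_total <- function() {
  gdp <- 100 * 2   # local variable; exists only while the function runs
  return(gdp)
}

result <- compute_total()
print(result)
# [1] 200
print(gdp)
# [1] "defined in the interactive session"
```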


R +

  gdp <- dat$pop * dat$gdpPercap
+  new <- cbind(dat, gdp=gdp)
+  return(new)

Finally, we calculated the GDP on our new subset, and created a new +data frame with that column added. This means when we call the function +later we can see the context for the returned GDP values, which is much +better than in our first attempt where we got a vector of numbers.

+ +

Challenge 4 +


Test out your GDP function by calculating the GDP for New Zealand in +1987. How does this differ from New Zealand’s GDP in 1952?

+ +

R +

+  calcGDP(gapminder, year = c(1952, 1987), country = "New Zealand")

GDP for New Zealand in 1987: 65050008703


GDP for New Zealand in 1952: 21058193787

+ +

Challenge 5 +


The paste() function can be used to combine text +together, e.g:


R +

+best_practice <- c("Write", "programs", "for", "people", "not", "computers")
+paste(best_practice, collapse=" ")


[1] "Write programs for people not computers"

Write a function called fence() that takes two vectors +as arguments, called text and wrapper, and +prints out the text wrapped with the wrapper:


R +

+fence(text=best_practice, wrapper="***")

Note: the paste() function has an argument called sep, which specifies the separator between text. The default is a space: " ". The default for paste0() is no space: "".

+ +



R +

+fence <- function(text, wrapper){
+  text <- c(wrapper, text, wrapper)
+  result <- paste(text, collapse = " ")
+  return(result)
+}
+best_practice <- c("Write", "programs", "for", "people", "not", "computers")
+fence(text=best_practice, wrapper="***")


[1] "*** Write programs for people not computers ***"
+ +

Tip +


R has some unique aspects that can be exploited when performing more +complicated operations. We will not be writing anything that requires +knowledge of these more advanced concepts. In the future when you are +comfortable writing functions in R, you can learn more by reading the R +Language Manual or this chapter from Advanced R Programming by Hadley +Wickham.

+ +

Tip: Testing and documenting +


It’s important to both test functions and document them: documentation helps you, and others, understand what the purpose of your function is and how to use it, and testing makes sure that your function actually does what you think.


When you first start out, your workflow will probably look a lot like +this:

  1. Write a function
  2. Comment parts of the function to document its behaviour
  3. Load in the source file
  4. Experiment with it in the console to make sure it behaves as you expect
  5. Make any necessary bug fixes
  6. Rinse and repeat.

Formal documentation for functions, written in separate +.Rd files, gets turned into the documentation you see in +help files. The roxygen2 +package allows R coders to write documentation alongside the function +code and then process it into the appropriate .Rd files. +You will want to switch to this more formal method of writing +documentation when you start writing more complicated R projects. In +fact, packages are, in essence, bundles of functions with this formal +documentation. Loading your own functions through +source("functions.R") is equivalent to loading someone +else’s functions (or your own one day!) through +library("package").
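
As a rough sketch of what roxygen2-style documentation might look like for one of our functions: the special `#'` comments sit directly above the definition. The tags shown (@param, @return, @examples) are standard roxygen2 tags, but the wording of the descriptions is our own.

```r
#' Convert Fahrenheit to Kelvin
#'
#' @param temp A numeric vector of temperatures in degrees Fahrenheit.
#' @return A numeric vector of temperatures in Kelvin.
#' @examples
#' fahr_to_kelvin(32)
fahr_to_kelvin <- function(temp) {
  stopifnot(is.numeric(temp))
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}
```

To plain R these lines are ordinary comments, so the script still runs unchanged with source(); they only become .Rd files when the code is processed by roxygen2 inside a package.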


Formal automated tests can be written using the testthat package.
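
A minimal sketch of such a test, assuming the testthat package is installed. The test descriptions and the choice of reference points are ours; the function is redefined here only so the example is self-contained.

```r
library(testthat)

fahr_to_kelvin <- function(temp) {
  stopifnot(is.numeric(temp))
  ((temp - 32) * (5 / 9)) + 273.15
}

test_that("fahr_to_kelvin converts known reference points", {
  expect_equal(fahr_to_kelvin(32), 273.15)
  expect_equal(fahr_to_kelvin(212), 373.15)
})

test_that("fahr_to_kelvin rejects non-numeric input", {
  expect_error(fahr_to_kelvin("hot"))
})
```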

+ +

Keypoints +

  • Use function to define a new function in R.
  • +
  • Use parameters to pass values into functions.
  • +
  • Use stopifnot() to flexibly check function arguments in +R.
  • +
  • Load functions into programs using source().
  • +
+ + +
+ + + diff --git a/instructor/11-writing-data.html b/instructor/11-writing-data.html new file mode 100644 index 000000000..c536390e7 --- /dev/null +++ b/instructor/11-writing-data.html @@ -0,0 +1,688 @@ + +R for Reproducible Scientific Analysis: Writing Data +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Writing Data


Last updated on 2023-10-26

+ + + +

Estimated time 20 minutes

+ +
+ +
+ + + +




  • How can I save plots and data created in R?
  • +


  • To be able to write out plots and data from R.
  • +

Saving plots +


You have already seen how to save the most recent plot you create in +ggplot2, using the command ggsave. As a +refresher:


R +


You can save a plot from within RStudio using the ‘Export’ button in +the ‘Plot’ window. This will give you the option of saving as a .pdf or +as .png, .jpg or other image formats.


Sometimes you will want to save plots without creating them in the +‘Plot’ window first. Perhaps you want to make a pdf document with +multiple pages: each one a different plot, for example. Or perhaps +you’re looping through multiple subsets of a file, plotting data from +each subset, and you want to save each plot, but obviously can’t stop +the loop to click ‘Export’ for each one.


In this case you can use a more flexible approach. The function +pdf creates a new pdf device. You can control the size and +resolution using the arguments to this function.


R +

+pdf("Life_Exp_vs_time.pdf", width=12, height=4)
+ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=country)) +
+  geom_line() +
+  theme(legend.position = "none")
+# You then have to make sure to turn off the pdf device!
+dev.off()

Open up this document and have a look.
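
The looping case mentioned above can be sketched as follows. To stay self-contained, this example uses base graphics plot() and a made-up data frame rather than ggplot2 and gapminder; the file name is hypothetical. With ggplot2 inside a loop, you would need to wrap each plot in print() for it to be drawn.

```r
# Toy data standing in for the gapminder data (values are made up).
toy <- data.frame(
  group = rep(c("A", "B", "C"), each = 4),
  year  = rep(c(2000, 2001, 2002, 2003), times = 3),
  value = c(1, 2, 3, 4, 2, 4, 6, 8, 3, 3, 4, 5)
)

out_file <- file.path(tempdir(), "one_page_per_group.pdf")  # hypothetical name

pdf(out_file, width = 6, height = 4)
for (g in unique(toy$group)) {
  piece <- toy[toy$group == g, ]
  # Each call to plot() starts a new page in the open pdf device.
  plot(piece$year, piece$value, type = "l", main = paste("Group", g),
       xlab = "year", ylab = "value")
}
dev.off()  # close the device so the file is written completely
```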

+ +

Challenge 1 +


Rewrite your ‘pdf’ command to print a second page in the pdf, showing +a facet plot (hint: use facet_grid) of the same data with +one panel per continent.

+ +

R +

+pdf("Life_Exp_vs_time.pdf", width = 12, height = 4)
+p <- ggplot(data = gapminder, aes(x = year, y = lifeExp, colour = country)) +
+  geom_line() +
+  theme(legend.position = "none")
+p
+p + facet_grid(~continent)
+dev.off()

The commands jpeg, png etc. are used +similarly to produce documents in different formats.


Writing data +


At some point, you’ll also want to write out data from R.


We can use the write.table function for this, which is +very similar to read.table from before.


Let’s create a data-cleaning script. For this analysis, we only want to focus on the gapminder data for Australia:


R +

+aust_subset <- gapminder[gapminder$country == "Australia",]
+write.table(aust_subset,
+  file="cleaned-data/gapminder-aus.csv",
+  sep=","
+)

Let’s switch back to the shell to take a look at the data to make +sure it looks OK:



head cleaned-data/gapminder-aus.csv



Hmm, that’s not quite what we wanted. Where did all these quotation +marks come from? Also the row numbers are meaningless.


Let’s look at the help file to work out how to change this +behaviour.


R +

+?write.table

By default R will wrap character vectors with quotation marks when +writing out to file. It will also write out the row and column +names.


Let’s fix this:


R +

+write.table(
+  gapminder[gapminder$country == "Australia",],
+  file="cleaned-data/gapminder-aus.csv",
+  sep=",", quote=FALSE, row.names=FALSE
+)

Now let’s look at the data again using our shell skills:



head cleaned-data/gapminder-aus.csv



That looks better!
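
We can also check the file from within R. The sketch below writes a small made-up data frame to a temporary file (the file name is hypothetical) and reads it back.

```r
# Toy data frame standing in for the Australian subset (values are made up).
toy <- data.frame(
  country = c("Australia", "Australia"),
  year    = c(2002, 2007),
  pop     = c(100, 110)
)

out_file <- file.path(tempdir(), "toy-subset.csv")  # hypothetical file name

write.table(toy, file = out_file, sep = ",", quote = FALSE, row.names = FALSE)

# The first line is the header, without quotes or row numbers.
file_lines <- readLines(out_file)
print(file_lines)

# Reading the file back should give the same columns and values.
round_trip <- read.csv(out_file)
print(round_trip)
```

If a text column might itself contain commas, keeping the default quote = TRUE is the safer choice, since the quotes are what keep such a field together when the file is read back.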

+ +

Challenge 2 +


Write a data-cleaning script file that subsets the gapminder data to +include only data points collected since 1990.


Use this script to write out the new subset to a file in the +cleaned-data/ directory.

+ +

R +

+write.table(
+  gapminder[gapminder$year > 1990, ],
+  file = "cleaned-data/gapminder-after1990.csv",
+  sep = ",", quote = FALSE, row.names = FALSE
+)
+ +

Keypoints +

  • Save plots from RStudio using the ‘Export’ button.
  • +
  • Use write.table to save tabular data.
  • +
+ + +
+ + + diff --git a/instructor/12-plyr.html b/instructor/12-plyr.html new file mode 100644 index 000000000..77fa8c1cf --- /dev/null +++ b/instructor/12-plyr.html @@ -0,0 +1,1012 @@ + +R for Reproducible Scientific Analysis: Splitting and Combining Data Frames with plyr +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Splitting and Combining Data Frames with plyr


Last updated on 2023-10-26

+ + + +

Estimated time 60 minutes

+ +
+ +
+ + + +




  • How can I do different calculations on different sets of data?
  • +


  • To be able to use the split-apply-combine strategy for data +analysis.
  • +

Previously we looked at how you can use functions to simplify your +code. We defined the calcGDP function, which takes the +gapminder dataset, and multiplies the population and GDP per capita +column. We also defined additional arguments so we could filter by +year and country:


R +

+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year=NULL, country=NULL) {
+  if(!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country,]
+  }
+  gdp <- dat$pop * dat$gdpPercap
+  new <- cbind(dat, gdp=gdp)
+  return(new)
+}

A common task you’ll encounter when working with data is running calculations on different groups within the data. In the above, we were calculating the GDP by multiplying two columns together. But what if we wanted to calculate the mean GDP per continent?


We could run calcGDP and then take the mean of each +continent:


R +

+withGDP <- calcGDP(gapminder)
+mean(withGDP[withGDP$continent == "Africa", "gdp"])


[1] 20904782844

R +

+mean(withGDP[withGDP$continent == "Americas", "gdp"])


[1] 379262350210

R +

+mean(withGDP[withGDP$continent == "Asia", "gdp"])


[1] 227233738153

But this isn’t very nice. Yes, by using a function, you have +reduced a substantial amount of repetition. That is +nice. But there is still repetition. Repeating yourself will cost you +time, both now and later, and potentially introduce some nasty bugs.


We could write a new function that is flexible like +calcGDP, but this also takes a substantial amount of effort +and testing to get right.


The abstract problem we’re encountering here is known as “split-apply-combine”:

Split apply combine

We want to split our data into groups, in this case +continents, apply some calculations on that group, then +optionally combine the results together afterwards.
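
To make the three steps concrete, here is a minimal sketch in base R using split() and sapply(). The data frame is made up for illustration and is not the gapminder data.

```r
# Toy data standing in for withGDP (values are made up).
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia", "Europe"),
  gdp       = c(10, 30, 100, 300, 50)
)

# Split: one numeric vector of gdp values per continent.
pieces <- split(toy$gdp, toy$continent)

# Apply: the same calculation on every piece.
# Combine: sapply() gathers the results into one named vector.
mean_gdp <- sapply(pieces, mean)
print(mean_gdp)
```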


The plyr package +


For those of you who have used R before, you might be familiar with +the apply family of functions. While R’s built in functions +do work, we’re going to introduce you to another method for solving the +“split-apply-combine” problem. The plyr package provides a set of +functions that we find more user friendly for solving this problem.


We installed this package in an earlier challenge. Let us load it +now:


R +

+library("plyr")

Plyr has functions for operating on lists, +data.frames and arrays (matrices, or +n-dimensional vectors). Each function performs:

  1. A splitting operation.
  2. Apply a function on each split in turn.
  3. Recombine output data as a single data object.

The functions are named based on the data structure they expect as +input, and the data structure you want returned as output: [a]rray, +[l]ist, or [d]ata.frame. The first letter corresponds to the input data +structure, the second letter to the output data structure, and then the +rest of the function is named “ply”.


This gives us 9 core functions **ply. There are an additional three +functions which will only perform the split and apply steps, and not any +combine step. They’re named by their input data type and represent null +output by a _ (see table)


Note here that plyr’s use of “array” is different to R’s: an array in plyr can include a vector or matrix.

Full apply suite

Each of the xxply functions (daply, ddply, llply, laply, …) has the same structure, with 4 key features:


R +

+xxply(.data, .variables, .fun)
  • The first letter of the function name gives the input type and the +second gives the output type.
  • +
  • .data - gives the data object to be processed
  • +
  • .variables - identifies the splitting variables
  • +
  • .fun - gives the function to be called on each piece
  • +

Now we can quickly calculate the mean GDP per continent:


R +

+ddply(
+ .data = calcGDP(gapminder),
+ .variables = "continent",
+ .fun = function(x) mean(x$gdp)
+)


  continent           V1
+1    Africa  20904782844
+2  Americas 379262350210
+3      Asia 227233738153
+4    Europe 269442085301
+5   Oceania 188187105354

Let us walk through the previous code:

  • The ddply function feeds in a data.frame +(function starts with d) and returns another +data.frame (2nd letter is a d)
  • +
  • The first argument we gave was the data.frame we wanted to operate on: in this case the gapminder data. We called calcGDP on it first so that it would have the additional gdp column added to it.
  • +
  • The second argument indicated our split criteria: in this case the +“continent” column. Note that we gave the name of the column, not the +values of the column like we had done previously with subsetting. Plyr +takes care of these implementation details for you.
  • +
  • The third argument is the function we want to apply to each grouping +of the data. We had to define our own short function here: each subset +of the data gets stored in x, the first argument of our +function. This is an anonymous function: we haven’t defined it +elsewhere, and it has no name. It only exists in the scope of our call +to ddply.
  • +
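
The function passed as .fun does not have to be anonymous. The sketch below, which assumes plyr is installed and uses a made-up data frame, passes a named function (continent_mean, a hypothetical name), and then uses plyr's summarise so that the result column gets a name of our choosing instead of V1.

```r
library(plyr)

# Toy data standing in for calcGDP(gapminder) (values are made up).
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  gdp       = c(10, 30, 100, 300)
)

# A named function instead of an anonymous one.
continent_mean <- function(x) mean(x$gdp)
by_function <- ddply(.data = toy, .variables = "continent", .fun = continent_mean)
print(by_function)   # the result column is called V1

# summarise lets us choose the name of the result column.
by_summarise <- ddply(toy, "continent", summarise, mean_gdp = mean(gdp))
print(by_summarise)
```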
+ +

Challenge 1 +


Calculate the average life expectancy per continent. Which has the +longest? Which has the shortest?

+ +

R +

+ddply(
+ .data = gapminder,
+ .variables = "continent",
+ .fun = function(x) mean(x$lifeExp)
+)

Oceania has the longest and Africa the shortest.


What if we want a different type of output data structure?:


R +

+dlply(
+ .data = calcGDP(gapminder),
+ .variables = "continent",
+ .fun = function(x) mean(x$gdp)
+)


$Africa
+[1] 20904782844
+$Americas
+[1] 379262350210
+$Asia
+[1] 227233738153
+$Europe
+[1] 269442085301
+$Oceania
+[1] 188187105354
+attr(,"split_type")
+[1] "data.frame"
+attr(,"split_labels")
+  continent
+1    Africa
+2  Americas
+3      Asia
+4    Europe
+5   Oceania

We called the same function again, but changed the second letter to +an l, so the output was returned as a list.


We can specify multiple columns to group by:


R +

+ddply(
+ .data = calcGDP(gapminder),
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$gdp)
+)


   continent year           V1
+1     Africa 1952   5992294608
+2     Africa 1957   7359188796
+3     Africa 1962   8784876958
+4     Africa 1967  11443994101
+5     Africa 1972  15072241974
+6     Africa 1977  18694898732
+7     Africa 1982  22040401045
+8     Africa 1987  24107264108
+9     Africa 1992  26256977719
+10    Africa 1997  30023173824
+11    Africa 2002  35303511424
+12    Africa 2007  45778570846
+13  Americas 1952 117738997171
+14  Americas 1957 140817061264
+15  Americas 1962 169153069442
+16  Americas 1967 217867530844
+17  Americas 1972 268159178814
+18  Americas 1977 324085389022
+19  Americas 1982 363314008350
+20  Americas 1987 439447790357
+21  Americas 1992 489899820623
+22  Americas 1997 582693307146
+23  Americas 2002 661248623419
+24  Americas 2007 776723426068
+25      Asia 1952  34095762661
+26      Asia 1957  47267432088
+27      Asia 1962  60136869012
+28      Asia 1967  84648519224
+29      Asia 1972 124385747313
+30      Asia 1977 159802590186
+31      Asia 1982 194429049919
+32      Asia 1987 241784763369
+33      Asia 1992 307100497486
+34      Asia 1997 387597655323
+35      Asia 2002 458042336179
+36      Asia 2007 627513635079
+37    Europe 1952  84971341466
+38    Europe 1957 109989505140
+39    Europe 1962 138984693095
+40    Europe 1967 173366641137
+41    Europe 1972 218691462733
+42    Europe 1977 255367522034
+43    Europe 1982 279484077072
+44    Europe 1987 316507473546
+45    Europe 1992 342703247405
+46    Europe 1997 383606933833
+47    Europe 2002 436448815097
+48    Europe 2007 493183311052
+49   Oceania 1952  54157223944
+50   Oceania 1957  66826828013
+51   Oceania 1962  82336453245
+52   Oceania 1967 105958863585
+53   Oceania 1972 134112109227
+54   Oceania 1977 154707711162
+55   Oceania 1982 176177151380
+56   Oceania 1987 209451563998
+57   Oceania 1992 236319179826
+58   Oceania 1997 289304255183
+59   Oceania 2002 345236880176
+60   Oceania 2007 403657044512

R +

+daply(
+ .data = calcGDP(gapminder),
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$gdp)
+)


+continent          1952         1957         1962         1967         1972
+  Africa     5992294608   7359188796   8784876958  11443994101  15072241974
+  Americas 117738997171 140817061264 169153069442 217867530844 268159178814
+  Asia      34095762661  47267432088  60136869012  84648519224 124385747313
+  Europe    84971341466 109989505140 138984693095 173366641137 218691462733
+  Oceania   54157223944  66826828013  82336453245 105958863585 134112109227
+          year
+continent          1977         1982         1987         1992         1997
+  Africa    18694898732  22040401045  24107264108  26256977719  30023173824
+  Americas 324085389022 363314008350 439447790357 489899820623 582693307146
+  Asia     159802590186 194429049919 241784763369 307100497486 387597655323
+  Europe   255367522034 279484077072 316507473546 342703247405 383606933833
+  Oceania  154707711162 176177151380 209451563998 236319179826 289304255183
+          year
+continent          2002         2007
+  Africa    35303511424  45778570846
+  Americas 661248623419 776723426068
+  Asia     458042336179 627513635079
+  Europe   436448815097 493183311052
+  Oceania  345236880176 403657044512

You can use these functions in place of for loops (and +it is usually faster to do so). To replace a for loop, put the code that +was in the body of the for loop inside an anonymous +function.


R +

+d_ply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = function(x) {
+    meanGDPperCap <- mean(x$gdpPercap)
+    print(paste(
+      "The mean GDP per capita for", unique(x$continent),
+      "is", format(meanGDPperCap, big.mark = ",")
+    ))
+  }
+)


[1] "The mean GDP per capita for Africa is 2,193.755"
+[1] "The mean GDP per capita for Americas is 7,136.11"
+[1] "The mean GDP per capita for Asia is 7,902.15"
+[1] "The mean GDP per capita for Europe is 14,469.48"
+[1] "The mean GDP per capita for Oceania is 18,621.61"
+ +

Tip: printing numbers +


The format function can be used to make numeric values +“pretty” for printing out in messages.
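As a minimal sketch of what format() does with the big.mark argument, using made-up numbers rather than values from the gapminder data (the variable names here are purely illustrative):

```r
# Illustrative values only; not taken from the gapminder data
pop_label <- format(1234567, big.mark = ",")

# Rounding first keeps the printed number of decimals under control
gdp_label <- format(round(2193.7546, 2), big.mark = ",")

print(paste("Population:", pop_label))
print(paste("GDP per capita:", gdp_label))
```

Note that format() returns a character string, so the result is meant for messages and labels, not for further arithmetic.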

+ +

Challenge 2 +


Calculate the average life expectancy per continent and year. Which continent had the longest and shortest in 2007? Which had the greatest change between 1952 and 2007?

+ +

R +

+solution <- ddply(
+ .data = gapminder,
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$lifeExp)
+)
+solution_2007 <- solution[solution$year == 2007, ]

Oceania had the longest average life expectancy in 2007 and Africa the shortest.


R +

+solution_1952_2007 <- cbind(solution[solution$year == 1952, ], solution_2007)
+# columns 3 and 6 hold the 1952 and 2007 means after cbind()
+difference_1952_2007 <- data.frame(continent = solution_1952_2007$continent,
+                                   year_1952 = solution_1952_2007[[3]],
+                                   year_2007 = solution_1952_2007[[6]],
+                                   difference = solution_1952_2007[[6]] - solution_1952_2007[[3]])

Asia had the greatest difference, and Oceania the least.

+ +

Alternate Challenge +


Without running them, which of the following will calculate the +average life expectancy per continent:

  1. +

R +

+ddply(
+  .data = gapminder,
+  .variables = gapminder$continent,
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)
  2. +

R +

+ddply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = mean(dataGroup$lifeExp)
+)
  3. +

R +

+ddply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)
  4. +

R +

+adply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)
+ +

Answer 3 will calculate the average life expectancy per +continent.

+ +

Keypoints +

  • Use the plyr package to split data, apply functions to subsets, and combine the results.
+ + +
+ + + diff --git a/instructor/13-dplyr.html b/instructor/13-dplyr.html new file mode 100644 index 000000000..048694649 --- /dev/null +++ b/instructor/13-dplyr.html @@ -0,0 +1,1240 @@ + +R for Reproducible Scientific Analysis: Data Frame Manipulation with dplyr +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Data Frame Manipulation with dplyr


Last updated on 2023-10-26

+ + + +

Estimated time 55 minutes

+ +
+ +
+ + + +




  • How can I manipulate data frames without repeating myself?


  • To be able to use the six main data frame manipulation ‘verbs’ with pipes in dplyr.
  • To understand how group_by() and summarize() can be combined to summarize datasets.
  • Be able to analyze a subset of data using logical filtering.

Manipulation of data frames means many things to many researchers: we often select certain observations (rows) or variables (columns), we often group the data by one or more variables, or we even calculate summary statistics. We can do these operations using the normal base R operations:


R +

+mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])


[1] 2193.755

R +

+mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])


[1] 7136.11

R +

+mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])


[1] 7902.15

But this isn’t very nice because there is a fair bit of +repetition. Repeating yourself will cost you time, both now and later, +and potentially introduce some nasty bugs.


The dplyr package +


Luckily, the dplyr +package provides a number of very useful functions for manipulating data +frames in a way that will reduce the above repetition, reduce the +probability of making errors, and probably even save you some typing. As +an added bonus, you might even find the dplyr grammar +easier to read.

+ +

Tip: Tidyverse +


The dplyr package belongs to a broader family of opinionated R packages designed for data science called the “Tidyverse”. These packages are specifically designed to work harmoniously together. Some of these packages will be covered in this course, but you can find more complete information here: https://www.tidyverse.org/.


Here we’re going to cover 5 of the most commonly used functions as +well as using pipes (%>%) to combine them.

  1. select()
  2. filter()
  3. group_by()
  4. summarize()
  5. mutate()

If you have not installed this package earlier, please do so:


R +

+install.packages('dplyr')

Now let’s load the package:


R +

+library("dplyr")

Using select() +


If, for example, we wanted to move forward with only a few of the +variables in our data frame we could use the select() +function. This will keep only the variables you select.


R +

+year_country_gdp <- select(gapminder, year, country, gdpPercap)

Diagram illustrating use of select function to select two columns of a data frame

If we want to remove only one column from the gapminder data, for example the continent column, we can prefix its name with a minus sign:


R +

+smaller_gapminder_data <- select(gapminder, -continent)

If we open up year_country_gdp we’ll see that it only +contains the year, country and gdpPercap. Above we used ‘normal’ +grammar, but the strengths of dplyr lie in combining +several functions using pipes. Since the pipes grammar is unlike +anything we’ve seen in R before, let’s repeat what we’ve done above +using pipes.


R +

+year_country_gdp <- gapminder %>% select(year, country, gdpPercap)

To help you understand why we wrote that in that way, let’s walk through it step by step. First we summon the gapminder data frame and pass it on, using the pipe symbol %>%, to the next step, which is the select() function. In this case we don’t specify which data object we use in the select() function since it gets that from the previous pipe. Fun Fact: There is a good chance you have encountered pipes before in the shell. In R, a pipe symbol is %>% while in the shell it is | but the concept is the same!
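The equivalence between the two forms can be checked on a small example. This is only a sketch: the toy data frame and its values are made up here and merely stand in for gapminder.

```r
library(dplyr)

# Toy data frame standing in for gapminder (values are made up)
toy <- data.frame(
  country = c("A", "B", "C"),
  year = c(2002, 2002, 2007),
  gdpPercap = c(100, 200, 300)
)

# Function-call form: the data frame is passed as the first argument
direct <- select(toy, year, country)

# Pipe form: the left-hand side becomes the first argument of select()
piped <- toy %>% select(year, country)

# Both forms produce the same data frame
identical(direct, piped)
```

Reading the pipe as “then” often helps: take toy, then select year and country.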

+ +

Tip: Renaming data frame columns in dplyr +


In Chapter 4 we covered how you can rename columns with base R by +assigning a value to the output of the names() function. +Just like select, this is a bit cumbersome, but thankfully dplyr has a +rename() function.


Within a pipeline, the syntax is +rename(new_name = old_name). For example, we may want to +rename the gdpPercap column name from our select() +statement above.


R +

+tidy_gdp <- year_country_gdp %>% rename(gdp_per_capita = gdpPercap)
+head(tidy_gdp)


  year     country gdp_per_capita
+1 1952 Afghanistan       779.4453
+2 1957 Afghanistan       820.8530
+3 1962 Afghanistan       853.1007
+4 1967 Afghanistan       836.1971
+5 1972 Afghanistan       739.9811
+6 1977 Afghanistan       786.1134

Using filter() +


If we now want to move forward with the above, but only with European countries, we can combine select and filter:


R +

+year_country_gdp_euro <- gapminder %>%
+    filter(continent == "Europe") %>%
+    select(year, country, gdpPercap)

If we now want to show life expectancy of European countries but only +for a specific year (e.g., 2007), we can do as below.


R +

+europe_lifeExp_2007 <- gapminder %>%
+  filter(continent == "Europe", year == 2007) %>%
+  select(country, lifeExp)
+ +

Challenge 1 +


Write a single command (which can span multiple lines and includes pipes) that will produce a data frame that has the African values for lifeExp, country and year, but not for other continents. How many rows does your data frame have and why?

+ +

R +

+year_country_lifeExp_Africa <- gapminder %>%
+                           filter(continent == "Africa") %>%
+                           select(year, country, lifeExp)

As with last time, first we pass the gapminder data frame to the +filter() function, then we pass the filtered version of the +gapminder data frame to the select() function. +Note: The order of operations is very important in this +case. If we used ‘select’ first, filter would not be able to find the +variable continent since we would have removed it in the previous +step.
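The effect of the ordering can be seen on a small example. The sketch below uses a made-up toy data frame (not gapminder) and tryCatch() to record whether the second pipeline raises an error; it assumes no other object called continent exists in your session.

```r
library(dplyr)

# Toy data frame with made-up values
toy <- data.frame(
  continent = c("Africa", "Europe", "Africa"),
  country = c("A", "B", "C"),
  lifeExp = c(50, 75, 55)
)

# filter() first: the continent column is still available
ok <- toy %>%
  filter(continent == "Africa") %>%
  select(country, lifeExp)

# select() first drops continent, so filter() can no longer find it
failed <- tryCatch(
  {
    toy %>%
      select(country, lifeExp) %>%
      filter(continent == "Africa")
    FALSE
  },
  error = function(e) TRUE
)

failed
```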


Using group_by() +


Now, we were supposed to be reducing the error-prone repetitiveness of what can be done with base R, but up to now we haven’t done that since we would have to repeat the above for each continent. Instead of filter(), which will only pass observations that meet your criteria (in the above: continent=="Europe"), we can use group_by(), which will essentially use every unique criterion that you could have used in filter.


R +

+str(gapminder)


'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...

R +

+str(gapminder %>% group_by(continent))


gropd_df [1,704 × 6] (S3: grouped_df/tbl_df/tbl/data.frame)
+ $ country  : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num [1:1704] 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr [1:1704] "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
+ - attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
+  ..$ continent: chr [1:5] "Africa" "Americas" "Asia" "Europe" ...
+  ..$ .rows    : list<int> [1:5] 
+  .. ..$ : int [1:624] 25 26 27 28 29 30 31 32 33 34 ...
+  .. ..$ : int [1:300] 49 50 51 52 53 54 55 56 57 58 ...
+  .. ..$ : int [1:396] 1 2 3 4 5 6 7 8 9 10 ...
+  .. ..$ : int [1:360] 13 14 15 16 17 18 19 20 21 22 ...
+  .. ..$ : int [1:24] 61 62 63 64 65 66 67 68 69 70 ...
+  .. ..@ ptype: int(0) 
+  ..- attr(*, ".drop")= logi TRUE

You will notice that the structure of the data frame where we used group_by() (grouped_df) is not the same as the original gapminder (data.frame). A grouped_df can be thought of as a list where each item in the list is a data.frame which contains only the rows that correspond to a particular value of continent (at least in the example above).
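The grouping information can also be inspected directly with a few dplyr helper functions. This is a small sketch on a made-up toy data frame, not on gapminder itself:

```r
library(dplyr)

# Toy data frame with made-up values
toy <- data.frame(
  continent = c("Asia", "Africa", "Asia", "Europe"),
  lifeExp = c(60, 50, 62, 75)
)

grouped <- toy %>% group_by(continent)

n_groups(grouped)    # how many distinct groups there are
group_keys(grouped)  # one row per group, holding the grouping values
group_rows(grouped)  # the row numbers that belong to each group

# ungroup() removes the grouping again; the rows themselves are unchanged
ungrouped <- grouped %>% ungroup()
```

Grouping never changes the rows or columns of the data; it only records which rows belong together so that later verbs can work group by group.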

Diagram illustrating how the group by function organizes a data frame into groups

Using summarize() +


The above was a bit on the uneventful side but +group_by() is much more exciting in conjunction with +summarize(). This will allow us to create new variable(s) +by using functions that repeat for each of the continent-specific data +frames. That is to say, using the group_by() function, we +split our original data frame into multiple pieces, then we can run +functions (e.g. mean() or sd()) within +summarize().


R +

+gdp_bycontinents <- gapminder %>%
+    group_by(continent) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap))
Diagram illustrating the use of group by and summarize together to create a new variable

R +

continent mean_gdpPercap
+     <fctr>          <dbl>
+1    Africa       2193.755
+2  Americas       7136.110
+3      Asia       7902.150
+4    Europe      14469.476
+5   Oceania      18621.609

That allowed us to calculate the mean gdpPercap for each continent, +but it gets even better.

+ +

Challenge 2 +


Calculate the average life expectancy per country. Which has the +longest average life expectancy and which has the shortest average life +expectancy?

+ +

R +

+lifeExp_bycountry <- gapminder %>%
+   group_by(country) %>%
+   summarize(mean_lifeExp = mean(lifeExp))
+lifeExp_bycountry %>%
+   filter(mean_lifeExp == min(mean_lifeExp) | mean_lifeExp == max(mean_lifeExp))


# A tibble: 2 × 2
+  country      mean_lifeExp
+  <chr>               <dbl>
+1 Iceland              76.5
+2 Sierra Leone         36.8

Another way to do this is to use the dplyr function +arrange(), which arranges the rows in a data frame +according to the order of one or more variables from the data frame. It +has similar syntax to other functions from the dplyr +package. You can use desc() inside arrange() +to sort in descending order.


R +

+lifeExp_bycountry %>%
+   arrange(mean_lifeExp) %>%
+   head(1)


# A tibble: 1 × 2
+  country      mean_lifeExp
+  <chr>               <dbl>
+1 Sierra Leone         36.8

R +

+lifeExp_bycountry %>%
+   arrange(desc(mean_lifeExp)) %>%
+   head(1)


# A tibble: 1 × 2
+  country mean_lifeExp
+  <chr>          <dbl>
+1 Iceland         76.5

Alphabetical order works too


R +

+lifeExp_bycountry %>%
+   arrange(desc(country)) %>%
+   head(1)


# A tibble: 1 × 2
+  country  mean_lifeExp
+  <chr>           <dbl>
+1 Zimbabwe         52.7

The function group_by() allows us to group by multiple +variables. Let’s group by year and +continent.


R +

+gdp_bycontinents_byyear <- gapminder %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
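The message above is informational: after summarizing by two variables, the result is still grouped by the first one. As a hedged sketch (assuming a dplyr version that supports the `.groups` argument, and using a made-up toy data frame), the behaviour can be made explicit like this:

```r
library(dplyr)

# Toy data frame with made-up values
toy <- data.frame(
  continent = c("Asia", "Asia", "Europe", "Europe"),
  year = c(2002, 2007, 2002, 2007),
  gdpPercap = c(100, 200, 300, 400)
)

# "drop_last" spells out the default seen above: the result stays grouped by continent
still_grouped <- toy %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap), .groups = "drop_last")

# "drop" returns a result with no grouping at all
no_groups <- toy %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap), .groups = "drop")

group_vars(still_grouped)
group_vars(no_groups)
```

Setting `.groups` explicitly also silences the message.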

That is already quite powerful, but it gets even better! You’re not +limited to defining 1 new variable in summarize().


R +

+gdp_pop_bycontinents_byyear <- gapminder %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.

count() and n() +


A very common operation is to count the number of observations for +each group. The dplyr package comes with two related +functions that help with this.


For instance, if we wanted to check the number of countries included +in the dataset for the year 2002, we can use the count() +function. It takes the name of one or more columns that contain the +groups we are interested in, and we can optionally sort the results in +descending order by adding sort=TRUE:


R +

+gapminder %>%
+    filter(year == 2002) %>%
+    count(continent, sort = TRUE)


  continent  n
+1    Africa 52
+2      Asia 33
+3    Europe 30
+4  Americas 25
+5   Oceania  2
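For intuition, count() can be thought of as a shortcut for grouping, summarizing with n(), and optionally sorting. The sketch below compares the two routes on a made-up toy data frame:

```r
library(dplyr)

# Toy data frame with made-up values
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Africa", "Asia", "Oceania")
)

# Shortcut: count the rows per continent, largest group first
via_count <- toy %>%
  count(continent, sort = TRUE)

# Long form: group, count with n(), then sort in descending order
via_summarize <- toy %>%
  group_by(continent) %>%
  summarize(n = n()) %>%
  arrange(desc(n))

via_count
via_summarize
```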

If we need to use the number of observations in calculations, the n() function is useful. It will return the total number of observations in the current group rather than counting the number of observations in each group within a specific column. For instance, if we wanted to get the standard error of the life expectancy per continent:


R +

+gapminder %>%
+    group_by(continent) %>%
+    summarize(se_le = sd(lifeExp)/sqrt(n()))


# A tibble: 5 × 2
+  continent se_le
+  <chr>     <dbl>
+1 Africa    0.366
+2 Americas  0.540
+3 Asia      0.596
+4 Europe    0.286
+5 Oceania   0.775

You can also chain together several summary operations; in this case calculating the minimum, maximum, mean and standard error of each continent’s per-country life expectancy:


R +

+gapminder %>%
+    group_by(continent) %>%
+    summarize(
+      mean_le = mean(lifeExp),
+      min_le = min(lifeExp),
+      max_le = max(lifeExp),
+      se_le = sd(lifeExp)/sqrt(n()))


# A tibble: 5 × 5
+  continent mean_le min_le max_le se_le
+  <chr>       <dbl>  <dbl>  <dbl> <dbl>
+1 Africa       48.9   23.6   76.4 0.366
+2 Americas     64.7   37.6   80.7 0.540
+3 Asia         60.1   28.8   82.6 0.596
+4 Europe       71.9   43.6   81.8 0.286
+5 Oceania      74.3   69.1   81.2 0.775

Using mutate() +


We can also create new variables prior to (or even after) summarizing +information using mutate().


R +

+gdp_pop_bycontinents_byyear <- gapminder %>%
+    mutate(gdp_billion = gdpPercap*pop/10^9) %>%
+    group_by(continent,year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop),
+              mean_gdp_billion = mean(gdp_billion),
+              sd_gdp_billion = sd(gdp_billion))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.

Connect mutate with logical filtering: ifelse +


When creating new variables, we can combine this with a logical condition. A simple combination of mutate() and ifelse() facilitates filtering right where it is needed: at the moment of creating something new. This easy-to-read statement is a fast and powerful way of discarding certain data (even though the overall dimension of the data frame will not change) or of updating values depending on the given condition.


R +

+## keeping all data but "filtering" on a certain condition
+# calculate GDP only for countries with a life expectancy above 25
+gdp_pop_bycontinents_byyear_above25 <- gapminder %>%
+    mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop),
+              mean_gdp_billion = mean(gdp_billion),
+              sd_gdp_billion = sd(gdp_billion))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
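One consequence of replacing values with NA is worth checking: mean() and sd() return NA for any group that contains a missing value unless na.rm = TRUE is supplied. The summary table earlier shows a minimum life expectancy of 23.6 in Africa, so at least one group in the code above would be affected. A small sketch on made-up data (the column gdp_billion_raw is a hypothetical stand-in for the computed GDP):

```r
library(dplyr)

# Toy data frame with made-up values
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  lifeExp = c(24, 50, 60, 62),
  gdp_billion_raw = c(1, 3, 10, 20)
)

result <- toy %>%
  # set GDP to NA where life expectancy is 25 or below
  mutate(gdp_billion = ifelse(lifeExp > 25, gdp_billion_raw, NA)) %>%
  group_by(continent) %>%
  summarize(
    mean_default = mean(gdp_billion),               # NA if any value is missing
    mean_na_rm = mean(gdp_billion, na.rm = TRUE)    # ignores missing values
  )

result
```

Whether to drop the missing values or to keep the NA as a warning sign depends on the analysis; the point is only that the choice should be deliberate.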

R +

+## updating only if a certain condition is fulfilled
+# for life expectancies above 40 years, the GDP to be expected in the future is scaled
+gdp_future_bycontinents_byyear_high_lifeExp <- gapminder %>%
+    mutate(gdp_futureExpectation = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              mean_gdpPercap_expected = mean(gdp_futureExpectation))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.

Combining dplyr and ggplot2 +


First install and load ggplot2:


R +

+install.packages("ggplot2")

R +

+library("ggplot2")

In the plotting lesson we looked at how to make a multi-panel figure +by adding a layer of facet panels using ggplot2. Here is +the code we used (with some extra comments):


R +

+# Filter countries located in the Americas
+americas <- gapminder[gapminder$continent == "Americas", ]
+# Make the plot
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))

This code makes the right plot but it also creates an intermediate +variable (americas) that we might not have any other uses +for. Just as we used %>% to pipe data along a chain of +dplyr functions we can use it to pass data to +ggplot(). Because %>% replaces the first +argument in a function we don’t need to specify the data = +argument in the ggplot() function. By combining +dplyr and ggplot2 functions we can make the +same figure without creating any new variables or modifying the +data.


R +

+gapminder %>%
+  # Filter countries located in the Americas
+  filter(continent == "Americas") %>%
+  # Make the plot
+  ggplot(mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))

More examples of using the function mutate() and the +ggplot2 package.


R +

+gapminder %>%
+  # extract first letter of country name into new column
+  mutate(startsWith = substr(country, 1, 1)) %>%
+  # only keep countries starting with A or Z
+  filter(startsWith %in% c("A", "Z")) %>%
+  # plot lifeExp into facets
+  ggplot(aes(x = year, y = lifeExp, colour = continent)) +
+  geom_line() +
+  facet_wrap(vars(country)) +
+  theme_minimal()
+ +

Advanced Challenge +


Calculate the average life expectancy in 2002 of 2 randomly selected +countries for each continent. Then arrange the continent names in +reverse order. Hint: Use the dplyr +functions arrange() and sample_n(), they have +similar syntax to other dplyr functions.

+ +

R +

+lifeExp_2countries_bycontinents <- gapminder %>%
+   filter(year==2002) %>%
+   group_by(continent) %>%
+   sample_n(2) %>%
+   summarize(mean_lifeExp=mean(lifeExp)) %>%
+   arrange(desc(continent))

Other great resources +

+ +

Keypoints +

  • Use the dplyr package to manipulate data frames.
  • Use select() to choose variables from a data frame.
  • Use filter() to choose data based on values.
  • Use group_by() and summarize() to work with subsets of data.
  • Use mutate() to create new variables.
+ + +
+ + + diff --git a/instructor/14-tidyr.html b/instructor/14-tidyr.html new file mode 100644 index 000000000..1b636a826 --- /dev/null +++ b/instructor/14-tidyr.html @@ -0,0 +1,1161 @@ + +R for Reproducible Scientific Analysis: Data Frame Manipulation with tidyr +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Data Frame Manipulation with tidyr


Last updated on 2023-10-26

+ + + +

Estimated time 45 minutes

+ +
+ +
+ + + +




  • How can I change the layout of a data frame?


  • To understand the concepts of ‘longer’ and ‘wider’ data frame formats and be able to convert between them with tidyr.

Researchers often want to reshape their data frames from ‘wide’ to +‘longer’ layouts, or vice-versa. The ‘long’ layout or format is +where:

  • each column is a variable
  • each row is an observation

In the purely ‘long’ (or ‘longest’) format, you usually have 1 column +for the observed variable and the other columns are ID variables.


For the ‘wide’ format each row is often a site/subject/patient and you have multiple observation variables containing the same type of data. These can be either repeated observations over time, or observation of multiple variables (or a mix of both). You may find data input may be simpler or some other applications may prefer the ‘wide’ format. However, many of R’s functions have been designed assuming you have ‘longer’ formatted data. This tutorial will help you efficiently transform your data shape regardless of original format.

Diagram illustrating the difference between a wide versus long layout of a data frame

Long and wide data frame layouts mainly affect readability. For +humans, the wide format is often more intuitive since we can often see +more of the data on the screen due to its shape. However, the long +format is more machine readable and is closer to the formatting of +databases. The ID variables in our data frames are similar to the fields +in a database and observed variables are like the database values.
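To make the two layouts concrete, here is a small sketch in base R with made-up temperature readings for two hypothetical sites. Both data frames hold the same four measurements:

```r
# Wide layout: one row per site, one column per year (values are made up)
wide <- data.frame(
  site = c("A", "B"),
  temp_2001 = c(10.1, 12.3),
  temp_2002 = c(10.4, 12.0)
)

# Long layout: one row per observation; site and year are the ID variables
long <- data.frame(
  site = c("A", "A", "B", "B"),
  year = c(2001, 2002, 2001, 2002),
  temp = c(10.1, 10.4, 12.3, 12.0)
)

dim(wide)  # fewer rows, more columns
dim(long)  # more rows, fewer columns
```

In the wide version the year is hidden inside the column names, while in the long version it is an ordinary variable that can be filtered, grouped, or plotted.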


Getting started +


First install the packages if you haven’t already done so (you +probably installed dplyr in the previous lesson):


R +

+install.packages("tidyr")
+install.packages("dplyr")

Load the packages


R +

+library("tidyr")
+library("dplyr")

First, let’s look at the structure of our original gapminder data frame:


R +

+str(gapminder)


'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...
+ +

Challenge 1 +


Is gapminder a purely long, purely wide, or some intermediate +format?

+ +

The original gapminder data.frame is in an intermediate format. It is not purely long since it has multiple observation variables (pop, lifeExp, gdpPercap).


Sometimes, as with the gapminder dataset, we have multiple types of +observed data. It is somewhere in between the purely ‘long’ and ‘wide’ +data formats. We have 3 “ID variables” (continent, +country, year) and 3 “Observation variables” +(pop,lifeExp,gdpPercap). This +intermediate format can be preferred despite not having ALL observations +in 1 column given that all 3 observation variables have different units. +There are few operations that would need us to make this data frame any +longer (i.e. 4 ID variables and 1 Observation variable).


While using many of the functions in R, which are often vector based, +you usually do not want to do mathematical operations on values with +different units. For example, using the purely long format, a single +mean for all of the values of population, life expectancy, and GDP would +not be meaningful since it would return the mean of values with 3 +incompatible units. The solution is that we first manipulate the data +either by grouping (see the lesson on dplyr), or we change +the structure of the data frame. Note: Some plotting +functions in R actually work better in the wide format data.


From wide to long format with pivot_longer() +


Until now, we’ve been using the nicely formatted original gapminder +dataset, but ‘real’ data (i.e. our own research data) will never be so +well organized. Here let’s start with the wide formatted version of the +gapminder dataset.


Download the wide version of the gapminder data from here and save it in your data +folder.


We’ll load the data file and look at it. Note: we don’t want our +continent and country columns to be factors, so we use the +stringsAsFactors argument for read.csv() to disable +that.


R +

+gap_wide <- read.csv("data/gapminder_wide.csv", stringsAsFactors = FALSE)
+str(gap_wide)


'data.frame':	142 obs. of  38 variables:
+ $ continent     : chr  "Africa" "Africa" "Africa" "Africa" ...
+ $ country       : chr  "Algeria" "Angola" "Benin" "Botswana" ...
+ $ gdpPercap_1952: num  2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num  3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num  2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num  3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num  4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num  4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num  5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num  5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num  5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num  4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num  5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num  6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num  43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num  45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num  48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num  51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num  54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num  58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num  61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num  65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num  67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num  69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num  71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num  72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num  9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num  10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num  11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num  12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num  14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num  17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num  20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num  23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num  26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num  29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : int  31287142 10866106 7026113 1630347 12251209 7021078 15929988 4048013 8835739 614382 ...
+ $ pop_2007      : int  33333216 12420476 8078314 1639131 14326203 8390505 17696293 4369038 10238807 710960 ...
Diagram illustrating the wide format of the gapminder data frame

To change this very wide data frame layout back to our nice, +intermediate (or longer) layout, we will use one of the two available +pivot functions from the tidyr package. To +convert from wide to a longer format, we will use the +pivot_longer() function. pivot_longer() makes +datasets longer by increasing the number of rows and decreasing the +number of columns, or ‘lengthening’ your observation variables into a +single variable.

Diagram illustrating how pivot longer reorganizes a data frame from a wide to long format

R +

+gap_long <- gap_wide %>%
+  pivot_longer(
+    cols = c(starts_with('pop'), starts_with('lifeExp'), starts_with('gdpPercap')),
+    names_to = "obstype_year", values_to = "obs_values"
+  )
+str(gap_long)


tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ continent   : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
+ $ country     : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
+ $ obstype_year: chr [1:5112] "pop_1952" "pop_1957" "pop_1962" "pop_1967" ...
+ $ obs_values  : num [1:5112] 9279525 10270856 11000948 12760499 14760787 ...

Here we have used piping syntax which is similar to what we were +doing in the previous lesson with dplyr. In fact, these are compatible +and you can use a mix of tidyr and dplyr functions by piping them +together.


We first provide to pivot_longer() a vector of column names that will be pivoted into longer format. We could type out all the observation variables, but as in the select() function (see dplyr lesson), we can use the starts_with() helper to select all variables that start with the desired character string. pivot_longer() also allows the alternative syntax of using the - symbol to identify which variables are not to be pivoted (i.e. ID variables).


The next arguments to pivot_longer() are names_to for naming the column that will contain the new ID variable (obstype_year) and values_to for naming the new amalgamated observation variable (obs_values). We supply these new column names as strings.

Diagram illustrating the long format of the gapminder data

R +

+gap_long <- gap_wide %>%
+  pivot_longer(
+    cols = c(-continent, -country),
+    names_to = "obstype_year", values_to = "obs_values"
+  )
+str(gap_long)


tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ continent   : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
+ $ country     : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
+ $ obstype_year: chr [1:5112] "gdpPercap_1952" "gdpPercap_1957" "gdpPercap_1962" "gdpPercap_1967" ...
+ $ obs_values  : num [1:5112] 2449 3014 2551 3247 4183 ...

That may seem trivial with this particular data frame, but sometimes +you have 1 ID variable and 40 observation variables with irregular +variable names. The flexibility is a huge time saver!


Now obstype_year actually contains 2 pieces of information, the observation type (pop, lifeExp, or gdpPercap) and the year. We can use the separate() function to split the character strings into multiple variables:


R +

+gap_long <- gap_long %>% separate(obstype_year, into = c('obs_type', 'year'), sep = "_")
+gap_long$year <- as.integer(gap_long$year)
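As an alternative sketch, pivot_longer() can split the column names while pivoting, which combines the pivot, separate(), and as.integer() steps into one call. This assumes a tidyr version that supports the names_sep and names_transform arguments, and it uses a made-up toy data frame rather than gap_wide:

```r
library(tidyr)
library(dplyr)

# Toy wide data frame with made-up values
toy_wide <- data.frame(
  country = c("A", "B"),
  pop_1952 = c(10, 20),
  pop_1957 = c(11, 21),
  lifeExp_1952 = c(40, 50),
  lifeExp_1957 = c(41, 51)
)

toy_long <- toy_wide %>%
  pivot_longer(
    cols = -country,
    # split each column name at "_" into two new ID variables
    names_to = c("obs_type", "year"),
    names_sep = "_",
    # convert the year piece from character to integer
    names_transform = list(year = as.integer),
    values_to = "obs_values"
  )

toy_long
```

The two-step version shown in the lesson is easier to follow when first learning; the one-step version is mainly a convenience once the idea is familiar.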
+ +

Challenge 2 +


Using gap_long, calculate the mean life expectancy, +population, and gdpPercap for each continent. Hint: use +the group_by() and summarize() functions we +learned in the dplyr lesson

+ +

R +

+gap_long %>% group_by(continent, obs_type) %>%
+   summarize(means=mean(obs_values))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.


# A tibble: 15 × 3
+# Groups:   continent [5]
+   continent obs_type       means
+   <chr>     <chr>          <dbl>
+ 1 Africa    gdpPercap     2194. 
+ 2 Africa    lifeExp         48.9
+ 3 Africa    pop        9916003. 
+ 4 Americas  gdpPercap     7136. 
+ 5 Americas  lifeExp         64.7
+ 6 Americas  pop       24504795. 
+ 7 Asia      gdpPercap     7902. 
+ 8 Asia      lifeExp         60.1
+ 9 Asia      pop       77038722. 
+10 Europe    gdpPercap    14469. 
+11 Europe    lifeExp         71.9
+12 Europe    pop       17169765. 
+13 Oceania   gdpPercap    18622. 
+14 Oceania   lifeExp         74.3
+15 Oceania   pop        8874672. 

From long to intermediate format with pivot_wider() +


It is always good to check work. So, let’s use the second +pivot function, pivot_wider(), to ‘widen’ our +observation variables back out. pivot_wider() is the +opposite of pivot_longer(), making a dataset wider by +increasing the number of columns and decreasing the number of rows. We +can use pivot_wider() to pivot or reshape our +gap_long to the original intermediate format or the widest +format. Let’s start with the intermediate format.


The pivot_wider() function takes names_from +and values_from arguments.


To names_from we supply the column name whose contents +will be pivoted into new output columns in the widened data frame. The +corresponding values will be added from the column named in the +values_from argument.
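A minimal sketch of how names_from and values_from work together, on a made-up long table (`toy_long` and its values are hypothetical; tidyr is assumed to be installed):

```r
library(tidyr)

# Hypothetical toy data in long format: one row per country and metric
toy_long <- data.frame(
  country    = c("A", "A", "B", "B"),
  obs_type   = c("pop", "lifeExp", "pop", "lifeExp"),
  obs_values = c(10, 40, 20, 50)
)

# Each distinct value in obs_type becomes a column,
# filled with the matching entries of obs_values
toy_wide <- pivot_wider(toy_long, names_from = obs_type, values_from = obs_values)
toy_wide
```

The four rows collapse to two (one per country), and the two metrics become the columns pop and lifeExp.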


R +

+gap_normal <- gap_long %>%
+  pivot_wider(names_from = obs_type, values_from = obs_values)
+dim(gap_normal)


[1] 1704    6

R +

+dim(gapminder)


[1] 1704    6

R +

+names(gap_normal)


[1] "continent" "country"   "year"      "gdpPercap" "lifeExp"   "pop"      

R +

+names(gapminder)


[1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"

Now we’ve got an intermediate data frame gap_normal with +the same dimensions as the original gapminder, but the +order of the variables is different. Let’s fix that before checking if +they are all.equal().


R +

+gap_normal <- gap_normal[, names(gapminder)]
+all.equal(gap_normal, gapminder)


[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                
+[3] "Component \"country\": 1704 string mismatches"                                         
+[4] "Component \"pop\": Mean relative difference: 1.634504"                                 
+[5] "Component \"continent\": 1212 string mismatches"                                       
+[6] "Component \"lifeExp\": Mean relative difference: 0.203822"                             
+[7] "Component \"gdpPercap\": Mean relative difference: 1.162302"                           

R +

+head(gap_normal)


# A tibble: 6 × 6
+  country  year      pop continent lifeExp gdpPercap
+  <chr>   <int>    <dbl> <chr>       <dbl>     <dbl>
+1 Algeria  1952  9279525 Africa       43.1     2449.
+2 Algeria  1957 10270856 Africa       45.7     3014.
+3 Algeria  1962 11000948 Africa       48.3     2551.
+4 Algeria  1967 12760499 Africa       51.4     3247.
+5 Algeria  1972 14760787 Africa       54.5     4183.
+6 Algeria  1977 17152804 Africa       58.0     4910.

R +

+head(gapminder)


      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134

We’re almost there: the original was sorted by country, then year.


R +

+gap_normal <- gap_normal %>% arrange(country, year)
+all.equal(gap_normal, gapminder)


[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                

That’s great! We’ve gone from the longest format back to the +intermediate and we didn’t introduce any errors in our code.
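The two remaining messages concern only the class attribute: judging by the output, gap_normal is a tibble while gapminder is a plain data frame. A minimal sketch with made-up data (`plain_df` is hypothetical; the tibble package is assumed to be installed) shows that converting the tibble back removes the difference:

```r
library(tibble)

# Hypothetical toy data: the same content as a data frame and as a tibble
plain_df <- data.frame(x = 1:3, y = c("a", "b", "c"))
as_tbl   <- as_tibble(plain_df)

class(plain_df)  # a single class
class(as_tbl)    # three classes, hence the "Lengths (3, 1) differ" style message

# Compare the contents only, by converting the tibble to a plain data frame
all.equal(as.data.frame(as_tbl), plain_df)
```

Applied to the lesson data, the analogous call would be all.equal(as.data.frame(gap_normal), gapminder).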


Now let’s convert the long format all the way back to the wide. In the wide format, we will keep country and continent as ID variables and pivot the observations across the 3 metrics (pop, lifeExp, gdpPercap) and time (year). First we need to create appropriate labels for all our new variables (time*metric combinations) and we also need to unify our ID variables to simplify the process of defining gap_wide.


R +

+gap_temp <- gap_long %>% unite(var_ID, continent, country, sep = "_")


tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ var_ID    : chr [1:5112] "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" ...
+ $ obs_type  : chr [1:5112] "gdpPercap" "gdpPercap" "gdpPercap" "gdpPercap" ...
+ $ year      : int [1:5112] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ obs_values: num [1:5112] 2449 3014 2551 3247 4183 ...

R +

+gap_temp <- gap_long %>%
+    unite(ID_var, continent, country, sep = "_") %>%
+    unite(var_names, obs_type, year, sep = "_")


tibble [5,112 × 3] (S3: tbl_df/tbl/data.frame)
+ $ ID_var    : chr [1:5112] "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" ...
+ $ var_names : chr [1:5112] "gdpPercap_1952" "gdpPercap_1957" "gdpPercap_1962" "gdpPercap_1967" ...
+ $ obs_values: num [1:5112] 2449 3014 2551 3247 4183 ...

Using unite() we now have a single ID variable, which is a combination of continent and country, and we have defined variable names. We’re now ready to pipe in pivot_wider().


R +

+gap_wide_new <- gap_long %>%
+  unite(ID_var, continent, country, sep = "_") %>%
+  unite(var_names, obs_type, year, sep = "_") %>%
+  pivot_wider(names_from = var_names, values_from = obs_values)


tibble [142 × 37] (S3: tbl_df/tbl/data.frame)
+ $ ID_var        : chr [1:142] "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
+ $ gdpPercap_1952: num [1:142] 2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num [1:142] 3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num [1:142] 2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num [1:142] 3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num [1:142] 4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num [1:142] 4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num [1:142] 5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num [1:142] 5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num [1:142] 5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num [1:142] 4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num [1:142] 5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num [1:142] 6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num [1:142] 43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num [1:142] 45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num [1:142] 48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num [1:142] 51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num [1:142] 54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num [1:142] 58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num [1:142] 61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num [1:142] 65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num [1:142] 67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num [1:142] 69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num [1:142] 71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num [1:142] 72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num [1:142] 9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num [1:142] 10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num [1:142] 11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num [1:142] 12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num [1:142] 14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num [1:142] 17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num [1:142] 20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num [1:142] 23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num [1:142] 26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num [1:142] 29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : num [1:142] 31287142 10866106 7026113 1630347 12251209 ...
+ $ pop_2007      : num [1:142] 33333216 12420476 8078314 1639131 14326203 ...
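The unite() call that builds var_names is one route to these column names. pivot_wider() can also take several columns in names_from and paste them together itself. The sketch below compares both routes on a made-up table (`toy_long` is hypothetical; tidyr is assumed to be installed):

```r
library(tidyr)

# Hypothetical toy data in long format
toy_long <- data.frame(
  country    = c("A", "A", "A", "A", "B", "B", "B", "B"),
  obs_type   = rep(c("pop", "pop", "lifeExp", "lifeExp"), times = 2),
  year       = rep(c(1952L, 1957L), times = 4),
  obs_values = c(10, 11, 40, 41, 20, 21, 50, 51)
)

# Route 1: build the new column names with unite(), then widen
toy_united <- unite(toy_long, var_names, obs_type, year, sep = "_")
via_unite  <- pivot_wider(toy_united, names_from = var_names, values_from = obs_values)

# Route 2: let pivot_wider() combine the two columns (joined with "_" by default)
via_names_from <- pivot_wider(
  toy_long,
  names_from = c(obs_type, year),
  values_from = obs_values
)

names(via_unite)
names(via_names_from)
```

Both routes give one row per country and one column per metric and year combination.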
+ +

Challenge 3 +


Take this 1 step further and create a gap_ludicrously_wide format data frame by pivoting over countries, year and the 3 metrics. Hint: this new data frame should only have 5 rows.

+ +

R +

+gap_ludicrously_wide <- gap_long %>%
+   unite(var_names, obs_type, year, country, sep = "_") %>%
+   pivot_wider(names_from = var_names, values_from = obs_values)

Now we have a great ‘wide’ format data frame, but the ID_var could be more usable. Let’s separate it into 2 variables with separate().


R +

+gap_wide_betterID <- separate(gap_wide_new, ID_var, c("continent", "country"), sep="_")
+gap_wide_betterID <- gap_long %>%
+    unite(ID_var, continent, country, sep = "_") %>%
+    unite(var_names, obs_type, year, sep = "_") %>%
+    pivot_wider(names_from = var_names, values_from = obs_values) %>%
+    separate(ID_var, c("continent","country"), sep = "_")


tibble [142 × 38] (S3: tbl_df/tbl/data.frame)
+ $ continent     : chr [1:142] "Africa" "Africa" "Africa" "Africa" ...
+ $ country       : chr [1:142] "Algeria" "Angola" "Benin" "Botswana" ...
+ $ gdpPercap_1952: num [1:142] 2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num [1:142] 3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num [1:142] 2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num [1:142] 3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num [1:142] 4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num [1:142] 4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num [1:142] 5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num [1:142] 5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num [1:142] 5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num [1:142] 4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num [1:142] 5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num [1:142] 6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num [1:142] 43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num [1:142] 45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num [1:142] 48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num [1:142] 51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num [1:142] 54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num [1:142] 58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num [1:142] 61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num [1:142] 65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num [1:142] 67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num [1:142] 69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num [1:142] 71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num [1:142] 72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num [1:142] 9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num [1:142] 10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num [1:142] 11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num [1:142] 12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num [1:142] 14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num [1:142] 17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num [1:142] 20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num [1:142] 23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num [1:142] 26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num [1:142] 29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : num [1:142] 31287142 10866106 7026113 1630347 12251209 ...
+ $ pop_2007      : num [1:142] 33333216 12420476 8078314 1639131 14326203 ...

R +

+all.equal(gap_wide, gap_wide_betterID)


[1] "Attributes: < Component \"class\": Lengths (1, 3) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                

There and back again!


Other great resources +

+ +

Keypoints +

  • Use the tidyr package to change the layout of data +frames.
  • +
  • Use pivot_longer() to go from wide to longer +layout.
  • +
  • Use pivot_wider() to go from long to wider layout.
  • +
+ + +
+ + + diff --git a/instructor/15-knitr-markdown.html b/instructor/15-knitr-markdown.html new file mode 100644 index 000000000..a7c9df326 --- /dev/null +++ b/instructor/15-knitr-markdown.html @@ -0,0 +1,940 @@ + +R for Reproducible Scientific Analysis: Producing Reports With knitr +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Producing Reports With knitr


Last updated on 2023-10-26

+ + + +

Estimated time 75 minutes

+ +
+ +
+ + + +




  • How can I integrate software and reports?
  • +


  • Understand the value of writing reproducible reports
  • +
  • Learn how to recognise and compile the basic components of an R +Markdown file
  • +
  • Become familiar with R code chunks, and understand their purpose, +structure and options
  • +
  • Demonstrate the use of inline chunks for weaving R outputs into text +blocks, for example when discussing the results of some +calculations
  • +
  • Be aware of alternative output formats to which an R Markdown file +can be exported
  • +

Data analysis reports +


Data analysts tend to write a lot of reports, describing their +analyses and results, for their collaborators or to document their work +for future reference.


Many new users begin by first writing a single R script containing +all of their work, and then share the analysis by emailing the script +and various graphs as attachments. But this can be cumbersome, requiring +a lengthy discussion to explain which attachment was which result.


Writing formal reports with Word or LaTeX can simplify this +process by incorporating both the analysis report and output graphs into +a single document. But tweaking formatting to make figures look correct +and fixing obnoxious page breaks can be tedious and lead to a lengthy +“whack-a-mole” game of fixing new mistakes resulting from a single +formatting change.


Creating a report as a web page (which is an html file) using R Markdown makes things easier. The report can be one long stream, so tall figures that wouldn’t ordinarily fit on one page can be kept at full size and are easier to read, since the reader can simply keep scrolling. Additionally, the formatting of an R Markdown document is simple and easy to modify, allowing you to spend more time on your analyses instead of writing reports.


Literate programming +


Ideally, such analysis reports are reproducible documents: +If an error is discovered, or if some additional subjects are added to +the data, you can just re-compile the report and get the new or +corrected results rather than having to reconstruct figures, paste them +into a Word document, and hand-edit various detailed results.


The key R package here is knitr. It allows you +to create a document that is a mixture of text and chunks of code. When +the document is processed by knitr, chunks of code will be +executed, and graphs or other results will be inserted into the final +document.


This sort of idea has been called “literate programming”.


knitr allows you to mix basically any type of text with +code from different programming languages, but we recommend that you use +R Markdown, which mixes Markdown with R. Markdown is a light-weight +mark-up language for creating web pages.


Creating an R Markdown file +


Within RStudio, click File → New File → R Markdown and you’ll get a +dialog box like this:

Screenshot of the New R Markdown file dialogue box in RStudio

You can stick with the default (HTML output), but give it a +title.


Basic components of R Markdown +


The initial chunk of text (header) contains instructions for R to +specify what kind of document will be created, and the options chosen. +You can use the header to give your document a title, author, date, and +tell it what type of output you want to produce. In this case, we’re +creating an html document.

+---
+title: "Initial R Markdown document"
+author: "Karl Broman"
+date: "April 23, 2015"
+output: html_document
+---

You can delete any of those fields if you don’t want them included. +The double-quotes aren’t strictly necessary in this case. +They’re mostly needed if you want to include a colon in the title.


RStudio creates the document with some example text to get you +started. Note below that there are chunks like


These are chunks of R code that will be executed by +knitr and replaced by their results. More on this +later.


Markdown +


Markdown is a system for writing web pages by marking up the text +much as you would in an email rather than writing html code. The +marked-up text gets converted to html, replacing the marks with +the proper html code.


For now, let’s delete all of the stuff that’s there and write a bit +of markdown.


You make things bold using two asterisks, like this: +**bold**, and you make things italics by using +underscores, like this: _italics_.


You can make a bulleted list by writing a list with hyphens or +asterisks with a space between the list and other text, like this:

A list:
+* bold with double-asterisks
+* italics with underscores
+* code-type font with backticks

or like this:

A second list:
+- bold with double-asterisks
+- italics with underscores
+- code-type font with backticks

Each will appear as:

  • bold with double-asterisks
  • +
  • italics with underscores
  • +
  • code-type font with backticks
  • +

You can use whatever method you prefer, but be consistent. +This maintains the readability of your code.


You can make a numbered list by just using numbers. You can even use +the same number over and over if you want:

1. bold with double-asterisks
+1. italics with underscores
+1. code-type font with backticks

This will appear as:

  1. bold with double-asterisks
  2. italics with underscores
  3. code-type font with backticks

You can make section headers of different sizes by initiating a line +with some number of # symbols:

# Title
+## Main section
+### Sub-section
+#### Sub-sub section

You compile the R Markdown document to an html webpage by +clicking the “Knit” button in the upper-left.

+ +

Challenge 1 +


Create a new R Markdown document. Delete all of the R code chunks and +write a bit of Markdown (some sections, some italicized text, and an +itemized list).


Convert the document to a webpage.

+ +

In RStudio, select File > New file > R Markdown…


Delete the placeholder text and add the following:

# Introduction
+## Background on Data
+This report uses the *gapminder* dataset, which has columns that include:
+* country
+* continent
+* year
+* lifeExp
+* pop
+* gdpPercap
+## Background on Methods

Then click the ‘Knit’ button on the toolbar to generate an html +document (webpage).


A bit more Markdown +


You can make a hyperlink like this: +[Carpentries Home Page](https://carpentries.org/).


You can include an image file like this: +![The Carpentries Logo](https://carpentries.org/assets/img/TheCarpentries.svg)


You can do subscripts (e.g., F₂) with F~2~ and superscripts (e.g., F²) with F^2^.


If you know how to write equations in LaTeX, you can use +$ $ and $$ $$ to insert math equations, like +$E = mc^2$ and

$$y = \mu + \sum_{i=1}^p \beta_i x_i + \epsilon$$

You can review Markdown syntax by navigating to the “Markdown Quick +Reference” under the “Help” field in the toolbar at the top of +RStudio.


R code chunks +


The real power of Markdown comes from mixing markdown with chunks of code. This is R Markdown. When processed, the R code will be executed; if it produces figures, the figures will be inserted in the final document.


The main code chunks look like this:

+```{r load_data}

That is, you place a chunk of R code between ```{r chunk_name} and ```. You should give each chunk a unique name: the names help you track down errors and, if any graphs are produced, the file names are based on the name of the code chunk that produced them. You can create code chunks quickly in RStudio using the shortcuts Ctrl+Alt+I on Windows and Linux, or Cmd+Option+I on Mac.

+ +

Challenge 2 +


Add code chunks to:

  • Load the ggplot2 package
  • +
  • Read the gapminder data
  • +
  • Create a plot
  • +
+ +
+```{r load-ggplot2}
+```{r read-gapminder-data}
+```{r make-plot}
+plot(lifeExp ~ year, data = gapminder)

How things get compiled +


When you press the “Knit” button, the R Markdown document is +processed by knitr +and a plain Markdown document is produced (as well as, potentially, a +set of figure files): the R code is executed and replaced by both the +input and the output; if figures are produced, links to those figures +are included.


The Markdown and figure documents are then processed by the tool pandoc, which converts the +Markdown file into an html file, with the figures embedded.
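The first of these two steps can be reproduced by hand. The sketch below assumes the knitr package is installed; it writes a tiny made-up R Markdown file (demo.Rmd is a hypothetical name) to a temporary folder and calls knitr::knit() on it. The resulting Markdown file contains both the code and its output, ready to be handed to pandoc.

```r
library(knitr)

# Build a tiny R Markdown file in a temporary folder (hypothetical content)
fence    <- strrep("`", 3)  # three backticks, the chunk delimiter
rmd_file <- file.path(tempdir(), "demo.Rmd")
md_file  <- file.path(tempdir(), "demo.md")
writeLines(
  c("# Demo", "", paste0(fence, "{r add-numbers}"), "1 + 1", fence),
  rmd_file
)

# Step 1 of "Knit": knitr runs the chunk and writes plain Markdown
knit(rmd_file, output = md_file, quiet = TRUE)

# The Markdown now holds the code and its result
md_lines <- readLines(md_file)
cat(md_lines, sep = "\n")
```

In everyday use the “Knit” button does both steps for you, so calling knit() directly is mainly useful for seeing what happens in between.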


Chunk options +


There are a variety of options to affect how the code chunks are +treated. Here are some examples:

  • Use echo=FALSE to avoid having the code itself +shown.
  • +
  • Use results="hide" to avoid having any results +printed.
  • +
  • Use eval=FALSE to have the code shown but not +evaluated.
  • +
  • Use warning=FALSE and message=FALSE to +hide any warnings or messages produced.
  • +
  • Use fig.height and fig.width to control +the size of the figures produced (in inches).
  • +

So you might write:

+```{r load_libraries, echo=FALSE, message=FALSE}

Often there will be particular options that you’ll want to use +repeatedly; for this, you can set global chunk options, like +so:

+```{r global_options, echo=FALSE}
+knitr::opts_chunk$set(fig.path="Figs/", message=FALSE, warning=FALSE,
+                      echo=FALSE, results="hide", fig.width=11)

The fig.path option defines where the figures will be +saved. The / here is really important; without it, the +figures would be saved in the standard place but just with names that +begin with Figs.


If you have multiple R Markdown files in a common directory, you +might want to use fig.path to define separate prefixes for +the figure file names, like fig.path="Figs/cleaning-" and +fig.path="Figs/analysis-".

+ +

Challenge 3 +


Use chunk options to control the size of a figure and to hide the +code.

+ +
+```{r echo = FALSE, fig.width = 3}

You can review all of the R chunk options by navigating +to the “R Markdown Cheat Sheet” under the “Cheatsheets” section of the +“Help” field in the toolbar at the top of RStudio.


Inline R code +


You can make every number in your report reproducible. Use +`r and ` for an in-line code chunk, like so: +`r round(some_value, 2)`. The code will be executed and +replaced with the value of the result.


Don’t let these in-line chunks get split across lines.


Perhaps precede the paragraph with a larger code chunk that does calculations and defines variables, with include=FALSE for that larger chunk (the code still runs, but neither the code nor its output appears in the document, much like combining echo=FALSE and results="hide").


Rounding can produce differences in output in such situations. You +may want 2.0, but round(2.03, 1) will give +just 2.


The myround +function in the R/broman +package handles this.
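If you would rather stay within base R, a minimal sketch of the same idea uses format() or sprintf(), which return character strings that keep the trailing zero. This is an illustration of the rounding issue, not a description of how myround is implemented.

```r
x <- 2.03

round(x, 1)                              # prints 2: the trailing zero is dropped
padded_format  <- format(round(x, 1), nsmall = 1)  # "2.0"
padded_sprintf <- sprintf("%.1f", x)               # "2.0"

padded_format
padded_sprintf
```

Because both results are strings, they are suited to in-line reporting rather than further arithmetic.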

+ +

Challenge 4 +


Try out a bit of in-line R code.

+ +

Here’s some inline code to determine that 2 + 2 = 4.


Other output options +


You can also convert R Markdown to a PDF or a Word document. Click +the little triangle next to the “Knit” button to get a drop-down menu. +Or you could put pdf_document or word_document +in the initial header of the file.

+ +

Tip: Creating PDF documents +


Creating .pdf documents may require installation of some extra +software. The R package tinytex provides some tools to help +make this process easier for R users. With tinytex +installed, run tinytex::install_tinytex() to install the +required software (you’ll only need to do this once) and then when you +knit to pdf tinytex will automatically detect and install +any additional LaTeX packages that are needed to produce the pdf +document. Visit the tinytex +website for more information.

+ +

Tip: Visual markdown editing in RStudio +


RStudio versions 1.4 and later include visual markdown editing mode. +In visual editing mode, markdown expressions (like +**bold words**) are transformed to the formatted appearance +(bold words) as you type. This mode also includes a +toolbar at the top with basic formatting buttons, similar to what you +might see in common word processing software programs. You can turn +visual editing on and off by pressing the button in the top right corner of your +R Markdown document.


Resources +

+ +

Keypoints +

  • Mix reporting written in R Markdown with software written in R.
  • +
  • Specify chunk options to control formatting.
  • +
  • Use knitr to convert these documents into PDF and other +formats.
  • +
+ + +
+ + + diff --git a/instructor/16-wrap-up.html b/instructor/16-wrap-up.html new file mode 100644 index 000000000..8313c0829 --- /dev/null +++ b/instructor/16-wrap-up.html @@ -0,0 +1,588 @@ + +R for Reproducible Scientific Analysis: Writing Good Software +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Writing Good Software


Last updated on 2023-10-26

+ + + +

Estimated time 15 minutes

+ +
+ +
+ + + +




  • How can I write software that other people can use?
  • +


  • Describe best practices for writing R and explain the justification +for each.
  • +

Structure your project folder +


Keep your project folder structured, organized and tidy, by creating +subfolders for your code files, manuals, data, binaries, output plots, +etc. It can be done completely manually, or with the help of RStudio’s +New Project functionality, or a designated package, such as +ProjectTemplate.

+ +

Tip: ProjectTemplate - a possible +solution +


One way to automate the management of projects is to install the +third-party package, ProjectTemplate. This package will set +up an ideal directory structure for project management. This is very +useful as it enables you to have your analysis pipeline/workflow +organised and structured. Together with the default RStudio project +functionality and Git you will be able to keep track of your work as +well as be able to share your work with collaborators.

  1. Install ProjectTemplate.
  2. Load the library
  3. Initialise the project:

R +

+create.project("../my_project_2", merge.strategy = "allow.non.conflict")

For more information on ProjectTemplate and its functionality visit +the home page ProjectTemplate


Make code readable +


The most important part of writing code is making it readable and +understandable. You want someone else to be able to pick up your code +and be able to understand what it does: more often than not this someone +will be you 6 months down the line, who will otherwise be cursing +past-self.


Documentation: tell us what and why, not how +


When you first start out, your comments will often describe what a +command does, since you’re still learning yourself and it can help to +clarify concepts and remind you later. However, these comments aren’t +particularly useful later on when you don’t remember what problem your +code is trying to solve. Try to also include comments that tell you +why you’re solving a problem, and what problem that +is. The how can come after that: it’s an implementation detail +you ideally shouldn’t have to worry about.


Keep your code modular +


Our recommendation is that you should separate your functions from +your analysis scripts, and store them in a separate file that you +source when you open the R session in your project. This +approach is nice because it leaves you with an uncluttered analysis +script, and a repository of useful functions that can be loaded into any +analysis script in your project. It also lets you group related +functions together easily.
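As a minimal sketch of this layout, the example below writes a small functions file and then loads it with source(). The file name functions.R and the function fahr_to_kelvin() are illustrative choices, and the file is written to a temporary folder only so that the example is self-contained; in a real project it would live in your project directory.

```r
# Write a small, made-up functions file to a temporary folder
functions_file <- file.path(tempdir(), "functions.R")
writeLines(
  c(
    "# Convert a temperature from Fahrenheit to Kelvin",
    "fahr_to_kelvin <- function(temp) {",
    "  ((temp - 32) * (5 / 9)) + 273.15",
    "}"
  ),
  functions_file
)

# In the analysis script: load the functions first, then use them
source(functions_file)
freezing_kelvin <- fahr_to_kelvin(32)
freezing_kelvin
```

The analysis script itself then contains only the call to source() and the analysis steps.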


Break down problem into bite size pieces +


When you first start out, problem solving and function writing can be +daunting tasks, and hard to separate from code inexperience. Try to +break down your problem into digestible chunks and worry about the +implementation details later: keep breaking down the problem into +smaller and smaller functions until you reach a point where you can code +a solution, and build back up from there.


Know that your code is doing the right thing +


Make sure to test your functions!
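A test can be as simple as checking a function against values you already know. The sketch below uses base R’s stopifnot() on an illustrative function (fahr_to_kelvin() is a made-up example here); dedicated packages such as testthat offer more structure, but the principle is the same.

```r
# Illustrative function to be tested
fahr_to_kelvin <- function(temp) {
  ((temp - 32) * (5 / 9)) + 273.15
}

# Known reference points: water freezes at 32 F (273.15 K)
# and boils at 212 F (373.15 K)
stopifnot(
  isTRUE(all.equal(fahr_to_kelvin(32), 273.15)),
  isTRUE(all.equal(fahr_to_kelvin(212), 373.15))
)
```

stopifnot() stays silent when every check passes and stops with an error otherwise, so the checks can be re-run whenever the function changes.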


Don’t repeat yourself +


Functions enable easy reuse within a project. If you see blocks of +similar lines of code through your project, those are usually candidates +for being moved into functions.


If your calculations are performed through a series of functions, then the project becomes more modular and easier to change. This is especially true of functions for which a particular input always gives a particular output.
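As a small sketch of moving repeated code into a function, suppose the same formula is needed for several sets of measurements. The data and the function name relative_range() below are invented for illustration.

```r
# Hypothetical measurements from three sites
site_a <- c(2.1, 3.4, 2.9)
site_b <- c(5.0, 4.2, 4.7)
site_c <- c(1.2, 1.9, 1.4)

# Instead of typing (max(x) - min(x)) / mean(x) once per site,
# define the calculation a single time...
relative_range <- function(x) {
  (max(x) - min(x)) / mean(x)
}

# ...and apply it to every site
results <- sapply(list(a = site_a, b = site_b, c = site_c), relative_range)
results
```

If the formula ever needs to change, there is now exactly one place to edit.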


Remember to be stylish +


Apply consistent style to your code.

+ +

Keypoints +

  • Keep your project folder structured, organized and tidy.
  • +
  • Document what and why, not how.
  • +
  • Break programs into short single-purpose functions.
  • +
  • Write re-runnable tests.
  • +
  • Don’t repeat yourself.
  • +
  • Be consistent in naming, indentation, and other aspects of +style.
  • +
+ + +
+ + + diff --git a/instructor/404.html b/instructor/404.html new file mode 100644 index 000000000..fc2ef6605 --- /dev/null +++ b/instructor/404.html @@ -0,0 +1,451 @@ + +R for Reproducible Scientific Analysis: Page not found +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Page not found

+ +

Our apologies! +


We cannot seem to find the page you are looking for. Here are some +tips that may help:

  1. try going back to the previous page or
  2. navigate to any other page using the navigation bar on the left.
  3. if the URL ends with /index.html, try removing that.
  4. head over to the home page of this lesson

If you came here from a link in this lesson, please contact the +lesson maintainers using the links at the foot of this page.

+ + +
+ + + diff --git a/instructor/CODE_OF_CONDUCT.html b/instructor/CODE_OF_CONDUCT.html new file mode 100644 index 000000000..2df159c96 --- /dev/null +++ b/instructor/CODE_OF_CONDUCT.html @@ -0,0 +1,451 @@ + +R for Reproducible Scientific Analysis: Contributor Code of Conduct +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Contributor Code of Conduct


Last updated on 2023-10-26

+ + + + + +
+ +
+ + + +

As contributors and maintainers of this project, we pledge to follow The Carpentries Code of Conduct.


Instances of abusive, harassing, or otherwise unacceptable behavior +may be reported by following our reporting +guidelines.

+ + + +
+ + +
+ + + diff --git a/instructor/LICENSE.html b/instructor/LICENSE.html new file mode 100644 index 000000000..3e3bc679a --- /dev/null +++ b/instructor/LICENSE.html @@ -0,0 +1,502 @@ + +R for Reproducible Scientific Analysis: Licenses +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +



Last updated on 2023-10-26

+ + + + + +
+ +
+ + + +

Instructional Material +


All Carpentries (Software Carpentry, Data Carpentry, and Library +Carpentry) instructional material is made available under the Creative Commons +Attribution license. The following is a human-readable summary of +(and not a substitute for) the full legal +text of the CC BY 4.0 license.


You are free:

  • to Share—copy and redistribute the material in any +medium or format
  • +
  • to Adapt—remix, transform, and build upon the +material
  • +

for any purpose, even commercially.


The licensor cannot revoke these freedoms as long as you follow the +license terms.


Under the following terms:

  • Attribution—You must give appropriate credit +(mentioning that your work is derived from work that is Copyright (c) +The Carpentries and, where practical, linking to https://carpentries.org/), provide a link to the +license, and indicate if changes were made. You may do so in any +reasonable manner, but not in any way that suggests the licensor +endorses you or your use.

  • +
  • No additional restrictions—You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

With the understanding that:


  • You do not have to comply with the license for elements of the +material in the public domain or where your use is permitted by an +applicable exception or limitation.
  • +
  • No warranties are given. The license may not give you all of the +permissions necessary for your intended use. For example, other rights +such as publicity, privacy, or moral rights may limit how you use the +material.
  • +

Software +


Except where otherwise noted, the example programs and other software +provided by The Carpentries are made available under the OSI-approved MIT +license.


Permission is hereby granted, free of charge, to any person obtaining +a copy of this software and associated documentation files (the +“Software”), to deal in the Software without restriction, including +without limitation the rights to use, copy, modify, merge, publish, +distribute, sublicense, and/or sell copies of the Software, and to +permit persons to whom the Software is furnished to do so, subject to +the following conditions:


The above copyright notice and this permission notice shall be +included in all copies or substantial portions of the Software.




Trademark +


“The Carpentries”, “Software Carpentry”, “Data Carpentry”, and +“Library Carpentry” and their respective logos are registered trademarks +of Community Initiatives.

+ + +
+ + + diff --git a/instructor/aio.html b/instructor/aio.html new file mode 100644 index 000000000..fcb0086f6 --- /dev/null +++ b/instructor/aio.html @@ -0,0 +1,12669 @@ + + + + + +R for Reproducible Scientific Analysis: All in One View + + + + + + + + + + + +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + + +
+ + +

Content from Introduction to R and RStudio


Last updated on 2023-10-26


Estimated time 55 minutes

+ +




  • How to find your way around RStudio?
  • +
  • How to interact with R?
  • +
  • How to manage your environment?
  • +
  • How to install packages?
  • +


  • Describe the purpose and use of each pane in the RStudio IDE
  • +
  • Locate buttons and options in the RStudio IDE
  • +
  • Define a variable
  • +
  • Assign data to a variable
  • +
  • Manage a workspace in an interactive R session
  • +
  • Use mathematical and comparison operators
  • +
  • Call functions
  • +
  • Manage packages
  • +

Motivation +


Science is a multi-step process: once you’ve designed an experiment +and collected data, the real fun begins! This lesson will teach you how +to start this process using R and RStudio. We will begin with raw data, +perform exploratory analyses, and learn how to plot results graphically. +This example starts with a dataset from gapminder.org containing population +information for many countries through time. Can you read the data into +R? Can you plot the population for Senegal? Can you calculate the +average income for countries on the continent of Asia? By the end of +these lessons you will be able to do things like plot the populations +for all of these countries in under a minute!


Before Starting The Workshop +


Please ensure you have the latest version of R and RStudio installed +on your machine. This is important, as some packages used in the +workshop may not install correctly (or at all) if R is not up to +date.


Introduction to RStudio +


Welcome to the R portion of the Software Carpentry workshop.


Throughout this lesson, we’re going to teach you some of the +fundamentals of the R language as well as some best practices for +organizing code for scientific projects that will make your life +easier.


We’ll be using RStudio: a free, open-source R Integrated Development +Environment (IDE). It provides a built-in editor, works on all platforms +(including on servers) and provides many advantages such as integration +with version control and project management.


Basic layout


When you first open RStudio, you will be greeted by three panels:

  • The interactive R console/Terminal (entire left)
  • +
  • Environment/History/Connections (tabbed in upper right)
  • +
  • Files/Plots/Packages/Help/Viewer (tabbed in lower right)
  • +
RStudio layout

Once you open files, such as R scripts, an editor panel will also +open in the top left.

RStudio layout with .R file open
+ +

R scripts +


Any commands that you write in the R console can be saved to a file to be run again later. Files containing R code to be run in this way are called R scripts. R scripts have .R at the end of their names to let you know what they are.


Workflow within RStudio +


There are two main ways one can work within RStudio:

  1. Test and play within the interactive R console then copy code into a +.R file to run later.
  2. +
  • This works well when doing small tests and initially starting +off.
  • +
  • It quickly becomes laborious
  • +
  2. Start writing in a .R file and use RStudio’s shortcut keys for the Run command to push the current line, selected lines or modified lines to the interactive R console.
  2. +
  • This is a great way to start; all your code is saved for later
  • +
  • You will be able to run the file you create from within RStudio or +using R’s source() function.
  • +
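As a minimal sketch of the second workflow, the snippet below writes a tiny script to a temporary file and runs it with `source()`. The file contents and the variable names `total` and `doubled` are made up for illustration; in practice the script would be a `.R` file saved in your project.

```r
# Write a two-line script to a temporary file (a stand-in for a real .R file).
script_path <- tempfile(fileext = ".R")
writeLines(c("total <- 1 + 100", "doubled <- total * 2"), script_path)

# Run every line of the script, as if it had been typed into the console.
source(script_path)

total    # 101
doubled  # 202
```

After `source()` finishes, the variables created by the script are available in your session, just as if you had run each line by hand.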
+ +

Tip: Running segments of your code +


RStudio offers you great flexibility in running code from within the +editor window. There are buttons, menu choices, and keyboard shortcuts. +To run the current line, you can

  1. click on the Run button above the editor panel, or
  2. select “Run Lines” from the “Code” menu, or
  3. hit Ctrl+Return in Windows or Linux or ⌘+Return on OS X. (This shortcut can also be seen by hovering the mouse over the button.)

To run a block of code, select it and then Run. If you have modified a line of code within a block of code you have just run, there is no need to reselect the section and Run; you can use the next button along, Re-run the previous region. This will run the previous code block including the modifications you have made.

Introduction to R +


Much of your time in R will be spent in the R interactive console. +This is where you will run all of your code, and can be a useful +environment to try out ideas before adding them to an R script file. +This console in RStudio is the same as the one you would get if you +typed in R in your command-line environment.


The first thing you will see in the R interactive session is a bunch +of information, followed by a “>” and a blinking cursor. In many ways +this is similar to the shell environment you learned about during the +shell lessons: it operates on the same idea of a “Read, evaluate, print +loop”: you type in commands, R tries to execute them, and then returns a +result.


Using R as a calculator +


The simplest thing you could do with R is to do arithmetic:


R +

+1 + 100


[1] 101

And R will print out the answer, with a preceding “[1]”. [1] is the +index of the first element of the line being printed in the console. For +more information on indexing vectors, see Episode +6: Subsetting Data.


If you type in an incomplete command, R will wait for you to complete it. If you are familiar with Unix Shell’s bash, you may recognize this behavior from bash.


R +

> 1 +



Any time you hit return and the R session shows a “+” instead of a +“>”, it means it’s waiting for you to complete the command. If you +want to cancel a command you can hit Esc and RStudio will +give you back the “>” prompt.

+ +

Tip: Canceling commands +


If you’re using R from the command line instead of from within +RStudio, you need to use Ctrl+C instead of +Esc to cancel the command. This applies to Mac users as +well!


Canceling a command isn’t only useful for killing incomplete +commands: you can also use it to tell R to stop running code (for +example if it’s taking much longer than you expect), or to get rid of +the code you’re currently writing.


When using R as a calculator, the order of operations is the same as +you would have learned back in school.


From highest to lowest precedence:

  • Parentheses: (, )
  • Exponents: ^ or **
  • Multiply and divide: * and / (equal precedence, evaluated left to right)
  • Add and subtract: + and - (equal precedence, evaluated left to right)

R +

+3 + 5 * 2


[1] 13

Use parentheses to group operations in order to force the order of +evaluation if it differs from the default, or to make clear what you +intend.


R +

+(3 + 5) * 2


[1] 16

This can get unwieldy when not needed, but clarifies your intentions. +Remember that others may later read your code.


R +

+(3 + (5 * (2 ^ 2))) # hard to read
+3 + 5 * 2 ^ 2       # clear, if you remember the rules
+3 + 5 * (2 ^ 2)     # if you forget some rules, this might help

The text after each line of code is called a “comment”. Anything that +follows after the hash (or octothorpe) symbol # is ignored +by R when it executes code.


Really small or large numbers get a scientific notation:


R +

+2/10000


[1] 2e-04

Which is shorthand for “multiplied by 10^XX”. So +2e-4 is shorthand for 2 * 10^(-4).


You can write numbers in scientific notation too:


R +

+5e3  # Note the lack of minus here


[1] 5000
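If scientific notation is hard to read, `format()` can switch between the two ways of displaying the same number. This is only a small sketch; the variable name `small` is chosen for illustration.

```r
small <- 2 / 10000

# Show the same value in fixed and in scientific notation.
format(small, scientific = FALSE)  # "0.0002"
format(5e3, scientific = TRUE)     # "5e+03"

# 2e-4 and 2 * 10^(-4) describe the same number.
isTRUE(all.equal(small, 2 * 10^(-4)))  # TRUE
```

Only the printed form changes; the stored value is the same in both cases.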

Mathematical functions +


R has many built-in mathematical functions. To call a function, we type its name, followed by opening and closing parentheses. Functions take arguments as inputs: anything we type inside the parentheses of a function is considered an argument. Depending on the function, the number of arguments can vary from none to multiple. For example:


R +

+getwd() #returns an absolute filepath

doesn’t require an argument, whereas for the next set of mathematical +functions we will need to supply the function a value in order to +compute the result.


R +

+sin(1)  # trigonometry functions


[1] 0.841471

R +

+log(1)  # natural logarithm


[1] 0

R +

+log10(10) # base-10 logarithm


[1] 1

R +

+exp(0.5) # e^(1/2)


[1] 1.648721

Don’t worry about trying to remember every function in R. You can +look them up on Google, or if you can remember the start of the +function’s name, use the tab completion in RStudio.


This is one advantage that RStudio has over R on its own, it has +auto-completion abilities that allow you to more easily look up +functions, their arguments, and the values that they take.


Typing a ? before the name of a command will open the help page for that command. When using RStudio, this will open the ‘Help’ pane; if using R in the terminal, the help page is usually shown as plain text in the terminal itself (or in your browser, depending on how R is configured). The help page will include a detailed description of the command and how it works. Scrolling to the bottom of the help page will usually show a collection of code examples which illustrate command usage. We’ll go through an example later.


Comparing things +


We can also do comparisons in R:


R +

+1 == 1  # equality (note two equals signs, read as "is equal to")


[1] TRUE

R +

+1 != 2  # inequality (read as "is not equal to")


[1] TRUE

R +

+1 < 2  # less than


[1] TRUE

R +

+1 <= 1  # less than or equal to


[1] TRUE

R +

+1 > 0  # greater than


[1] TRUE

R +

+1 >= -9 # greater than or equal to


[1] TRUE
+ +

Tip: Comparing Numbers +


A word of warning about comparing numbers: you should never use +== to compare two numbers unless they are integers (a data +type which can specifically represent only whole numbers).


Computers may only represent decimal numbers with a certain degree of +precision, so two numbers which look the same when printed out by R, may +actually have different underlying representations and therefore be +different by a small margin of error (called Machine numeric +tolerance).


Instead you should use the all.equal function.


Further reading: http://floating-point-gui.de/
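A short sketch of why this matters: the two values below print the same at R's default precision, yet `==` reports that they differ. The variable names `a` and `b` are only for illustration.

```r
a <- 0.1 + 0.2
b <- 0.3

# Exact comparison fails because of a tiny floating point difference.
a == b                    # FALSE

# all.equal() compares within a small tolerance; wrap it in isTRUE()
# so that the result is always a single TRUE or FALSE.
isTRUE(all.equal(a, b))   # TRUE
```

Wrapping the call in `isTRUE()` is useful because `all.equal()` returns a description of the difference, rather than `FALSE`, when the values do not match.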


Variables and assignment +


We can store values in variables using the assignment operator +<-, like this:


R +

+x <- 1/40

Notice that assignment does not print a value. Instead, we stored it +for later in something called a variable. +x now contains the value +0.025:


R +

+x


[1] 0.025

More precisely, the stored value is a decimal approximation +of this fraction called a floating point +number.


Look for the Environment tab in the top right panel of +RStudio, and you will see that x and its value have +appeared. Our variable x can be used in place of a number +in any calculation that expects a number:


R +

+log(x)


[1] -3.688879

Notice also that variables can be reassigned:


R +

+x <- 100

x used to contain the value 0.025 and now it has the +value 100.


Assignment values can contain the variable being assigned to:


R +

+x <- x + 1 #notice how RStudio updates its description of x on the top right tab
+y <- x * 2

The right hand side of the assignment can be any valid R expression. +The right hand side is fully evaluated before the assignment +occurs.


Variable names can contain letters, numbers, underscores and periods but no spaces. They must start with a letter or a period followed by a letter (they cannot start with a number nor an underscore). Variables beginning with a period are hidden variables. Different people use different conventions for long variable names; these include

  • periods.between.words
  • +
  • underscores_between_words
  • +
  • camelCaseToSeparateWords
  • +

What you use is up to you, but be consistent.


It is also possible to use the = operator for +assignment:


R +

+x = 1/40

But this is much less common among R users. The most important thing +is to be consistent with the operator you use. There +are occasionally places where it is less confusing to use +<- than =, and it is the most common symbol +used in the community. So the recommendation is to use +<-.

+ +

Challenge 1 +


Which of the following are valid R variable names?


R +

+ +

The following can be used as R variables:


R +


The following creates a hidden variable:


R +


The following will not be able to be used to create a variable


R +


Vectorization +


One final thing to be aware of is that R is vectorized, +meaning that variables and functions can have vectors as values. In +contrast to physics and mathematics, a vector in R describes a set of +values in a certain order of the same data type. For example


R +

+1:5


[1] 1 2 3 4 5

R +

+2^(1:5)


[1]  2  4  8 16 32

R +

+x <- 1:5
+2^x


[1]  2  4  8 16 32

This is incredibly powerful; we will discuss this further in an +upcoming lesson.
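As a further sketch of what vectorization means in practice, arithmetic, comparisons and mathematical functions all work element by element. The vector `heights_cm` and its values are invented for this example.

```r
heights_cm <- c(150, 160, 170)   # hypothetical measurements

# Arithmetic is applied to every element at once.
heights_m <- heights_cm / 100    # 1.5 1.6 1.7

# Comparisons return one TRUE/FALSE per element.
heights_m > 1.55                 # FALSE TRUE TRUE

# Mathematical functions are vectorized too.
sqrt(c(1, 4, 9))                 # 1 2 3
```

No loop is needed: the operation is written once and R applies it across the whole vector.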


Managing your environment +


There are a few useful commands you can use to interact with the R +session.


ls will list all of the variables and functions stored +in the global environment (your working R session):


R +

+ls()


[1] "x" "y"
+ +

Tip: hidden objects +


Like in the shell, ls will hide any variables or functions starting with a “.” by default. To list all objects, type ls(all.names=TRUE) instead.
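A small sketch of the difference, using two made-up object names (`.hidden_note` and `visible_note`):

```r
.hidden_note <- "not listed by default"
visible_note <- "listed by default"

# The object whose name starts with a period is skipped by a plain ls().
"visible_note" %in% ls()                   # TRUE
".hidden_note" %in% ls()                   # FALSE

# With all.names = TRUE, hidden objects are listed as well.
".hidden_note" %in% ls(all.names = TRUE)   # TRUE
```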


Note here that we didn’t give any arguments to ls, but +we still needed to give the parentheses to tell R to call the +function.


If we type ls by itself, R prints a bunch of code +instead of a listing of objects.


R +

+ls


function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE, 
+    pattern, sorted = TRUE) 
+    if (!missing(name)) {
+        pos <- tryCatch(name, error = function(e) e)
+        if (inherits(pos, "error")) {
+            name <- substitute(name)
+            if (!is.character(name)) 
+                name <- deparse(name)
+            warning(gettextf("%s converted to character string", 
+                sQuote(name)), domain = NA)
+            pos <- name
+        }
+    }
+    all.names <- .Internal(ls(envir, all.names, sorted))
+    if (!missing(pattern)) {
+        if ((ll <- length(grep("[", pattern, fixed = TRUE))) && 
+            ll != length(grep("]", pattern, fixed = TRUE))) {
+            if (pattern == "[") {
+                pattern <- "\\["
+                warning("replaced regular expression pattern '[' by  '\\\\['")
+            }
+            else if (length(grep("[^\\\\]\\[<-", pattern))) {
+                pattern <- sub("\\[<-", "\\\\\\[<-", pattern)
+                warning("replaced '[<-' by '\\\\[<-' in regular expression pattern")
+            }
+        }
+        grep(pattern, all.names, value = TRUE)
+    }
+    else all.names
+<bytecode: 0x557b0600c360>
+<environment: namespace:base>

What’s going on here?


Like everything in R, ls is the name of an object, and +entering the name of an object by itself prints the contents of the +object. The object x that we created earlier contains 1, 2, +3, 4, 5:


R +

+x


[1] 1 2 3 4 5

The object ls contains the R code that makes the +ls function work! We’ll talk more about how functions work +and start writing our own later.


You can use rm to delete objects you no longer need:


R +

+rm(x)

If you have lots of things in your environment and want to delete all +of them, you can pass the results of ls to the +rm function:


R +

+rm(list = ls())

In this case we’ve combined the two. Like the order of operations, +anything inside the innermost parentheses is evaluated first, and so +on.


In this case we’ve specified that the results of ls +should be used for the list argument in rm. +When assigning values to arguments by name, you must use the += operator!!


If instead we use <-, there will be unintended side +effects, or you may get an error message:


R +

+rm(list <- ls())


Error in rm(list <- ls()): ... must contain names or character strings
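To make the difference concrete, here is a small sketch with invented object names. Passing `list =` names the argument, while `<-` inside a call performs an assignment first and then passes the value by position, which is where the unintended side effects come from.

```r
tmp_a <- 1
tmp_b <- 2

# `=` names the argument: remove the objects whose names are given.
rm(list = c("tmp_a", "tmp_b"))
exists("tmp_a")                    # FALSE

# `<-` inside a call still runs, but it also creates a new variable
# in your session as a side effect.
result <- mean(side_effect_values <- c(2, 4, 6))
result                             # 4
exists("side_effect_values")       # TRUE
```

The call to `mean()` works here only because its first positional argument happens to be the data; with `rm()`, the same pattern leads to the error shown above.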
+ +

Tip: Warnings vs. Errors +


Pay attention when R does something unexpected! Errors, like above, +are thrown when R cannot proceed with a calculation. Warnings on the +other hand usually mean that the function has run, but it probably +hasn’t worked as expected.


In both cases, the message that R prints out usually gives you clues about how to fix the problem.
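The following sketch contrasts the two situations. It assumes nothing beyond base R; the input `"ten"` is just an example of a value that cannot be treated as a number.

```r
# A warning: the function still returns a value (here NA), but flags a problem.
# suppressWarnings() hides the message so we can inspect the result.
converted <- suppressWarnings(as.numeric("ten"))
is.na(converted)   # TRUE

# An error: the function cannot return anything, so we catch the condition
# with tryCatch() and keep its message instead.
outcome <- tryCatch(
  log("ten"),
  error = function(e) paste("error:", conditionMessage(e))
)
outcome
```

In everyday use you would normally read the warning rather than suppress it; it is hidden here only to show that a value was still returned.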


R Packages +


It is possible to add functions to R by writing a package, or by obtaining a package written by someone else. As of this writing, there are over 10,000 packages available on CRAN (the Comprehensive R Archive Network). R and RStudio have functionality for managing packages:

  • You can see what packages are installed by typing +installed.packages() +
  • +
  • You can install packages by typing +install.packages("packagename"), where +packagename is the package name, in quotes.
  • +
  • You can update installed packages by typing +update.packages() +
  • +
  • You can remove a package with +remove.packages("packagename") +
  • +
  • You can make a package available for use with +library(packagename) +
  • +

Packages can also be viewed, loaded, and detached in the Packages tab +of the lower right panel in RStudio. Clicking on this tab will display +all of the installed packages with a checkbox next to them. If the box +next to a package name is checked, the package is loaded and if it is +empty, the package is not loaded. Click an empty box to load that +package and click a checked box to detach that package.


Packages can be installed and updated from the Packages tab with the Install and Update buttons at the top of the tab.
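Before installing anything, it can be handy to check which packages are already available. The sketch below is one possible approach, not the only one: `needed` is an example list (`stats` ships with R, `ggplot2` may or may not be installed on your machine), and the install step is left commented out so that nothing is downloaded by accident.

```r
needed <- c("stats", "ggplot2")   # example package names

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise.
is_available <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_available]

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them
}
```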

+ +

Challenge 2 +


What will be the value of each variable after each statement in the +following program?


R +

+mass <- 47.5
+age <- 122
+mass <- mass * 2.3
+age <- age - 20
+ +

R +

+mass <- 47.5

This will give a value of 47.5 for the variable mass


R +

+age <- 122

This will give a value of 122 for the variable age


R +

+mass <- mass * 2.3

This will multiply the existing value of 47.5 by 2.3 to give a new +value of 109.25 to the variable mass.


R +

+age <- age - 20

This will subtract 20 from the existing value of 122 to give a new +value of 102 to the variable age.

+ +

Challenge 3 +


Run the code from the previous challenge, and write a command to +compare mass to age. Is mass larger than age?

+ +

One way of answering this question in R is to use the +> to set up the following:


R +

+mass > age


[1] TRUE

This should yield a boolean value of TRUE since 109.25 is greater +than 102.

+ +

Challenge 4 +


Clean up your working environment by deleting the mass and age +variables.

+ +

We can use the rm command to accomplish this task


R +

+rm(age, mass)
+ +

Challenge 5 +


Install the following packages: ggplot2, +plyr, gapminder

+ +

We can use the install.packages() command to install the +required packages.


R +

+install.packages("ggplot2")
+install.packages("plyr")
+install.packages("gapminder")

An alternate solution, to install multiple packages with a single +install.packages() command is:


R +

+install.packages(c("ggplot2", "plyr", "gapminder"))
+ +

Keypoints +

  • Use RStudio to write and run R programs.
  • +
  • R has the usual arithmetic operators and mathematical +functions.
  • +
  • Use <- to assign values to variables.
  • +
  • Use ls() to list the variables in a program.
  • +
  • Use rm() to delete objects in a program.
  • +
  • Use install.packages() to install packages +(libraries).
  • +

Content from Project Management With RStudio


Last updated on 2023-10-26


Estimated time 30 minutes

+ +




  • How can I manage my projects in R?
  • +


  • Create self-contained projects in RStudio
  • +

Introduction +


The scientific process is naturally incremental, and many projects +start life as random notes, some code, then a manuscript, and eventually +everything is a bit mixed together.

+ +

Most people tend to organize their projects like this:

Screenshot of file manager demonstrating bad project organisation

There are many reasons why we should ALWAYS avoid this:

  1. It is really hard to tell which version of your data is the original +and which is the modified;
  2. +
  3. It gets really messy because it mixes files with various extensions +together;
  4. +
  5. It probably takes you a lot of time to actually find things, and +relate the correct figures to the exact code that has been used to +generate it;
  6. +

A good project layout will ultimately make your life easier:

  • It will help ensure the integrity of your data;
  • +
  • It makes it simpler to share your code with someone else (a +lab-mate, collaborator, or supervisor);
  • +
  • It allows you to easily upload your code with your manuscript +submission;
  • +
  • It makes it easier to pick the project back up after a break.
  • +

A possible solution +


Fortunately, there are tools and packages which can help you manage +your work effectively.


One of the most powerful and useful aspects of RStudio is its project +management functionality. We’ll be using this today to create a +self-contained, reproducible project.

+ +

Challenge 1: Creating a self-contained +project +


We’re going to create a new project in RStudio:

  1. Click the “File” menu button, then “New Project”.
  2. +
  3. Click “New Directory”.
  4. +
  5. Click “New Project”.
  6. +
  7. Type in the name of the directory to store your project, +e.g. “my_project”.
  8. +
  9. If available, select the checkbox for “Create a git +repository.”
  10. +
  11. Click the “Create Project” button.
  12. +

The simplest way to open an RStudio project once it has been created +is to click through your file system to get to the directory where it +was saved and double click on the .Rproj file. This will +open RStudio and start your R session in the same directory as the +.Rproj file. All your data, plots and scripts will now be +relative to the project directory. RStudio projects have the added +benefit of allowing you to open multiple projects at the same time each +open to its own project directory. This allows you to keep multiple +projects open without them interfering with each other.

+ +

Challenge 2: Opening an RStudio project +through the file system +

  1. Exit RStudio.
  2. +
  3. Navigate to the directory where you created a project in Challenge +1.
  4. +
  5. Double click on the .Rproj file in that directory.
  6. +

Best practices for project organization +


Although there is no “best” way to lay out a project, there are some +general principles to adhere to that will make project management +easier:


Treat data as read only +


This is probably the most important goal of setting up a project. Data is typically time consuming and/or expensive to collect. Working with it interactively (e.g., in Excel), where it can be modified, means you are never sure of where the data came from, or how it has been modified since collection. It is therefore a good idea to treat your data as “read-only”.


Data Cleaning +


In many cases your data will be “dirty”: it will need significant preprocessing to get into a format R (or any other programming language) will find useful. This task is sometimes called “data munging”. Storing the scripts that do this preprocessing in a separate folder, and creating a second “read-only” data folder to hold the “cleaned” data sets, can prevent confusion between the two sets.


Treat generated output as disposable +


Anything generated by your scripts should be treated as disposable: +it should all be able to be regenerated from your scripts.


There are lots of different ways to manage this output. Having an output folder with different sub-directories for each separate analysis makes it easier later, since many analyses are exploratory and don’t end up being used in the final project, and some of the analyses get shared between projects.

+ +

Tip: Good Enough Practices for Scientific +Computing +


Good +Enough Practices for Scientific Computing gives the following +recommendations for project organization:

  1. Put each project in its own directory, which is named after the +project.
  2. +
  3. Put text documents associated with the project in the +doc directory.
  4. +
  5. Put raw data and metadata in the data directory, and +files generated during cleanup and analysis in a results +directory.
  6. +
  7. Put source for the project’s scripts and programs in the +src directory, and programs brought in from elsewhere or +compiled locally in the bin directory.
  8. +
  9. Name all files to reflect their content or function.
  10. +
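The recommended layout can be sketched directly from R. This example builds the folders inside a temporary directory so it does not touch your real files; the project name `my_project` is only a placeholder.

```r
# Build the layout under a temporary directory (a stand-in for a real project).
project_root <- file.path(tempdir(), "my_project")
subfolders <- c("data", "doc", "results", "src", "bin")

for (folder in subfolders) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# List the top-level folders that were created.
list.dirs(project_root, full.names = FALSE, recursive = FALSE)
```

In a real project you would run the same `dir.create()` calls relative to the project directory, or simply create the folders from the Files pane in RStudio.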

Separate function definition and application +


One of the more effective ways to work with R is to start by writing +the code you want to run directly in a .R script, and then running the +selected lines (either using the keyboard shortcuts in RStudio or +clicking the “Run” button) in the interactive R console.


When your project is in its early stages, the initial .R script file usually contains many lines of directly executed code. As it matures, reusable chunks get pulled into their own functions. It’s a good idea to keep these in two separate folders: one to store useful functions that you’ll reuse across analyses and projects, and one to store the analysis scripts.
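A minimal sketch of this separation is shown below. The file name `functions.R` and the function `fahr_to_kelvin()` are invented for illustration, and a temporary directory stands in for the folder that holds your reusable functions.

```r
# 1. Function definitions live in their own file.
functions_file <- file.path(tempdir(), "functions.R")
writeLines(
  "fahr_to_kelvin <- function(temp_f) (temp_f - 32) * (5 / 9) + 273.15",
  functions_file
)

# 2. The analysis script loads the definitions, then applies them.
source(functions_file)
fahr_to_kelvin(32)   # 273.15
```

Keeping the definition in one file means every analysis script that calls `source()` on it uses exactly the same code.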


Save the data in the data directory +


Now that we have a good directory structure, we will place/save the data file in the data/ directory.

+ +

Challenge 3 +


Download the gapminder data from here.

  1. Download the file (right mouse click on the link above -> “Save +link as” / “Save file as”, or click on the link and after the page +loads, press Ctrl+S or choose File -> “Save +page as”)
  2. +
  3. Make sure it’s saved under the name +gapminder_data.csv +
  4. +
  5. Save the file in the data/ folder within your +project.
  6. +

We will load and inspect these data later.

+ +

Challenge 4 +


It is useful to get some general idea about the dataset, directly +from the command line, before loading it into R. Understanding the +dataset better will come in handy when making decisions on how to load +it in R. Use the command-line shell to answer the following +questions:

  1. What is the size of the file?
  2. +
  3. How many rows of data does it contain?
  4. +
  5. What kinds of values are stored in this file?
  6. +
+ +

By running these commands in the shell:


SH +

ls -lh data/gapminder_data.csv


-rw-r--r-- 1 runner docker 80K Oct 26 09:54 data/gapminder_data.csv

The file size is 80K.


SH +

wc -l data/gapminder_data.csv


1705 data/gapminder_data.csv

There are 1705 lines. The data looks like:


SH +

head data/gapminder_data.csv


+ +

Tip: command line in RStudio +


The Terminal tab in the console pane provides a convenient place +directly within RStudio to interact directly with the command line.
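The same three questions from Challenge 4 can also be answered without leaving R. The sketch below uses a tiny made-up CSV file written to a temporary location, so the column names and values are placeholders rather than the gapminder data.

```r
# Create a small example file (placeholder content, not the real dataset).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("site,year,count", "A,2000,10", "B,2001,12"), csv_path)

file.size(csv_path)            # size of the file in bytes
length(readLines(csv_path))    # number of lines: 3
readLines(csv_path, n = 2)     # peek at the first two lines
```

For the lesson data you would replace `csv_path` with the path to `data/gapminder_data.csv` inside your project.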


Working directory +


Knowing R’s current working directory is important because when you +need to access other files (for example, to import a data file), R will +look for them relative to the current working directory.


Each time you create a new RStudio Project, it will create a new +directory for that project. When you open an existing +.Rproj file, it will open that project and set R’s working +directory to the folder that file is in.

+ +

Challenge 5 +


You can check the current working directory with the +getwd() command, or by using the menus in RStudio.

  1. In the console, type getwd() (“wd” is short for +“working directory”) and hit Enter.
  2. +
  3. In the Files pane, double click on the data folder to +open it (or navigate to any other folder you wish). To get the Files +pane back to the current working directory, click “More” and then select +“Go To Working Directory”.
  4. +

You can change the working directory with setwd(), or by +using RStudio menus.

  1. In the console, type setwd("data") and hit Enter. Type +getwd() and hit Enter to see the new working +directory.
  2. +
  3. In the menus at the top of the RStudio window, click the “Session” +menu button, and then select “Set Working Directory” and then “Choose +Directory”. Next, in the windows navigator that opens, navigate back to +the project directory, and click “Open”. Note that a setwd +command will automatically appear in the console.
  4. +
+ +

Tip: File does not exist errors +


When you’re attempting to reference a file in your R code and you’re +getting errors saying the file doesn’t exist, it’s a good idea to check +your working directory. You need to either provide an absolute path to +the file, or you need to make sure the file is saved in the working +directory (or a subfolder of the working directory) and provide a +relative path.
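One way to debug this kind of error is to ask R whether it can see the file before trying to read it. The sketch below assumes the lesson's `data/gapminder_data.csv` path; it only reports what it finds and does not read the file.

```r
# Build the relative path in a way that works on every operating system.
data_file <- file.path("data", "gapminder_data.csv")

if (file.exists(data_file)) {
  message("Found: ", normalizePath(data_file))
} else {
  message("No '", data_file, "' relative to ", getwd())
}
```

If the second message appears, either change the working directory to the project folder or correct the relative path.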


Version Control +


It is important to use version control with projects. Go here +for a good lesson which describes using Git with RStudio.

+ +

Keypoints +

  • Use RStudio to create and manage projects with consistent +layout.
  • +
  • Treat raw data as read-only.
  • +
  • Treat generated output as disposable.
  • +
  • Separate function definition and application.
  • +

Content from Seeking Help


Last updated on 2023-10-26


Estimated time 20 minutes

+ +




  • How can I get help in R?
  • +


  • To be able to read R help files for functions and special +operators.
  • +
  • To be able to use CRAN task views to identify packages to solve a +problem.
  • +
  • To be able to seek help from your peers.
  • +

Reading Help Files +


R and every package provide help files for functions. The general syntax to look up help for any function, “function_name”, that is in a package loaded into your namespace (your interactive R session) is:


R +

+?function_name
+help(function_name)

For example take a look at the help file for +write.table(), we will be using a similar function in an +upcoming episode.


R +

+?write.table

This will load up a help page in RStudio (or as plain text in R +itself).


Each help page is broken down into sections:

  • Description: An extended description of what the function does.
  • +
  • Usage: The arguments of the function and their default values (which +can be changed).
  • +
  • Arguments: An explanation of the data each argument is +expecting.
  • +
  • Details: Any important details to be aware of.
  • +
  • Value: The data the function returns.
  • +
  • See Also: Any related functions you might find useful.
  • +
  • Examples: Some examples for how to use the function.
  • +

Different functions might have different sections, but these are the +main ones you should be aware of.
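Some of what the Usage and Arguments sections describe can also be inspected from the console. This is a small sketch using `write.table()` as the example; `args()` and `formals()` are base R functions.

```r
# Print the function's arguments and their default values.
args(write.table)

# Look up the default value of a single argument.
formals(write.table)$sep         # " "

# List all argument names.
names(formals(write.table))
```

This is a quick check, not a replacement for the help page, which also explains what each argument means.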


Notice how related functions might call for the same help file:


R +


This is because these functions have very similar applicability and +often share the same arguments as inputs to the function, so package +authors often choose to document them together in a single help +file.
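
As an illustration of shared help files, here is a minimal sketch that assumes write.table() and write.csv() as the pair of related functions. It compares the help file behind each topic without opening a viewer; the variable names are hypothetical.

```r
# help() returns an object describing the help file it found,
# so two topics can be compared without opening a viewer.
table_help <- help("write.table")
csv_help <- help("write.csv")

# File name of the help page behind each topic.
table_page <- basename(as.character(table_help))
csv_page <- basename(as.character(csv_help))

# TRUE when both topics are documented on the same page.
identical(table_page, csv_page)
```

In an interactive session, typing ?write.table or ?write.csv should open that same page.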

+ +

Tip: Running Examples +


From within the function help page, you can highlight code in the Examples section and hit Ctrl+Return to run it in the RStudio console. This gives you a quick way to get a feel for how a function works.

+ +

Tip: Reading Help Files +


One of the most daunting aspects of R is the large number of functions available. It would be prohibitive, if not impossible, to remember the correct usage for every function you use. Luckily, the help files mean you don’t have to!


Special Operators +


To seek help on special operators, use quotes or backticks:


R +

+?"<-"
+?`<-`

Getting Help with Packages +


Many packages come with “vignettes”: tutorials and extended example documentation. Without any arguments, vignette() will list all vignettes for all installed packages; vignette(package="package-name") will list all available vignettes for package-name, and vignette("vignette-name") will open the specified vignette.


If a package doesn’t have any vignettes, you can usually find help by +typing help("package-name").


RStudio also has a set of excellent cheatsheets for +many packages.


When You Remember Part of the Function Name +


If you’re not sure what package a function is in or how it’s +specifically spelled, you can do a fuzzy search:


R +

+??function_name

A fuzzy search is when you search for an approximate string match. +For example, you may remember that the function to set your working +directory includes “set” in its name. You can do a fuzzy search to help +you identify the function:


R +

+??set

When You Have No Idea Where to Begin +


If you don’t know what function or package you need to use, CRAN Task Views is a specially maintained list of packages grouped into fields. This can be a good starting point.


When Your Code Doesn’t Work: Seeking Help from Your Peers +


If you’re having trouble using a function, 9 times out of 10 the question you have has already been answered on Stack Overflow. You can search using the [r] tag. Please make sure to read their page on how to ask a good question.


If you can’t find the answer, there are a few useful functions to +help you ask your peers:


R +

+?dput

Will dump the data you’re working with into a format that can be +copied and pasted by others into their own R session.
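
A minimal sketch of how dput() can be used when asking for help, assuming a small made-up data frame (example_data is a hypothetical name). The text that dput() prints is itself R code, so evaluating it again rebuilds the object:

```r
# A small, made-up data frame standing in for the data you are working with.
example_data <- data.frame(id = 1:3, score = c(2.5, 4, 3.25))

# dput() prints R code that recreates the object.
dput(example_data)

# Someone else would paste that text into their session. Here this is
# simulated by capturing the text and evaluating it again.
dput_text <- capture.output(dput(example_data))
recreated <- eval(parse(text = dput_text))
all.equal(example_data, recreated)
```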


R +

+sessionInfo()


R version 4.3.1 (2023-06-16)
+Platform: x86_64-pc-linux-gnu (64-bit)
+Running under: Ubuntu 22.04.3 LTS
+Matrix products: default
+BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
+LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
+ [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
+ [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
+time zone: UTC
+tzcode source: system (glibc)
+attached base packages:
+[1] stats     graphics  grDevices utils     datasets  methods   base     
+loaded via a namespace (and not attached):
+[1] compiler_4.3.1    tools_4.3.1       rstudioapi_0.15.0 yaml_2.3.7       
+[5] knitr_1.43        xfun_0.40         renv_1.0.3        evaluate_0.21    

Will print out your current version of R, as well as any packages you +have loaded. This can be useful for others to help reproduce and debug +your issue.

+ +

Challenge 1 +


Look at the help page for the c function. What kind of +vector do you expect will be created if you evaluate the following:


R +

+c(1, 2, 3)
+c('d', 'e', 'f')
+c(1, 2, 'f')
+ +

The c() function creates a vector, in which all elements +are of the same type. In the first case, the elements are numeric, in +the second, they are characters, and in the third they are also +characters: the numeric values are “coerced” to be characters.

+ +

Challenge 2 +


Look at the help for the paste function. You will need +to use it later. What’s the difference between the sep and +collapse arguments?

+ +

To look at the help for the paste() function, use:


R +

+help("paste")
+?paste

The difference between sep and collapse is +a little tricky. The paste function accepts any number of +arguments, each of which can be a vector of any length. The +sep argument specifies the string used between concatenated +terms — by default, a space. The result is a vector as long as the +longest argument supplied to paste. In contrast, +collapse specifies that after concatenation the elements +are collapsed together using the given separator, the result +being a single string.


It is important to name the arguments explicitly, e.g. sep = ",", so the function understands to use the “,” as a separator and not as a term to concatenate. For example:


R +

+paste(c("a","b"), "c")


[1] "a c" "b c"

R +

+paste(c("a","b"), "c", ",")


[1] "a c ," "b c ,"

R +

+paste(c("a","b"), "c", sep = ",")


[1] "a,c" "b,c"

R +

+paste(c("a","b"), "c", collapse = "|")


[1] "a c|b c"

R +

+paste(c("a","b"), "c", sep = ",", collapse = "|")


[1] "a,c|b,c"

(For more information, scroll to the bottom of the +?paste help page and look at the examples, or try +example('paste').)

+ +

Challenge 3 +


Use help to find a function (and its associated parameters) that you +could use to load data from a tabular file in which columns are +delimited with “\t” (tab) and the decimal point is a “.” (period). This +check for decimal separator is important, especially if you are working +with international colleagues, because different countries have +different conventions for the decimal point (i.e. comma vs period). +Hint: use ??"read table" to look up functions related to +reading in tabular data.

+ +

The standard R function for reading tab-delimited files with a period +decimal separator is read.delim(). You can also do this with +read.table(file, sep="\t") (the period is the +default decimal separator for read.table()), +although you may have to change the comment.char argument +as well if your data file contains hash (#) characters.
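
The two approaches can be compared on a small file. This is a sketch under stated assumptions: the file content and the names tab_file, from_delim and from_table are made up, and header = TRUE is passed to read.table() because, unlike read.delim(), it does not assume a header row by default.

```r
# Write a tiny tab-delimited file to a temporary location (hypothetical data).
tab_file <- tempfile(fileext = ".tsv")
writeLines(c("site\tdepth", "A\t1.5", "B\t2.25"), tab_file)

# read.delim() assumes tabs, a header row and "." as the decimal mark.
from_delim <- read.delim(tab_file)

# read.table() needs the separator and the header spelled out.
from_table <- read.table(tab_file, sep = "\t", header = TRUE, dec = ".")

# Both calls should give the same data frame.
all.equal(from_delim, from_table)
```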


Other Resources +

+ +
+ +

Keypoints +

  • Use help() to get online help in R.
  • +

Content from Data Structures


Last updated on 2023-10-26


Estimated time 55 minutes

+ +




  • How can I read data in R?
  • What are the basic data types in R?
  • How do I represent categorical information in R?


  • To be able to identify the 5 main data types.
  • To begin exploring data frames, and understand how they are related to vectors and lists.
  • To be able to ask questions from R about the type, class, and structure of an object.
  • To understand the information of the attributes “names”, “class”, and “dim”.

One of R’s most powerful features is its ability to deal with tabular +data - such as you may already have in a spreadsheet or a CSV file. +Let’s start by making a toy dataset in your data/ +directory, called feline-data.csv:


R +

+cats <- data.frame(coat = c("calico", "black", "tabby"),
+                    weight = c(2.1, 5.0, 3.2),
+                    likes_string = c(1, 0, 1))

We can now save cats as a CSV file. It is good practice +to call the argument names explicitly so the function knows what default +values you are changing. Here we are setting +row.names = FALSE. Recall you can use +?write.csv to pull up the help file to check out the +argument names and their default values.


R +

+write.csv(x = cats, file = "data/feline-data.csv", row.names = FALSE)

The contents of the new file, feline-data.csv:


R +
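
A minimal sketch of one way to look at the file from within R. It assumes a temporary file (csv_path, a hypothetical name) in place of data/feline-data.csv so that it runs anywhere. Note that write.csv() puts quotes around text values by default.

```r
# Recreate the toy data and write it to a temporary file, so this sketch
# does not depend on the data/ directory existing.
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))
csv_path <- tempfile(fileext = ".csv")
write.csv(x = cats, file = csv_path, row.names = FALSE)

# readLines() returns the raw text of the file, one element per line.
file_lines <- readLines(csv_path)
writeLines(file_lines)
```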

+ +

Tip: Editing Text files in R +


Alternatively, you can create data/feline-data.csv using +a text editor (Nano), or within RStudio with the File -> New +File -> Text File menu item.


We can load this into R via the following:


R +

+cats <- read.csv(file = "data/feline-data.csv")
+cats


    coat weight likes_string
+1 calico    2.1            1
+2  black    5.0            0
+3  tabby    3.2            1

The read.table function is used for reading in tabular +data stored in a text file where the columns of data are separated by +punctuation characters such as CSV files (csv = comma-separated values). +Tabs and commas are the most common punctuation characters used to +separate or delimit data points in csv files. For convenience R provides +2 other versions of read.table. These are: +read.csv for files where the data are separated with commas +and read.delim for files where the data are separated with +tabs. Of these three functions read.csv is the most +commonly used. If needed it is possible to override the default +delimiting punctuation marks for both read.csv and +read.delim.
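
As a sketch of overriding the default delimiter, the example below reads semicolon-separated text with read.csv(). The data and the names semicolon_text and cats_semicolon are made up for illustration; the text is supplied inline through the text argument instead of a file.

```r
# Hypothetical semicolon-separated data, supplied inline through text=.
semicolon_text <- "coat;weight\ncalico;2.1\nblack;5.0"

# Override the default comma separator of read.csv().
cats_semicolon <- read.csv(text = semicolon_text, sep = ";")
str(cats_semicolon)
```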

+ +

Check your data for factors +


In recent versions of R, the default handling of text data has changed. Previously, R automatically converted text data into a format called “factors”. There is a simpler format called “character”. We will hear about factors later, and what to use them for. For now, remember that in most cases they are not needed and only complicate your life, which is why newer R versions read in text as “character”. Check now whether your version of R has automatically created factors, and convert them to “character” format:

  1. Check the data types of your input by typing str(cats).
  2. In the output, look at the type codes after the colons: if you see only “num”, “int” and “chr”, you can continue with the lesson and skip this box. If you find “Factor”, continue to step 3.
  3. Prevent R from automatically creating “factor” data. That can be done by the following code: options(stringsAsFactors = FALSE). Then re-read the cats table for the change to take effect.
  4. You must set this option every time you restart R. So that you do not forget, include it in your analysis script before you read in any data, for example in one of the first lines.
  5. From R version 4.0.0 onwards, text data is no longer converted to factors. So you can install this or a newer version to avoid this problem. If you are working on an institute or company computer, ask your administrator to do it.

We can begin exploring our dataset right away, pulling out columns by +specifying them using the $ operator:


R +

+cats$weight


[1] 2.1 5.0 3.2

R +

+cats$coat


[1] "calico" "black"  "tabby" 

We can do other operations on the columns:


R +

+## Say we discovered that the scale weighs two kg light:
+cats$weight + 2


[1] 4.1 7.0 5.2

R +

+paste("My cat is", cats$coat)


[1] "My cat is calico" "My cat is black"  "My cat is tabby" 

But what about


R +

+cats$weight + cats$coat


Error in cats$weight + cats$coat: non-numeric argument to binary operator

Understanding what happened here is key to successfully analyzing +data in R.


Data Types +


If you guessed that the last command will return an error because +2.1 plus "black" is nonsense, you’re right - +and you already have some intuition for an important concept in +programming called data types. We can ask what type of data +something is:


R +

+typeof(cats$weight)


[1] "double"

There are 5 main types: double, integer, +complex, logical and character. +For historic reasons, double is also called +numeric.
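
As an illustration, typeof() can be applied to one value of each type. The specific values below are arbitrary examples chosen for this sketch, not taken from the lesson data:

```r
# One arbitrary example value per basic data type.
example_values <- list(3.14, 1L, 1 + 1i, TRUE, "banana")

# Ask R for the type of each value in turn.
types <- sapply(example_values, typeof)
types
```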


R +



[1] "double"

R +

+typeof(1L) # The L suffix forces the number to be an integer, since by default R uses float numbers


[1] "integer"

R +



[1] "complex"

R +



[1] "logical"

R +



[1] "character"

No matter how complicated our analyses become, all data in R is +interpreted as one of these basic data types. This strictness has some +really important consequences.


A user has added details of another cat. This information is in the +file data/feline-data_v2.csv.


R +


R +

+tabby,2.3 or 2.4,1

Load the new cats data like before, and check what type of data we +find in the weight column:


R +

+cats <- read.csv(file="data/feline-data_v2.csv")
+typeof(cats$weight)


[1] "character"

Oh no, our weights aren’t the double type anymore! If we try to do +the same math we did on them before, we run into trouble:


R +

+cats$weight + 2


Error in cats$weight + 2: non-numeric argument to binary operator

What happened? The cats data we are working with is +something called a data frame. Data frames are one of the most +common and versatile types of data structures we will work with +in R. A given column in a data frame cannot be composed of different +data types. In this case, R does not read everything in the data frame +column weight as a double, therefore the entire +column data type changes to something that is suitable for everything in +the column.
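
The effect of a single non-numeric entry can be reproduced without any files. This is a sketch with made-up inline data (clean_text and messy_text are hypothetical names) read through the text argument of read.csv():

```r
# Two versions of the same tiny table; only the last weight differs.
clean_text <- "coat,weight\ncalico,2.1\nblack,5.0"
messy_text <- "coat,weight\ncalico,2.1\ntabby,2.3 or 2.4"

clean <- read.csv(text = clean_text)
messy <- read.csv(text = messy_text)

# Every entry parses as a number, so the column is stored as double.
typeof(clean$weight)

# One entry is not a number, so the whole column changes type.
typeof(messy$weight)
```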


When R reads a csv file, it reads it in as a data frame. +Thus, when we loaded the cats csv file, it is stored as a +data frame. We can recognize data frames by the first row that is +written by the str() function:


R +

+str(cats)


'data.frame':	4 obs. of  3 variables:
+ $ coat        : chr  "calico" "black" "tabby" "tabby"
+ $ weight      : chr  "2.1" "5" "3.2" "2.3 or 2.4"
+ $ likes_string: int  1 0 1 1

Data frames are composed of rows and columns, where each +column has the same number of rows. Different columns in a data frame +can be made up of different data types (this is what makes them so +versatile), but everything in a given column needs to be the same type +(e.g., vector, factor, or list).


Let’s explore more about different data structures and how they +behave. For now, let’s remove that extra line from our cats data and +reload it, while we investigate this behavior further:




And back in RStudio:


R +

+cats <- read.csv(file="data/feline-data.csv")

Vectors and Type Coercion +


To better understand this behavior, let’s meet another of the data +structures: the vector.


R +

+my_vector <- vector(length = 3)
+my_vector


[1] FALSE FALSE FALSE

A vector in R is essentially an ordered list of things, with the +special condition that everything in the vector must be the same +basic data type. If you don’t choose the datatype, it’ll default to +logical; or, you can declare an empty vector of whatever +type you like.


R +

+another_vector <- vector(mode='character', length=3)
+another_vector


[1] "" "" ""

You can check if something is a vector:


R +

+str(another_vector)


 chr [1:3] "" "" ""

The somewhat cryptic output from this command indicates the basic +data type found in this vector - in this case chr, +character; an indication of the number of things in the vector - +actually, the indexes of the vector, in this case [1:3]; +and a few examples of what’s actually in the vector - in this case empty +character strings. If we similarly do


R +

+str(cats$weight)


 num [1:3] 2.1 5 3.2

we see that cats$weight is a vector, too - the +columns of data we load into R data.frames are all vectors, and +that’s the root of why R forces everything in a column to be the same +basic data type.

+ +

Discussion 1 +


Why is R so opinionated about what we put in our columns of data? How +does this help us?

+ +

By keeping everything in a column the same, we allow ourselves to +make simple assumptions about our data; if you can interpret one entry +in the column as a number, then you can interpret all of them +as numbers, so we don’t have to check every time. This consistency is +what people mean when they talk about clean data; in the long +run, strict consistency goes a long way to making our lives easier in +R.


Coercion by combining vectors +


You can also make vectors with explicit contents with the combine +function:


R +

+combine_vector <- c(2,6,3)
+combine_vector


[1] 2 6 3

Given what we’ve learned so far, what do you think the following will +produce?


R +

+quiz_vector <- c(2,6,'3')

This is something called type coercion, and it is the source +of many surprises and the reason why we need to be aware of the basic +data types and how R will interpret them. When R encounters a mix of +types (here double and character) to be combined into a single vector, +it will force them all to be the same type. Consider:


R +

+coercion_vector <- c('a', TRUE)
+coercion_vector


[1] "a"    "TRUE"

R +

+another_coercion_vector <- c(0, TRUE)
+another_coercion_vector


[1] 0 1

The type hierarchy +


The coercion rules go: logical -> +integer -> double (“numeric”) +-> complex -> character, where -> can +be read as are transformed into. For example, combining +logical and character transforms the result to +character:


R +

+c('a', TRUE)


[1] "a"    "TRUE"

A quick way to recognize character vectors is by the +quotes that enclose them when they are printed.
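
The full chain of coercion rules can be checked one step at a time. This is a minimal sketch; the values and the name coercion_steps are chosen only for illustration:

```r
# Combine one pair of neighbouring types per step and record the result type.
coercion_steps <- c(
  logical_with_integer   = typeof(c(TRUE, 1L)),
  integer_with_double    = typeof(c(1L, 2.5)),
  double_with_complex    = typeof(c(2.5, 1 + 0i)),
  complex_with_character = typeof(c(1 + 0i, "a"))
)
coercion_steps
```

In each pair, the result takes the type that sits further along the chain.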


You can try to force coercion against this flow using the +as. functions:


R +

+character_vector_example <- c('0','2','4')
+character_vector_example


[1] "0" "2" "4"

R +

+character_coerced_to_double <- as.double(character_vector_example)
+character_coerced_to_double


[1] 0 2 4

R +

+double_coerced_to_logical <- as.logical(character_coerced_to_double)
+double_coerced_to_logical


[1] FALSE  TRUE  TRUE

As you can see, some surprising things can happen when R forces one +basic data type into another! Nitty-gritty of type coercion aside, the +point is: if your data doesn’t look like what you thought it was going +to look like, type coercion may well be to blame; make sure everything +is the same type in your vectors and your columns of data.frames, or you +will get nasty surprises!


But coercion can also be very useful! For example, in our +cats data likes_string is numeric, but we know +that the 1s and 0s actually represent TRUE and +FALSE (a common way of representing them). We should use +the logical datatype here, which has two states: +TRUE or FALSE, which is exactly what our data +represents. We can ‘coerce’ this column to be logical by +using the as.logical function:


R +

+cats$likes_string


[1] 1 0 1

R +

+cats$likes_string <- as.logical(cats$likes_string)
+cats$likes_string


[1]  TRUE FALSE  TRUE

+ +

Challenge 1 +


An important part of every data analysis is cleaning the input data. If you know that the input data is all of the same format (e.g. numbers), your analysis is much easier! Clean the cat data set from the section about type coercion.


Copy the code template +


Create a new script in RStudio and copy and paste the following code. +Then move on to the tasks below, which help you to fill in the gaps +(______).

# Read data
+cats <- read.csv("data/feline-data_v2.csv")
+# 1. Print the data
+# 2. Show an overview of the table with all data types
+# 3. The "weight" column has the incorrect data type __________.
+#    The correct data type is: ____________.
+# 4. Correct the 4th weight data point with the mean of the two given values
+cats$weight[4] <- 2.35
+#    print the data again to see the effect
+# 5. Convert the weight to the right data type
+cats$weight <- ______________(cats$weight)
+#    Calculate the mean to test yourself
+# If you see the correct mean value (and not NA), you did the exercise
+# correctly!

Instructions for the tasks +

+ +

Execute the first statement (read.csv(...)). Then print +the data to the console

+ +

Show the content of any variable by typing its name.


Solution to Challenge 1.1 +


Two correct solutions:

+ +

2. Overview of the data types +


The data type of your data is as important as the data itself. Use a +function we saw earlier to print out the data types of all columns of +the cats table.

+ +

In the chapter “Data types” we saw two functions that can show data +types. One printed just a single word, the data type name. The other +printed a short form of the data type, and the first few values. We need +the second here.

+ +

Challenge 1 (continued) +


Solution to Challenge 1.2


3. Which data type do we need? +


The shown data type is not the right one for this data (weight of a +cat). Which data type do we need?

  • Why did the read.csv() function not choose the correct +data type?
  • +
  • Fill in the gap in the comment with the correct data type for cat +weight!
  • +
+ +

Scroll up to the section about the type +hierarchy to review the available data types

+ +
  • Weight is expressed on a continuous scale (real numbers). The R data type for this is “double” (also known as “numeric”).
  • The fourth row has the value “2.3 or 2.4”. That is not a single number but two numbers and an English word. Therefore, the “character” data type is chosen. The whole column is now text, because all values in the same column have to be the same data type.
+ +

4. Correct the problematic value +


The code to assign a new weight value to the problematic fourth row +is given. Think first and then execute it: What will be the data type +after assigning a number like in this example? You can check the data +type after executing to see if you were right.

+ +

Revisit the hierarchy of data types when two different data types are +combined.

+ +

Challenge 1 (continued) +


Solution to challenge 1.4


The data type of the column “weight” is “character”. The assigned +data type is “double”. Combining two data types yields the data type +that is higher in the following hierarchy:

logical < integer < double < complex < character

Therefore, the column is still of type character! We need to manually convert it to “double”.


5. Convert the column “weight” to the correct data type +


Cat weights are numbers, but the column does not have this data type yet. Coerce the column to floating point numbers.

+ +

The functions to convert data types start with as.. You can look for the function further up in the lesson or use the RStudio auto-complete function: type “as.” and then press the TAB key.

+ +

Challenge 1 (continued) +


Solution to Challenge 1.5


There are two functions that are synonymous for historic reasons:

cats$weight <- as.double(cats$weight)
+cats$weight <- as.numeric(cats$weight)
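
The clean-up steps of Challenge 1 can be sketched on a stand-alone vector, assuming the same four weight values as in feline-data_v2.csv. The names weight and type_after_assignment are hypothetical stand-ins for the cats$weight column:

```r
# Stand-alone copy of the weight column as read from the messy file.
weight <- c("2.1", "5", "3.2", "2.3 or 2.4")

# Assigning a double into a character vector coerces the new value to text.
weight[4] <- 2.35
type_after_assignment <- typeof(weight)
type_after_assignment

# Only an explicit conversion turns the values into numbers.
weight <- as.numeric(weight)
typeof(weight)
mean(weight)
```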

Some basic vector functions +


The combine function, c(), will also append things to an +existing vector:


R +

+ab_vector <- c('a', 'b')


[1] "a" "b"

R +

+combine_example <- c(ab_vector, 'SWC')


[1] "a"   "b"   "SWC"

You can also make series of numbers:


R +

+mySeries <- 1:10


 [1]  1  2  3  4  5  6  7  8  9 10

R +



 [1]  1  2  3  4  5  6  7  8  9 10

R +

+seq(1,10, by=0.1)


 [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4
+[16]  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9
+[31]  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4
+[46]  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
+[61]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4
+[76]  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9
+[91] 10.0

We can ask a few questions about vectors:


R +

+sequence_example <- 20:25
+head(sequence_example, n=2)


[1] 20 21

R +

+tail(sequence_example, n=4)


[1] 22 23 24 25

R +

+length(sequence_example)


[1] 6

R +

+typeof(sequence_example)


[1] "integer"

We can get individual elements of a vector by using the bracket +notation:


R +

+first_element <- sequence_example[1]
+first_element


[1] 20

To change a single element, use the bracket on the other side of the +arrow:


R +

+sequence_example[1] <- 30
+sequence_example


[1] 30 21 22 23 24 25
+ +

Challenge 2 +


Start by making a vector with the numbers 1 through 26. Then, +multiply the vector by 2.

+ +

R +

+x <- 1:26
+x <- x * 2

Lists +


Another data structure you’ll want in your bag of tricks is the +list. A list is simpler in some ways than the other types, +because you can put anything you want in it. Remember everything in +the vector must be of the same basic data type, but a list can have +different data types:


R +

+list_example <- list(1, "a", TRUE, 1+4i)
+list_example


[[1]]
+[1] 1
+[[2]]
+[1] "a"
+[[3]]
+[1] TRUE
+[[4]]
+[1] 1+4i

When printing the object structure with str(), we see +the data types of all elements:


R +

+str(list_example)


List of 4
+ $ : num 1
+ $ : chr "a"
+ $ : logi TRUE
+ $ : cplx 1+4i

What is the use of lists? They can organize data of different +types. For example, you can organize different tables that +belong together, similar to spreadsheets in Excel. But there are many +other uses, too.


We will see another example that will maybe surprise you in the next +chapter.


To retrieve one of the elements of a list, use the double +bracket:


R +

+list_example[[2]]


[1] "a"

The elements of lists can also have names, which are given by prepending them to the values, separated by an equals sign:


R +

+another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE )
+another_list


$title
+[1] "Numbers"
+$numbers
+ [1]  1  2  3  4  5  6  7  8  9 10
+$data
+[1] TRUE

This results in a named list. Names give us an additional way to access single elements:


R +

+another_list$title


[1] "Numbers"

Names +


With names, we can give meaning to elements. This is the first time that we have not only the data, but also information that explains it. It is metadata that can be stuck to the object like a label. In R, this is called an attribute. Some attributes enable us to do more with our object, for example, as here, accessing an element by a self-defined name.


Accessing vectors and lists by name +


We have already seen how to generate a named list. The way to +generate a named vector is very similar. You have seen this function +before:


R +

+pizza_price <- c( pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50 )

The way to retrieve elements is different, though:


R +

+pizza_price["pizzasubito"]


pizzasubito 
+       5.64 

The approach used for the list does not work:


R +

+pizza_price$pizzafresh


Error in pizza_price$pizzafresh: $ operator is invalid for atomic vectors

It will pay off if you remember this error message, you will meet it +in your own analyses. It means that you have just tried accessing an +element like it was in a list, but it is actually in a vector.
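
The difference between the two kinds of access can be seen side by side. This is a minimal sketch using shortened copies of the pizza prices (prices_vector, prices_list and dollar_error are hypothetical names); tryCatch() is used only so that the error message can be shown without stopping the script:

```r
# The same prices stored once as a named vector and once as a named list.
prices_vector <- c(pizzasubito = 5.64, pizzafresh = 6.60)
prices_list <- list(pizzasubito = 5.64, pizzafresh = 6.60)

prices_vector["pizzafresh"]    # single brackets keep the name
prices_vector[["pizzafresh"]]  # double brackets return the bare value
prices_list$pizzafresh         # lists support the $ operator

# $ on an atomic vector fails; capture the message instead of stopping.
dollar_error <- tryCatch(prices_vector$pizzafresh,
                         error = function(e) conditionMessage(e))
dollar_error
```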


Accessing and changing names +


If you are only interested in the names, use the names() +function:


R +

+names(pizza_price)


[1] "pizzasubito" "pizzafresh"  "callapizza" 

We have seen how to access and change single elements of a vector. +The same is possible for names:


R +

+names(pizza_price)[3]


[1] "callapizza"

R +

+names(pizza_price)[3] <- "call-a-pizza"
+pizza_price


 pizzasubito   pizzafresh call-a-pizza 
+        5.64         6.60         4.50 
+ +

Challenge 3 +

  • What is the data type of the names of pizza_price? You +can find out using the str() or typeof() +functions.
  • +
+ +

You get the names of an object by wrapping the object name inside names(...). Similarly, you get the data type of the names by again wrapping the whole code in typeof(...):

typeof(names(pizza_price))

Alternatively, use a new variable if this is easier for you to read:

n <- names(pizza_price)
+typeof(n)
+ +

Challenge 4 +


Instead of just changing some of the names a vector/list already has, you can also set all names of an object by writing code like the following (replace the ALL CAPS text):

names(OBJECT) <- CHARACTER_VECTOR

Create a vector that gives the number for each letter in the +alphabet!

  1. Generate a vector called letter_no with the sequence of numbers from 1 to 26!
  2. R has a built-in object called LETTERS. It is a 26-character vector, from A to Z. Set the names of the number sequence to these 26 letters.
  3. Test yourself by calling letter_no["B"], which should give you the number 2!
+ +
letter_no <- 1:26   # or seq(1,26)
+names(letter_no) <- LETTERS

Data frames +


We saw data frames at the very beginning of this lesson; they represent a table of data. We didn’t go into much detail with our example cat data frame:


R +

+cats


    coat weight likes_string
+1 calico    2.1         TRUE
+2  black    5.0        FALSE
+3  tabby    3.2         TRUE

We can now understand something a bit surprising in our data.frame; +what happens if we run:


R +

+typeof(cats)


[1] "list"

We see that data.frames look like lists ‘under the hood’. Think again +what we heard about what lists can be used for:


Lists organize data of different types


Columns of a data frame are vectors of different types, that are +organized by belonging to the same table.


A data.frame is really a list of vectors. It is a special list in +which all the vectors must have the same length.


How is this “special”-ness written into the object, so that R does +not treat it like any other list, but as a table?


R +

+class(cats)


[1] "data.frame"

A class, just like names, is an attribute attached +to the object. It tells us what this object means for humans.


You might wonder: Why do we need another +what-type-of-object-is-this-function? We already have +typeof()? That function tells us how the object is +constructed in the computer. The class is +the meaning of the object for humans. Consequently, +what typeof() returns is fixed in R (mainly the +five data types), whereas the output of class() is +diverse and extendable by R packages.
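
The contrast between typeof() and class() can be sketched on a rebuilt copy of the cats table. The use of unclass() and the name plain_list are additions for illustration: removing the class attribute shows the ordinary list underneath.

```r
# Rebuild the cleaned cats table so the sketch runs on its own.
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(TRUE, FALSE, TRUE),
                   stringsAsFactors = FALSE)

typeof(cats)  # how the object is stored
class(cats)   # what the object means

# Every column (list element) has the same length.
lengths(cats)

# Dropping the class attribute leaves an ordinary named list.
plain_list <- unclass(cats)
class(plain_list)
is.data.frame(plain_list)
```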


In our cats example, we have a character, a double and a logical variable. As we have seen already, each column of a data.frame is a vector.


R +

+cats$coat


[1] "calico" "black"  "tabby" 

R +

+cats[,1]


[1] "calico" "black"  "tabby" 

R +

+typeof(cats[,1])


[1] "character"

R +

+str(cats[,1])


 chr [1:3] "calico" "black" "tabby"

Each row is an observation of different variables. A row is itself a data.frame, and thus can be composed of elements of different types.


R +

+cats[1,]


    coat weight likes_string
+1 calico    2.1         TRUE

R +

+typeof(cats[1,])


[1] "list"

R +

+str(cats[1,])


'data.frame':	1 obs. of  3 variables:
+ $ coat        : chr "calico"
+ $ weight      : num 2.1
+ $ likes_string: logi TRUE
+ +

Challenge 5 +


There are several subtly different ways to call variables, +observations and elements from data.frames:

  • cats[1]
  • +
  • cats[[1]]
  • +
  • cats$coat
  • +
  • cats["coat"]
  • +
  • cats[1, 1]
  • +
  • cats[, 1]
  • +
  • cats[1, ]
  • +

Try out these examples and explain what is returned by each one.


Hint: Use the function typeof() to examine what +is returned in each case.

+ +

R +

+cats[1]


    coat
+1 calico
+2  black
+3  tabby

We can think of a data frame as a list of vectors. The single bracket [1] returns the first slice of the list, as another list. In this case it is the first column of the data frame, returned as a one-column data frame.


R +

+cats[[1]]


[1] "calico" "black"  "tabby" 

The double bracket [[1]] returns the contents of the list item. In this case it is the contents of the first column, a vector of type character.


R +

+cats$coat


[1] "calico" "black"  "tabby" 

This example uses the $ character to address items by +name. coat is the first column of the data frame, again a +vector of type character.


R +

+cats["coat"]


    coat
+1 calico
+2  black
+3  tabby

Here we are using a single bracket ["coat"], replacing the index number with the column name. Like example 1, the returned object is a list (a one-column data frame).


R +

+cats[1, 1]


[1] "calico"

This example uses a single bracket, but this time we provide row and column coordinates. The returned object is the value in row 1, column 1. The object is a vector of type character.


R +

+cats[, 1]


[1] "calico" "black"  "tabby" 

Like the previous example, we use single brackets and provide row and column coordinates. The row coordinate is not specified, so R interprets this missing value as all the elements in this column and returns them as a vector.


R +

+cats[1, ]


    coat weight likes_string
+1 calico    2.1         TRUE

Again we use the single bracket with row and column coordinates. The column coordinate is not specified, so all columns are returned. The return value is a list (a one-row data frame) containing all the values in the first row.

+ +
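
The seven expressions from Challenge 5 can also be compared side by side. This is a sketch on a rebuilt copy of the cats table; class() is used here instead of typeof() because it separates data frames from plain vectors more visibly, and the names results and returned_class are hypothetical.

```r
# Rebuild the cleaned cats table so the sketch runs on its own.
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(TRUE, FALSE, TRUE),
                   stringsAsFactors = FALSE)

# Evaluate each way of subsetting and keep the result under a label.
results <- list(
  "cats[1]"      = cats[1],
  "cats[[1]]"    = cats[[1]],
  "cats$coat"    = cats$coat,
  'cats["coat"]' = cats["coat"],
  "cats[1, 1]"   = cats[1, 1],
  "cats[, 1]"    = cats[, 1],
  "cats[1, ]"    = cats[1, ]
)

# First class of each result: a data frame or a plain character vector.
returned_class <- sapply(results, function(x) class(x)[1])
returned_class
```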

Tip: Renaming data frame columns +


Data frames have column names, which can be accessed with the +names() function.


R +

+names(cats)


[1] "coat"         "weight"       "likes_string"

If you want to rename the second column of cats, you can +assign a new name to the second element of names(cats).


R +

+names(cats)[2] <- "weight_kg"
+cats


    coat weight_kg likes_string
+1 calico       2.1         TRUE
+2  black       5.0        FALSE
+3  tabby       3.2         TRUE

Matrices +


Last but not least is the matrix. We can declare a matrix full of +zeros:


R +

+matrix_example <- matrix(0, ncol=6, nrow=3)
+matrix_example


     [,1] [,2] [,3] [,4] [,5] [,6]
+[1,]    0    0    0    0    0    0
+[2,]    0    0    0    0    0    0
+[3,]    0    0    0    0    0    0

What makes it special is the dim() attribute:


R +

+dim(matrix_example)


[1] 3 6

And similar to other data structures, we can ask things about our +matrix:


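R +

+typeof(matrix_example)


[1] "double"

R +

+class(matrix_example)


[1] "matrix" "array" 

R +

+str(matrix_example)


 num [1:3, 1:6] 0 0 0 0 0 0 0 0 0 0 ...

R +

+nrow(matrix_example)


[1] 3

R +

+ncol(matrix_example)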


[1] 6
+ +

Challenge 6 +


What do you think will be the result of +length(matrix_example)? Try it. Were you right? Why / why +not?

+ +

What do you think will be the result of +length(matrix_example)?


R +

+matrix_example <- matrix(0, ncol=6, nrow=3)
+length(matrix_example)


[1] 18

Because a matrix is a vector with added dimension attributes, +length gives you the total number of elements in the +matrix.
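The claim that a matrix is a vector with dimension attributes can be checked directly. This is a minimal sketch; the name `v` is hypothetical and used only for illustration.

```r
v <- numeric(18)       # eighteen zeros stored as a plain vector
dim(v) <- c(3, 6)      # attach dimensions: 3 rows, 6 columns

is.matrix(v)           # the vector is now treated as a matrix
length(v)              # still counts every element
dim(v)
```

Assigning to `dim()` changes how R presents the object, not how many elements it holds, which is why `length()` reports the total.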

+ +

Challenge 7 +


Make another matrix, this time containing the numbers 1:50, with 5 +columns and 10 rows. Did the matrix function fill your +matrix by column, or by row, as its default behaviour? See if you can +figure out how to change this. (hint: read the documentation for +matrix!)

+ +


R +

+x <- matrix(1:50, ncol=5, nrow=10)
+x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row
+ +

Challenge 8 +


Create a list of length two containing a character vector for each of +the sections in this part of the workshop:

  • Data types
  • +
  • Data structures
  • +

Populate each character vector with the names of the data types and +data structures we’ve seen so far.

+ +

R +

+dataTypes <- c('double', 'complex', 'integer', 'character', 'logical')
+dataStructures <- c('data.frame', 'vector', 'list', 'matrix')
+answer <- list(dataTypes, dataStructures)

Note: it’s nice to make a list in big writing on the board or taped +to the wall listing all of these types and structures - leave it up for +the rest of the workshop to remind people of the importance of these +basics.

+ +

Challenge 9 +


Consider the R output of the matrix below:



     [,1] [,2]
+[1,]    4    1
+[2,]    9    5
+[3,]   10    7

What was the correct command used to write this matrix? Examine each +command and try to figure out the correct one before typing them. Think +about what matrices the other commands will produce.

  1. matrix(c(4, 1, 9, 5, 10, 7), nrow = 3)
  2. +
  3. matrix(c(4, 9, 10, 1, 5, 7), ncol = 2, byrow = TRUE)
  4. +
  5. matrix(c(4, 9, 10, 1, 5, 7), nrow = 2)
  6. +
  7. matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)
  8. +
+ +


R +

+matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)
+ +

Keypoints +

  • Use read.csv to read tabular data in R.
  • +
  • The basic data types in R are double, integer, complex, logical, and +character.
  • +
  • Data structures such as data frames or matrices are built on top of +lists and vectors, with some added attributes.
  • +

Content from Exploring Data Frames


Last updated on 2023-10-26


Estimated time 30 minutes

+ +




  • How can I manipulate a data frame?
  • +


  • Add and remove rows or columns.
  • +
  • Append two data frames.
  • +
  • Display basic properties of data frames including size and class of +the columns, names, and first few rows.
  • +

At this point, you’ve seen it all: in the last lesson, we toured all +the basic data types and data structures in R. Everything you do will be +a manipulation of those tools. But most of the time, the star of the +show is the data frame—the table that we created by loading information +from a csv file. In this lesson, we’ll learn a few more things about +working with data frames.


Adding columns and rows in data frames +


We already learned that the columns of a data frame are vectors, so +that our data are consistent in type throughout the columns. As such, if +we want to add a new column, we can start by making a new vector:


R +

+age <- c(2, 3, 5)


    coat weight likes_string
+1 calico    2.1            1
+2  black    5.0            0
+3  tabby    3.2            1

We can then add this as a column via:


R +

+cbind(cats, age)


    coat weight likes_string age
+1 calico    2.1            1   2
+2  black    5.0            0   3
+3  tabby    3.2            1   5

Note that if we tried to add a vector of ages with a different number +of entries than the number of rows in the data frame, it would fail:


R +

+age <- c(2, 3, 5, 12)
+cbind(cats, age)


Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 4

R +

+age <- c(2, 3)
+cbind(cats, age)


Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 2

Why didn’t this work? Of course, R wants to see one element in our +new column for every row in the table:


R +

+nrow(cats)


[1] 3

R +

+length(age)


[1] 2

So for it to work we need to have nrow(cats) = +length(age). Let’s overwrite the content of cats with our +new data frame.


R +

+age <- c(2, 3, 5)
+cats <- cbind(cats, age)

Now how about adding rows? We already know that the rows of a data +frame are lists:


R +

+newRow <- list("tortoiseshell", 3.3, TRUE, 9)
+cats <- rbind(cats, newRow)

Let’s confirm that our new row was added correctly.


R +



           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9

Removing rows +


We now know how to add rows and columns to our data frame in R. Now +let’s learn to remove rows.


R +



           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9

We can ask for a data frame minus the last row:


R +

+cats[-4, ]


    coat weight likes_string age
+1 calico    2.1            1   2
+2  black    5.0            0   3
+3  tabby    3.2            1   5

Notice the comma with nothing after it to indicate that we want to +drop the entire fourth row.


Note: we could also remove several rows at once by putting the row +numbers inside of a vector, for example: +cats[c(-3,-4), ]


Removing columns +


We can also remove columns in our data frame. What if we want to remove the column “age”? We can remove it in two ways: by column number or by column name.


R +

+cats[,-4]


           coat weight likes_string
+1        calico    2.1            1
+2         black    5.0            0
+3         tabby    3.2            1
+4 tortoiseshell    3.3            1

Notice the comma with nothing before it, indicating we want to keep +all of the rows.


Alternatively, we can drop the column by using its name and the %in% operator. The %in% operator goes through each element of its left argument, in this case the names of cats, and asks, “Does this element occur in the second argument?”


R +

+drop <- names(cats) %in% c("age")
+cats[,!drop]


           coat weight likes_string
+1        calico    2.1            1
+2         black    5.0            0
+3         tabby    3.2            1
+4 tortoiseshell    3.3            1

We will cover subsetting with logical operators like +%in% in more detail in the next episode. See the section Subsetting through other logical +operations
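Dropping a column by name can be sketched end to end as follows. The data frame is rebuilt by hand from the printed output, and `cats_no_age` is a hypothetical name chosen for illustration.

```r
# Assumption: a hand-built copy of cats with the age column already added.
cats <- data.frame(coat = c("calico", "black", "tabby", "tortoiseshell"),
                   weight = c(2.1, 5.0, 3.2, 3.3),
                   likes_string = c(1, 0, 1, 1),
                   age = c(2, 3, 5, 9),
                   stringsAsFactors = FALSE)

# TRUE marks the columns whose names appear in the right-hand vector.
drop <- names(cats) %in% c("age")

# Negate the logical vector to keep everything except the marked columns.
cats_no_age <- cats[, !drop]
names(cats_no_age)
```

Selecting by name keeps working if the column order changes, which is one reason to prefer it over a position such as `-4`.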


Appending to a data frame +


The key to remember when adding data to a data frame is that +columns are vectors and rows are lists. We can also glue two +data frames together with rbind:


R +

+cats <- rbind(cats, cats)


           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+5        calico    2.1            1   2
+6         black    5.0            0   3
+7         tabby    3.2            1   5
+8 tortoiseshell    3.3            1   9

But now the row names are unnecessarily complicated. We can remove +the rownames, and R will automatically re-name them sequentially:


R +

+rownames(cats) <- NULL


           coat weight likes_string age
+1        calico    2.1            1   2
+2         black    5.0            0   3
+3         tabby    3.2            1   5
+4 tortoiseshell    3.3            1   9
+5        calico    2.1            1   2
+6         black    5.0            0   3
+7         tabby    3.2            1   5
+8 tortoiseshell    3.3            1   9
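Row names stop being sequential most visibly after rows are removed. A small sketch with a hypothetical data frame `df` (not part of the lesson data) shows the effect of resetting them:

```r
df <- data.frame(id = c("a", "b", "c", "d"), x = 1:4,
                 stringsAsFactors = FALSE)

kept <- df[-2, ]             # drop the second row
before <- rownames(kept)     # the original labels survive, leaving a gap

rownames(kept) <- NULL       # discard the labels; R renumbers from 1
after <- rownames(kept)

before
after
```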
+ +

Challenge 1 +


You can create a new data frame right from within R with the +following syntax:


R +

+df <- data.frame(id = c("a", "b", "c"),
+                 x = 1:3,
+                 y = c(TRUE, TRUE, FALSE))

Make a data frame that holds the following information for +yourself:

  • first name
  • +
  • last name
  • +
  • lucky number
  • +

Then use rbind to add an entry for the people sitting +beside you. Finally, use cbind to add a column with each +person’s answer to the question, “Is it time for coffee break?”

+ +

R +

+df <- data.frame(first = c("Grace"),
+                 last = c("Hopper"),
+                 lucky_number = c(0))
+df <- rbind(df, list("Marie", "Curie", 238) )
+df <- cbind(df, coffeetime = c(TRUE,TRUE))

Realistic example +


So far, you have seen the basics of manipulating data frames with our +cat data; now let’s use those skills to digest a more realistic dataset. +Let’s read in the gapminder dataset that we downloaded +previously:


R +

+gapminder <- read.csv("data/gapminder_data.csv")
+ +

Miscellaneous Tips +

  • Another type of file you might encounter is the tab-separated values file (.tsv). To read one, pass sep = "\t" to read.csv() or use read.delim().

  • +
  • Files can also be downloaded directly from the Internet into a local folder of your choice using the download.file function. The read.csv function can then read the downloaded file from that location, for example:

  • +

R +

+download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
+gapminder <- read.csv("data/gapminder_data.csv")
  • Alternatively, you can read files directly into R from the Internet by replacing the file path with a web address in read.csv. Note that in this case no local copy of the csv file is saved on your computer. For example:
  • +

R +

+gapminder <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv")
  • You can read directly from Excel spreadsheets without converting them to plain text first by using the readxl package.

  • +
  • The argument stringsAsFactors tells R whether to read strings as factors or as character strings. From R 4.0.0 onward, strings are read in as characters by default; in earlier versions of R, they are read in as factors by default. For more information, see the call-out in the previous episode.

  • +
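The tab-separated case from the first tip can be tried without any external data. This is a sketch under stated assumptions: the file is a throwaway temporary file with made-up contents, and `tsv_path`, `from_delim`, and `from_csv` are illustrative names.

```r
# Write a tiny tab-separated file to a temporary location.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("coat\tweight", "calico\t2.1", "black\t5.0"), tsv_path)

# read.delim() assumes tabs; read.csv() needs the separator spelled out.
from_delim <- read.delim(tsv_path)
from_csv   <- read.csv(tsv_path, sep = "\t")

from_delim
identical(from_delim, from_csv)
```

Note that the separator is the single tab character `"\t"`, written with one backslash.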

Let’s investigate gapminder a bit; the first thing we should always +do is check out what the data looks like with str:


R +

+str(gapminder)


'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...

An additional method for examining the structure of gapminder is to use the summary function. This function can be used on various objects in R. For data frames, summary yields a numeric, tabular, or descriptive summary of each column. Numeric or integer columns are described by descriptive statistics (quartiles and mean), and character columns by their length, class, and mode.


R +

+summary(gapminder)


   country               year           pop             continent        
+ Length:1704        Min.   :1952   Min.   :6.001e+04   Length:1704       
+ Class :character   1st Qu.:1966   1st Qu.:2.794e+06   Class :character  
+ Mode  :character   Median :1980   Median :7.024e+06   Mode  :character  
+                    Mean   :1980   Mean   :2.960e+07                     
+                    3rd Qu.:1993   3rd Qu.:1.959e+07                     
+                    Max.   :2007   Max.   :1.319e+09                     
+    lifeExp        gdpPercap       
+ Min.   :23.60   Min.   :   241.2  
+ 1st Qu.:48.20   1st Qu.:  1202.1  
+ Median :60.71   Median :  3531.8  
+ Mean   :59.47   Mean   :  7215.3  
+ 3rd Qu.:70.85   3rd Qu.:  9325.5  
+ Max.   :82.60   Max.   :113523.1  

Along with the str and summary functions, +we can examine individual columns of the data frame with our +typeof function:


R +

+typeof(gapminder$year)


[1] "integer"

R +

+typeof(gapminder$country)


[1] "character"

R +

+str(gapminder$country)


 chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...

We can also interrogate the data frame for information about its +dimensions; remembering that str(gapminder) said there were +1704 observations of 6 variables in gapminder, what do you think the +following will produce, and why?


R +

+length(gapminder)


[1] 6

A fair guess would have been to say that the length of a data frame +would be the number of rows it has (1704), but this is not the case; +remember, a data frame is a list of vectors and factors:


R +

+typeof(gapminder)


[1] "list"

When length gave us 6, it’s because gapminder is built +out of a list of 6 columns. To get the number of rows and columns in our +dataset, try:


R +

+nrow(gapminder)


[1] 1704

R +

+ncol(gapminder)


[1] 6

Or, both at once:


R +

+dim(gapminder)


[1] 1704    6
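The difference between `length()` and the row and column counts is easy to see on a small example. The data frame `df` below is hypothetical and stands in for gapminder:

```r
df <- data.frame(id = c("a", "b", "c", "d"),
                 x = 1:4,
                 y = c(TRUE, TRUE, FALSE, FALSE))

length(df)   # number of columns, because a data frame is a list of columns
nrow(df)     # number of rows
ncol(df)     # number of columns
dim(df)      # rows first, then columns
```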

We’ll also likely want to know what the titles of all the columns +are, so we can ask for them later:


R +

+colnames(gapminder)


[1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"

At this stage, it’s important to ask ourselves if the structure R is +reporting matches our intuition or expectations; do the basic data types +reported for each column make sense? If not, we need to sort any +problems out now before they turn into bad surprises down the road, +using what we’ve learned about how R interprets data, and the importance +of strict consistency in how we record our data.


Once we’re happy that the data types and structures seem reasonable, +it’s time to start digging into our data proper. Check out the first few +lines:


R +

+head(gapminder)


      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134
+ +

Challenge 2 +


It’s good practice to also check the last few lines of your data and +some in the middle. How would you do this?


Searching for ones specifically in the middle isn’t too hard, but we +could ask for a few lines at random. How would you code this?

+ +

To check the last few lines it’s relatively simple as R already has a +function for this:


R +

+tail(gapminder, n = 15)

What about a few arbitrary rows just in case something is odd in the +middle?


Tip: There are several ways to achieve this. +


The solution here presents one form of using nested functions, i.e. a +function passed as an argument to another function. This might sound +like a new concept, but you are already using it! Remember +my_dataframe[rows, cols] will print to screen your data frame with the +number of rows and columns you asked for (although you might have asked +for a range or named columns for example). How would you get the last +row if you don’t know how many rows your data frame has? R has a +function for this. What about getting a (pseudorandom) sample? R also +has a function for this.


R +

+gapminder[sample(nrow(gapminder), 5), ]

To make sure our analysis is reproducible, we should put the code +into a script file so we can come back to it later.

+ +

Challenge 3 +


Go to file -> new file -> R script, and write an R script to +load in the gapminder dataset. Put it in the scripts/ +directory and add it to version control.


Run the script using the source function, using the file +path as its argument (or by pressing the “source” button in +RStudio).

+ +

The source function can be used to use a script within a +script. Assume you would like to load the same type of file over and +over again and therefore you need to specify the arguments to fit the +needs of your file. Instead of writing the necessary argument again and +again you could just write it once and save it as a script. Then, you +can use source("Your_Script_containing_the_load_function") +in a new script to use the function of that script without writing +everything again. Check out ?source to find out more.


R +

+download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
+gapminder <- read.csv(file = "data/gapminder_data.csv")

To run the script and load the data into the gapminder +variable:


R +

+source(file = "scripts/load-gapminder.R")
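The idea of keeping a loading function in one script and reusing it elsewhere can be sketched as follows. Everything here is an assumption made for illustration: the function name `load_table`, the temporary script, and the tiny CSV file are not part of the lesson.

```r
# Write a small script that defines a reusable loading function.
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "load_table <- function(path) {",
  "  read.csv(file = path, header = TRUE)",
  "}"
), script_path)

# source() runs the script, which makes load_table() available here.
source(script_path)

# Use the sourced function on a throwaway CSV file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("country,year", "Senegal,2007"), csv_path)
loaded <- load_table(csv_path)
loaded
```

In a real project the script would live in the scripts/ directory and be sourced by its path, as in the example above.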
+ +

Challenge 4 +


Read the output of str(gapminder) again; this time, use +what you’ve learned about lists and vectors, as well as the output of +functions like colnames and dim to explain +what everything that str prints out for gapminder means. If +there are any parts you can’t interpret, discuss with your +neighbors!

+ +

The object gapminder is a data frame with columns

  • +country and continent are character +strings.
  • +
  • +year is an integer vector.
  • +
  • +pop, lifeExp, and gdpPercap +are numeric vectors.
  • +
+ +

Keypoints +

  • Use cbind() to add a new column to a data frame.
  • +
  • Use rbind() to add a new row to a data frame.
  • +
  • Remove rows from a data frame.
  • +
  • Use str(), summary(), nrow(), +ncol(), dim(), colnames(), +rownames(), head(), and typeof() +to understand the structure of a data frame.
  • +
  • Read in a csv file using read.csv().
  • +
  • Understand what length() of a data frame +represents.
  • +

Content from Subsetting Data


Last updated on 2023-10-26 | + + Edit this page


Estimated time 50 minutes

+ +




  • How can I work with subsets of data in R?
  • +


  • To be able to subset vectors, factors, matrices, lists, and data +frames
  • +
  • To be able to extract individual and multiple elements: by index, by +name, using comparison operations
  • +
  • To be able to skip and remove elements from various data +structures.
  • +

R has many powerful subset operators. Mastering them will allow you +to easily perform complex operations on any kind of dataset.


There are six different ways we can subset any kind of object, and +three different subsetting operators for the different data +structures.


Let’s start with the workhorse of R: a simple numeric vector.


R +

+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')


  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 
+ +

Atomic vectors +


In R, simple vectors containing character strings, numbers, or +logical values are called atomic vectors because they can’t be +further simplified.


So now that we’ve created a dummy vector to play with, how do we get +at its contents?


Accessing elements using their indices +


To extract elements of a vector we can give their corresponding +index, starting from one:


R +

+x[1]


  a 
+5.4 

R +

+x[4]


  d 
+4.8 

It may look different, but the square brackets operator is a +function. For vectors (and matrices), it means “get me the nth +element”.


We can ask for multiple elements at once:


R +

+x[c(1, 3)]


  a   c 
+5.4 7.1 

Or slices of the vector:


R +

+x[1:4]


  a   b   c   d 
+5.4 6.2 7.1 4.8 

The : operator creates a sequence of numbers from the left element to the right.


R +

+1:4


[1] 1 2 3 4

R +

+c(1, 2, 3, 4)


[1] 1 2 3 4

We can ask for the same element multiple times:


R +

+x[c(1,1,3)]


  a   a   c 
+5.4 5.4 7.1 

If we ask for an index beyond the length of the vector, R will return +a missing value:


R +

+x[6]


<NA> 
+  NA 

This is a vector of length one containing an NA, whose +name is also NA.


If we ask for the 0th element, we get an empty vector:


R +

+x[0]


named numeric(0)
+ +

Vector numbering in R starts at 1 +


In many programming languages (C and Python, for example), the first +element of a vector has an index of 0. In R, the first element is 1.


Skipping and removing elements +


If we use a negative number as the index of a vector, R will return +every element except for the one specified:


R +

+x[-2]


  a   c   d   e 
+5.4 7.1 4.8 7.5 

We can skip multiple elements:


R +

+x[c(-1, -5)]  # or x[-c(1,5)]


  b   c   d 
+6.2 7.1 4.8 
+ +

Tip: Order of operations +


A common trip up for novices occurs when trying to skip slices of a +vector. It’s natural to try to negate a sequence like so:


R +

+x[-1:3]

This gives a somewhat cryptic error:



Error in x[-1:3]: only 0's may be mixed with negative subscripts

But remember the order of operations. : is really a +function. It takes its first argument as -1, and its second as 3, so +generates the sequence of numbers: c(-1, 0, 1, 2, 3).


The correct solution is to wrap that function call in brackets, so +that the - operator applies to the result:


R +

+x[-(1:3)]


  d   e 
+4.8 7.5 
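The precedence issue can be inspected on its own, before any subsetting happens. A minimal sketch using the same named vector:

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

mixed    <- -1:3      # parsed as (-1):3, so it mixes negative and positive values
negated  <- -(1:3)    # the sequence is built first, then negated

mixed
negated
x[negated]            # drops the first three elements
```

Printing the index vector by itself is a quick way to see why the first form is rejected as a subscript.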

To remove elements from a vector, we need to assign the result back +into the variable:


R +

+x <- x[-4]


  a   b   c   e 
+5.4 6.2 7.1 7.5 
+ +

Challenge 1 +


Given the following code:


R +

+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')


  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 

Come up with at least 2 different commands that will produce the +following output:



  b   c   d 
+6.2 7.1 4.8 

After you find 2 different commands, compare notes with your +neighbour. Did you have different strategies?

+ +

R +

+x[2:4]


  b   c   d 
+6.2 7.1 4.8 

R +

+x[-c(1,5)]


  b   c   d 
+6.2 7.1 4.8 

R +

+x[c(2,3,4)]


  b   c   d 
+6.2 7.1 4.8 

Subsetting by name +


We can extract elements by using their name, instead of extracting by +index:


R +

+x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we can name a vector 'on the fly'
+x[c("a", "c")]


  a   c 
+5.4 7.1 

This is usually a much more reliable way to subset objects: the +position of various elements can often change when chaining together +subsetting operations, but the names will always remain the same!


Subsetting through other logical operations +


We can also use any logical vector to subset:


R +

+x[c(FALSE, FALSE, TRUE, FALSE, TRUE)]


  c   e 
+7.1 7.5 

Since comparison operators (e.g. >, +<, ==) evaluate to logical vectors, we can +also use them to succinctly subset vectors: the following statement +gives the same result as the previous one.


R +

+x[x > 7]


  c   e 
+7.1 7.5 

Breaking it down, this statement first evaluates x>7, +generating a logical vector +c(FALSE, FALSE, TRUE, FALSE, TRUE), and then selects the +elements of x corresponding to the TRUE +values.


We can use == to mimic the previous method of indexing +by name (remember you have to use == rather than += for comparisons):


R +

+x[names(x) == "a"]


  a 
+5.4 
+ +

Tip: Combining logical conditions +


We often want to combine multiple logical criteria. For example, we +might want to find all the countries that are located in Asia +or Europe and have life expectancies +within a certain range. Several operations for combining logical vectors +exist in R:

  • +&, the “logical AND” operator: returns +TRUE if both the left and right are TRUE.
  • +
  • +|, the “logical OR” operator: returns +TRUE, if either the left or right (or both) are +TRUE.
  • +

You may sometimes see && and || instead of & and |. These two-character operators are meant for single values, not vectors: older versions of R look only at the first element of each vector and ignore the remaining elements, while R 4.3.0 and later raise an error when given a vector of length greater than one. In general you should not use the two-character operators in data analysis; save them for programming, i.e. deciding whether to execute a statement.

  • +!, the “logical NOT” operator: converts TRUE to FALSE and FALSE to TRUE. It can negate a single logical condition (e.g. !TRUE becomes FALSE), or a whole vector of conditions (e.g. !c(TRUE, FALSE) becomes c(FALSE, TRUE)).
  • +

Additionally, you can compare the elements within a single vector +using the all function (which returns TRUE if +every element of the vector is TRUE) and the +any function (which returns TRUE if one or +more elements of the vector are TRUE).
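The operators described in this tip can be tried on the named vector from this episode. This is a small sketch; `in_range` and `outside` are illustrative names.

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

in_range <- x > 5 & x < 7    # TRUE only where both conditions hold
outside  <- x < 5 | x > 7    # TRUE where at least one condition holds

x[in_range]
x[outside]
x[!in_range]                 # negation flips every element

all(x > 4)                   # is every element greater than 4?
any(x > 7)                   # is at least one element greater than 7?
```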

+ +

Challenge 2 +


Given the following code:


R +

+x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
+names(x) <- c('a', 'b', 'c', 'd', 'e')


  a   b   c   d   e 
+5.4 6.2 7.1 4.8 7.5 

Write a subsetting command to return the values in x that are greater +than 4 and less than 7.

+ +

R +

+x_subset <- x[x<7 & x>4]


  a   b   d 
+5.4 6.2 4.8 
+ +

Tip: Non-unique names +


You should be aware that it is possible for multiple elements in a +vector to have the same name. (For a data frame, columns can have the +same name — although R tries to avoid this — but row names must be +unique.) Consider these examples:


R +

+x <- 1:3


[1] 1 2 3

R +

+names(x) <- c('a', 'a', 'a')


a a a 
+1 2 3 

R +

+x['a']  # only returns first value


a 
+1 

R +

+x[names(x) == 'a']  # returns all three values


a a a 
+1 2 3 
+ +

Tip: Getting help for operators +


Remember you can search for help on operators by wrapping them in +quotes: help("%in%") or ?"%in%".


Skipping named elements +


Skipping or removing named elements is a little harder. If we try to +skip one named element by negating the string, R complains (slightly +obscurely) that it doesn’t know how to take the negative of a +string:


R +

+x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we start again by naming a vector 'on the fly'
+x[-"a"]


Error in -"a": invalid argument to unary operator

However, we can use the != (not-equals) operator to +construct a logical vector that will do what we want:


R +

+x[names(x) != "a"]


  b   c   d   e 
+6.2 7.1 4.8 7.5 

Skipping multiple named indices is a little bit harder still. Suppose +we want to drop the "a" and "c" elements, so +we try this:


R +

+x[names(x) != c("a", "c")]


Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length


  b   c   d   e 
+6.2 7.1 4.8 7.5 

R did something, but it gave us a warning that we ought to +pay attention to - and it apparently gave us the wrong answer +(the "c" element is still included in the vector)!


So what does != actually do in this case? That’s an +excellent question.


Recycling +


Let’s take a look at the comparison component of this code:


R +

+names(x) != c("a", "c")


Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length


[1] FALSE  TRUE  TRUE  TRUE  TRUE

Why does R give TRUE as the third element of this +vector, when names(x)[3] != "c" is obviously false? When +you use !=, R tries to compare each element of the left +argument with the corresponding element of its right argument. What +happens when you compare vectors of different lengths?

Inequality testing

When one vector is shorter than the other, it gets +recycled:

Inequality testing: results of recycling

In this case R repeats c("a", "c") as many times as necessary to match names(x), i.e. we get c("a","c","a","c","a"). Since the recycled "a" doesn’t match the third element of names(x), the value of != is TRUE. Because in this case the longer vector length (5) isn’t a multiple of the shorter vector length (2), R printed a warning message. If we had been unlucky and names(x) had contained six elements, R would silently have done the wrong thing (i.e., not what we intended it to do). This recycling rule can introduce subtle, hard-to-find bugs!


The way to get R to do what we really want (match each element of the left argument with all of the elements of the right argument) is to use the %in% operator. The %in% operator goes through each element of its left argument, in this case the names of x, and asks, “Does this element occur in the second argument?”. Here, since we want to exclude values, we also need a ! operator to change “in” to “not in”:


R +

+x[! names(x) %in% c("a","c") ]


  b   d   e 
+6.2 4.8 7.5 
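The silent failure mentioned above, where the longer length is a multiple of the shorter one, can be reproduced with a six-element vector. This is a sketch; the vector `y` and the names `wrong` and `right` are hypothetical.

```r
y <- c(a = 1, b = 2, c = 3, d = 4, e = 5, f = 6)

# Six names against two values: recycling fits exactly, so R gives no warning,
# yet "c" is compared with the recycled "a" and is kept.
wrong <- y[names(y) != c("a", "c")]

# %in% checks each name against every value on the right-hand side.
right <- y[!names(y) %in% c("a", "c")]

names(wrong)
names(right)
```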
+ +

Challenge 3 +


Selecting elements of a vector that match any of a list of components +is a very common data analysis task. For example, the gapminder data set +contains country and continent variables, but +no information between these two scales. Suppose we want to pull out +information from southeast Asia: how do we set up an operation to +produce a logical vector that is TRUE for all of the +countries in southeast Asia and FALSE otherwise?


Suppose you have these data:


R +

+seAsia <- c("Myanmar","Thailand","Cambodia","Vietnam","Laos")
+## read in the gapminder data that we downloaded in episode 2
+gapminder <- read.csv("data/gapminder_data.csv", header=TRUE)
+## extract the `country` column from a data frame (we'll see this later);
+## convert from a factor to a character;
+## and get just the non-repeated elements
+countries <- unique(as.character(gapminder$country))

There’s a wrong way (using only ==), which will give you +a warning; a clunky way (using the logical operators == and +|); and an elegant way (using %in%). See +whether you can come up with all three and explain how they (don’t) +work.

+ +
  • The wrong way to do this problem is countries==seAsia. This gives a warning ("In countries == seAsia : longer object length is not a multiple of shorter object length") and the wrong answer (a vector of all FALSE values), because none of the recycled values of seAsia happen to line up correctly with matching values in countries.
  • +
  • The clunky (but technically correct) way to do this +problem is
  • +

R +

+ (countries=="Myanmar" | countries=="Thailand" |
+ countries=="Cambodia" | countries == "Vietnam" | countries=="Laos")

(or countries==seAsia[1] | countries==seAsia[2] | ...). +This gives the correct values, but hopefully you can see how awkward it +is (what if we wanted to select countries from a much longer list?).

  • The best way to do this problem is +countries %in% seAsia, which is both correct and easy to +type (and read).
  • +

Handling special values +


At some point you will encounter functions in R that cannot handle +missing, infinite, or undefined data.


There are a number of special functions you can use to filter out +this data:

  • +is.na will return all positions in a vector, matrix, or +data.frame containing NA (or NaN)
  • +
  • likewise, is.nan, and is.infinite will do +the same for NaN and Inf.
  • +
  • +is.finite will return all positions in a vector, +matrix, or data.frame that do not contain NA, +NaN or Inf.
  • +
  • +na.omit will filter out all missing values from a +vector
  • +
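The functions listed above can be compared side by side on one small vector. A minimal sketch, with `v` as a made-up example:

```r
v <- c(1.5, NA, Inf, NaN, -2)

is.na(v)         # TRUE for NA and for NaN
is.nan(v)        # TRUE only for NaN
is.infinite(v)   # TRUE only for Inf and -Inf
is.finite(v)     # TRUE only for ordinary numbers

v[is.finite(v)]  # keep the ordinary numbers
na.omit(v)       # drops NA and NaN but keeps Inf
```

Each of the `is.*` functions returns a logical vector, so the result can be used directly inside `[` to filter.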

Factor subsetting +


Now that we’ve explored the different ways to subset vectors, how do +we subset the other data structures?


Factor subsetting works the same way as vector subsetting.


R +

+f <- factor(c("a", "a", "b", "c", "c", "d"))
+f[f == "a"]


[1] a a
+Levels: a b c d

R +

+f[f %in% c("b", "c")]


[1] b c c
+Levels: a b c d

R +

+f[1:3]


[1] a a b
+Levels: a b c d

Skipping elements will not remove the level even if no more of that +category exists in the factor:


R +

+f[-3]


[1] a a c c d
+Levels: a b c d
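If the unused level should go away as well, base R provides `droplevels()`. A short sketch using the same factor, with `skipped` as an illustrative name:

```r
f <- factor(c("a", "a", "b", "c", "c", "d"))

skipped <- f[-3]               # remove the only "b" value
levels(skipped)                # "b" is still listed as a level

levels(droplevels(skipped))    # unused levels are removed
```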

Matrix subsetting +


Matrices are also subsetted using the [ function. In +this case it takes two arguments: the first applying to the rows, the +second to its columns:


R +

+m <- matrix(rnorm(6*4), ncol=4, nrow=6)
+m[3:4, c(3,1)]


            [,1]       [,2]
+[1,]  1.12493092 -0.8356286
+[2,] -0.04493361  1.5952808

You can leave the first or second arguments blank to retrieve all the +rows or columns respectively:


R +

+m[, c(3,4)]


            [,1]        [,2]
+[1,] -0.62124058  0.82122120
+[2,] -2.21469989  0.59390132
+[3,]  1.12493092  0.91897737
+[4,] -0.04493361  0.78213630
+[5,] -0.01619026  0.07456498
+[6,]  0.94383621 -1.98935170

If we only access one row or column, R will automatically convert the +result to a vector:


R +

+m[3, ]


[1] -0.8356286  0.5757814  1.1249309  0.9189774

If you want to keep the output as a matrix, you need to specify a third argument, drop = FALSE:


R +

+m[3, , drop=FALSE]


           [,1]      [,2]     [,3]      [,4]
+[1,] -0.8356286 0.5757814 1.124931 0.9189774

Unlike vectors, if we try to access a row or column outside of the +matrix, R will throw an error:


R +

+m[, c(3,6)]


Error in m[, c(3, 6)]: subscript out of bounds
+ +

Tip: Higher dimensional arrays +


When dealing with multi-dimensional arrays, each argument to [ corresponds to a dimension. For example, for a 3D array, the first three arguments correspond to the rows, columns, and depth dimension.
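A minimal sketch of this, using a made-up 2 x 3 x 4 array (the names `arr`, `single_value`, and `slice` are assumptions for illustration only):

```r
# Build a 3D array: 2 rows, 3 columns, 4 "depth" layers
arr <- array(1:24, dim = c(2, 3, 4))

# One index per dimension: row 1, column 2, layer 3
single_value <- arr[1, 2, 3]

# Leaving an index blank keeps that whole dimension:
# all rows and all columns of the first layer
slice <- arr[, , 1]
dim(slice)
```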


Because matrices are vectors, we can also subset using only one +argument:


R +



[1] 0.3295078

This usually isn’t useful, and is often confusing to read. However, it is useful to note that matrices are laid out in column-major format by default. That is, the elements of the vector are arranged column-wise:


R +

+matrix(1:6, nrow=2, ncol=3)


     [,1] [,2] [,3]
+[1,]    1    3    5
+[2,]    2    4    6

If you wish to populate the matrix by row, use +byrow=TRUE:


R +

+matrix(1:6, nrow=2, ncol=3, byrow=TRUE)


     [,1] [,2] [,3]
+[1,]    1    2    3
+[2,]    4    5    6

Matrices can also be subsetted using their rownames and column names +instead of their row and column indices.
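For instance, here is a small sketch with a made-up named matrix (the row names `r1`, `r2` and column names `a`, `b`, `c` are invented for this example):

```r
# A 2 x 3 matrix with row and column names
m_named <- matrix(1:6, nrow = 2, ncol = 3,
                  dimnames = list(c("r1", "r2"), c("a", "b", "c")))

# Subset by name ...
by_name <- m_named["r2", c("a", "c")]

# ... which selects the same elements as subsetting by position
by_index <- m_named[2, c(1, 3)]
```

Subsetting by name tends to be more robust than subsetting by index when the column order might change.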

+ +

Challenge 4 +


Given the following code:


R +

+m <- matrix(1:18, nrow=3, ncol=6)


     [,1] [,2] [,3] [,4] [,5] [,6]
+[1,]    1    4    7   10   13   16
+[2,]    2    5    8   11   14   17
+[3,]    3    6    9   12   15   18
  1. Which of the following commands will extract the values 11 and +14?
  2. +

A. m[2,4,2,5]


B. m[2:5]


C. m[4:5,2]


D. m[2,c(4,5)]

+ +



List subsetting +


Now we’ll introduce some new subsetting operators. There are three +functions used to subset lists. We’ve already seen these when learning +about atomic vectors and matrices: [, [[, and +$.


Using [ will always return a list. If you want to +subset a list, but not extract an element, then you +will likely use [.


R +

+xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))
+xlist[1]


$a
+[1] "Software Carpentry"

This returns a list with one element.


We can subset elements of a list exactly the same way as atomic vectors using [. Comparison operations, however, won’t work because they are not recursive: they will try to condition on the data structures in each element of the list, not on the individual elements within those data structures.


R +

+xlist[1:2]


$a
+[1] "Software Carpentry"
+
+$b
+ [1]  1  2  3  4  5  6  7  8  9 10

To extract individual elements of a list, you need to use the +double-square bracket function: [[.


R +

+xlist[[1]]


[1] "Software Carpentry"

Notice that now the result is a vector, not a list.
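The difference between the two operators can be checked with `class()`. This is a small sketch using a shortened version of the list (the `data` element is left out to keep the example compact):

```r
xlist_small <- list(a = "Software Carpentry", b = 1:10)

sub_list <- xlist_small[1]    # single brackets: still a list
element  <- xlist_small[[1]]  # double brackets: the element itself

class(sub_list)
class(element)

# Extracting by name with [[ or $ gives the same object
same_result <- identical(xlist_small[["a"]], xlist_small$a)
```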


You can’t extract more than one element at once:


R +

+xlist[[1:2]]


Error in xlist[[1:2]]: subscript out of bounds

Nor use it to skip elements:


R +

+xlist[[-1]]


Error in xlist[[-1]]: invalid negative subscript in get1index <real>

But you can use names to both subset and extract elements:


R +

+xlist[["a"]]


[1] "Software Carpentry"

The $ function is a shorthand way for extracting +elements by name:


R +

+xlist$data


                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
+Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
+Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
+Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
+Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
+Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
+Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
+ +

Challenge 5 +


Given the following list:


R +

+xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))

Using your knowledge of both list and vector subsetting, extract the +number 2 from xlist. Hint: the number 2 is contained within the “b” item +in the list.

+ +

R +

+xlist$b[2]


[1] 2

R +

+xlist[[2]][2]


[1] 2

R +

+xlist[["b"]][2]


[1] 2
+ +

Challenge 6 +


Given a linear model:


R +

+mod <- aov(pop ~ lifeExp, data=gapminder)

Extract the residual degrees of freedom (hint: +attributes() will help you)

+ +

R +

+attributes(mod) ## `df.residual` is one of the names of `mod`

R +

+mod$df.residual

Data frames +


Remember that data frames are lists under the hood, so similar rules apply. However, they are also two-dimensional objects:


[ with one argument will act the same way as for lists, +where each list element corresponds to a column. The resulting object +will be a data frame:


R +

+head(gapminder[3])


       pop
+1  8425333
+2  9240934
+3 10267083
+4 11537966
+5 13079460
+6 14880372

Similarly, [[ will act to extract a single +column:


R +

+head(gapminder[["lifeExp"]])


[1] 28.801 30.332 31.997 34.020 36.088 38.438

And $ provides a convenient shorthand to extract columns +by name:


R +

+head(gapminder$year)


[1] 1952 1957 1962 1967 1972 1977

With two arguments, [ behaves the same way as for +matrices:


R +

+gapminder[1:3,]


      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007

If we subset a single row, the result will be a data frame (because +the elements are mixed types):


R +

+gapminder[3,]


      country year      pop continent lifeExp gdpPercap
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007

But for a single column the result will be a vector (this can be +changed with the third argument, drop = FALSE).
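These rules can be checked on a small made-up data frame (`toy` and its values are invented for this sketch and stand in for gapminder):

```r
toy <- data.frame(country = c("A", "B", "C"),
                  year    = c(1952, 1957, 1962),
                  pop     = c(10, 20, 30))

one_row    <- toy[2, ]                      # single row: still a data frame
one_col    <- toy[, "pop"]                  # single column: simplified to a vector
one_col_df <- toy[, "pop", drop = FALSE]    # drop = FALSE keeps the data frame

is.data.frame(one_row)
is.data.frame(one_col)
is.data.frame(one_col_df)
```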

+ +

Challenge 7 +


Fix each of the following common data frame subsetting errors:

  1. Extract observations collected for the year 1957
  2. +

R +

gapminder[gapminder$year = 1957,]
  1. Extract all columns except 1 through to 4
  2. +

R +

+gapminder[,-1:4]
  1. Extract the rows where the life expectancy is longer than 80 years
  2. +

R +

+gapminder[gapminder$lifeExp > 80]
  1. Extract the first row, and the fourth and fifth columns +(continent and lifeExp).
  2. +

R +

+gapminder[1, 4, 5]
  1. Advanced: extract rows that contain information for the years 2002 +and 2007
  2. +

R +

+gapminder[gapminder$year == 2002 | 2007,]
+ +

Fix each of the following common data frame subsetting errors:

  1. Extract observations collected for the year 1957
  2. +

R +

+# gapminder[gapminder$year = 1957,]
+gapminder[gapminder$year == 1957,]
  1. Extract all columns except 1 through to 4
  2. +

R +

+# gapminder[,-1:4]
+gapminder[,-c(1:4)]
  1. Extract the rows where the life expectancy is longer than 80 +years
  2. +

R +

+# gapminder[gapminder$lifeExp > 80]
+gapminder[gapminder$lifeExp > 80,]
  1. Extract the first row, and the fourth and fifth columns +(continent and lifeExp).
  2. +

R +

+# gapminder[1, 4, 5]
+gapminder[1, c(4, 5)]
  1. Advanced: extract rows that contain information for the years 2002 +and 2007
  2. +

R +

+# gapminder[gapminder$year == 2002 | 2007,]
+gapminder[gapminder$year == 2002 | gapminder$year == 2007,]
+gapminder[gapminder$year %in% c(2002, 2007),]
+ +

Challenge 8 +

  1. Why does gapminder[1:20] return an error? How does +it differ from gapminder[1:20, ]?

  2. +
  3. Create a new data.frame called +gapminder_small that only contains rows 1 through 9 and 19 +through 23. You can do this in one or two steps.

  4. +
+ +
  1. gapminder is a data.frame so needs to be subsetted +on two dimensions. gapminder[1:20, ] subsets the data to +give the first 20 rows and all columns.

  2. +
  3. +
  4. +

R +

+gapminder_small <- gapminder[c(1:9, 19:23),]
+ +

Keypoints +

  • Indexing in R starts at 1, not 0.
  • +
  • Access individual values by location using [].
  • +
  • Access slices of data using [low:high].
  • +
  • Access arbitrary sets of data using [c(...)].
  • +
  • Use logical operations and logical vectors to access subsets of +data.
  • +

Content from Control Flow


Last updated on 2023-10-26


Estimated time 65 minutes

+ +




  • How can I make data-dependent choices in R?
  • +
  • How can I repeat operations in R?
  • +


  • Write conditional statements with if...else statements +and ifelse().
  • +
  • Write and understand for() loops.
  • +

Often when we’re coding we want to control the flow of our actions. +This can be done by setting actions to occur only if a condition or a +set of conditions are met. Alternatively, we can also set an action to +occur a particular number of times.


There are several ways you can control flow in R. For conditional +statements, the most commonly used approaches are the constructs:


R +

# if
+if (condition is true) {
+  perform action
+}
+
+# if ... else
+if (condition is true) {
+  perform action
+} else {  # that is, if the condition is false,
+  perform alternative action
+}

Say, for example, that we want R to print a message if a variable +x has a particular value:


R +

+x <- 8
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+}
+x


[1] 8

The print statement does not appear in the console because x is not greater than or equal to 10. To print a different message for numbers less than 10, we can add an else statement.


R +

+x <- 8
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else {
+  print("x is less than 10")
+}


[1] "x is less than 10"

You can also test multiple conditions by using +else if.


R +

+x <- 8
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else if (x > 5) {
+  print("x is greater than 5, but less than 10")
+} else {
+  print("x is less than or equal to 5")
+}


[1] "x is greater than 5, but less than 10"

Important: when R evaluates the condition inside +if() statements, it is looking for a logical element, i.e., +TRUE or FALSE. This can cause some headaches +for beginners. For example:


R +

+x  <-  4 == 3
+if (x) {
+  "4 equals 3"
+} else {
+  "4 does not equal 3"
+}


[1] "4 does not equal 3"

As we can see, the not equal message was printed because the vector x +is FALSE


R +

+x <- 4 == 3
+x


[1] FALSE


+ +

Challenge 1 +


Use an if() statement to print a suitable message +reporting whether there are any records from 2002 in the +gapminder dataset. Now do the same for 2012.

+ +

We will first see a solution to Challenge 1 which does not use the any() function. We first subset the rows of gapminder for which gapminder$year is equal to 2002:


R +

+gapminder[(gapminder$year == 2002),]

Then, we count the number of rows of the data.frame gapminder that correspond to the year 2002:


R +

+rows2002_number <- nrow(gapminder[(gapminder$year == 2002),])

The presence of any record for the year 2002 is equivalent to requiring that rows2002_number is one or more:


R +

+rows2002_number >= 1

Putting it all together, we obtain:


R +

+if(nrow(gapminder[(gapminder$year == 2002),]) >= 1){
+   print("Record(s) for the year 2002 found.")
+}

All this can be done more quickly with any(). The +logical condition can be expressed as:


R +

+if(any(gapminder$year == 2002)){
+   print("Record(s) for the year 2002 found.")
+}

Did anyone get an error message like this?



Error in if (gapminder$year == 2012) {: the condition has length > 1

The if() function only accepts conditions of length 1. Since R 4.2.0, passing a longer vector as the condition is an error; in earlier versions R issued a warning and evaluated only the first element of the vector. Therefore, to use the if() function, you need to make sure your input is singular (of length 1), for example by wrapping the comparison in any() or all().

+ +

Tip: Built in ifelse() +function +


R accepts both if() and +else if() statements structured as outlined above, but also +statements using R’s built-in ifelse() +function. This function accepts both singular and vector inputs and is +structured as follows:


R +

# ifelse function
+ifelse(condition is true, perform action, perform alternative action)

where the first argument is the condition or a set of conditions to be met, the second argument is the statement that is evaluated when the condition is TRUE, and the third argument is the statement that is evaluated when the condition is FALSE.


R +

+y <- -3
+ifelse(y < 0, "y is a negative number", "y is either positive or zero")


[1] "y is a negative number"
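Because `ifelse()` is vectorized, the same call also works on a longer vector and returns one result per element. A minimal sketch, with a made-up vector `y_values`:

```r
y_values <- c(-3, 0, 4)

# One label per element of y_values
labels <- ifelse(y_values < 0, "negative", "positive or zero")
labels
```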
+ +

Tip: any() and +all() +


The any() function will return TRUE if at +least one TRUE value is found within a vector, otherwise it +will return FALSE. This can be used in a similar way to the +%in% operator. The function all(), as the name +suggests, will only return TRUE if all values in the vector +are TRUE.
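A short sketch of how these behave, using a made-up vector of years rather than the gapminder data:

```r
years <- c(1952, 2002, 2007)

has_2002 <- any(years == 2002)   # at least one element equals 2002
all_2002 <- all(years == 2002)   # every element equals 2002
in_2002  <- 2002 %in% years      # same answer as any() in this case

# Each result has length 1, so it is safe to use inside if()
if (has_2002) {
  print("Record(s) for the year 2002 found.")
}
```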


Repeating operations +


If you want to iterate over a set of values, when the order of +iteration is important, and perform the same operation on each, a +for() loop will do the job. We saw for() loops +in the shell +lessons earlier. This is the most flexible of looping operations, +but therefore also the hardest to use correctly. In general, the advice +of many R users would be to learn about for() +loops, but to avoid using for() loops unless the order of +iteration is important: i.e. the calculation at each iteration depends +on the results of previous iterations. If the order of iteration is not +important, then you should learn about vectorized alternatives, such as +the purrr package, as they pay off in computational +efficiency.
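To make the contrast concrete, here is a minimal sketch of the same calculation written as a loop, as a vectorized expression, and with base R's `vapply()` (used here instead of purrr only to avoid an extra package; the variable names are made up for this example):

```r
values <- 1:5

# Loop version: fill a pre-allocated vector one element at a time
squares_loop <- numeric(length(values))
for (i in seq_along(values)) {
  squares_loop[i] <- values[i]^2
}

# Vectorized version: the operator works on the whole vector at once
squares_vec <- values^2

# Functional version: apply a function to each element,
# declaring that each result is a single number
squares_apply <- vapply(values, function(v) v^2, numeric(1))
```

All three give the same result here; the order of iteration does not matter for this calculation, which is why the loop is not needed.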


The basic structure of a for() loop is:


R +

for (iterator in set of values) {
+  do a thing
+}

For example:


R +

+for (i in 1:10) {
+  print(i)
+}


[1] 1
+[1] 2
+[1] 3
+[1] 4
+[1] 5
+[1] 6
+[1] 7
+[1] 8
+[1] 9
+[1] 10

The 1:10 bit creates a vector on the fly; you can +iterate over any other vector as well.


We can use a for() loop nested within another +for() loop to iterate over two things at once.


R +

+for (i in 1:5) {
+  for (j in c('a', 'b', 'c', 'd', 'e')) {
+    print(paste(i,j))
+  }
+}


[1] "1 a"
+[1] "1 b"
+[1] "1 c"
+[1] "1 d"
+[1] "1 e"
+[1] "2 a"
+[1] "2 b"
+[1] "2 c"
+[1] "2 d"
+[1] "2 e"
+[1] "3 a"
+[1] "3 b"
+[1] "3 c"
+[1] "3 d"
+[1] "3 e"
+[1] "4 a"
+[1] "4 b"
+[1] "4 c"
+[1] "4 d"
+[1] "4 e"
+[1] "5 a"
+[1] "5 b"
+[1] "5 c"
+[1] "5 d"
+[1] "5 e"

We notice in the output that when the first index (i) is +set to 1, the second index (j) iterates through its full +set of indices. Once the indices of j have been iterated +through, then i is incremented. This process continues +until the last index has been used for each for() loop.


Rather than printing the results, we could write the loop output to a +new object.


R +

+output_vector <- c()
+for (i in 1:5) {
+  for (j in c('a', 'b', 'c', 'd', 'e')) {
+    temp_output <- paste(i, j)
+    output_vector <- c(output_vector, temp_output)
+  }
+}
+output_vector


 [1] "1 a" "1 b" "1 c" "1 d" "1 e" "2 a" "2 b" "2 c" "2 d" "2 e" "3 a" "3 b"
+[13] "3 c" "3 d" "3 e" "4 a" "4 b" "4 c" "4 d" "4 e" "5 a" "5 b" "5 c" "5 d"
+[25] "5 e"

This approach can be useful, but ‘growing your results’ (building the +result object incrementally) is computationally inefficient, so avoid it +when you are iterating through a lot of values.

+ +

Tip: don’t grow your results +


One of the biggest things that trips up novices and experienced R users alike is building a results object (vector, list, matrix, data frame) as your for loop progresses. Computers are very bad at handling this, so your calculations can very quickly slow to a crawl. It’s much better to define an empty results object of appropriate dimensions beforehand, rather than initializing an empty object without dimensions. So if you know the end result will be stored in a matrix like above, create an empty matrix with 5 rows and 5 columns, then at each iteration store the results in the appropriate location.


A better way is to define your (empty) output object before filling +in the values. For this example, it looks more involved, but is still +more efficient.


R +

+output_matrix <- matrix(nrow = 5, ncol = 5)
+j_vector <- c('a', 'b', 'c', 'd', 'e')
+for (i in 1:5) {
+  for (j in 1:5) {
+    temp_j_value <- j_vector[j]
+    temp_output <- paste(i, temp_j_value)
+    output_matrix[i, j] <- temp_output
+  }
+}
+output_vector2 <- as.vector(output_matrix)
+output_vector2


 [1] "1 a" "2 a" "3 a" "4 a" "5 a" "1 b" "2 b" "3 b" "4 b" "5 b" "1 c" "2 c"
+[13] "3 c" "4 c" "5 c" "1 d" "2 d" "3 d" "4 d" "5 d" "1 e" "2 e" "3 e" "4 e"
+[25] "5 e"
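The same idea applies when the results are not all the same size. One possible sketch pre-allocates a list with `vector("list", n)`; the names `n` and `results` are made up for this example:

```r
n <- 3

# Pre-allocate a list with n empty slots instead of growing it
results <- vector("list", n)

for (i in seq_len(n)) {
  # Fill slot i in place; each result may have a different length
  results[[i]] <- seq_len(i)
}

lengths(results)
```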
+ +

Tip: While loops +


Sometimes you will find yourself needing to repeat an operation as +long as a certain condition is met. You can do this with a +while() loop.


R +

while(this condition is true){
+  do a thing
+}

R will interpret a condition being met as “TRUE”.


As an example, here’s a while loop that generates random numbers from +a uniform distribution (the runif() function) between 0 and +1 until it gets one that’s less than 0.1.


R +

+z <- 1
+while(z > 0.1){
+  z <- runif(1)
+  cat(z, "\n")
+}

while() loops will not always be appropriate. You have +to be particularly careful that you don’t end up stuck in an infinite +loop because your condition is always met and hence the while statement +never terminates.
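One common safeguard, sketched here under the assumption that an upper limit on iterations is acceptable, is to add a counter to the condition. The names `max_iter` and `n_iter` and the limit of 1000 are made up for this example:

```r
set.seed(42)       # arbitrary seed, only so the run is repeatable
max_iter <- 1000   # hypothetical safety limit on the number of iterations
n_iter <- 0
z <- 1

# Stop when z is small enough OR when the safety limit is reached
while (z > 0.1 && n_iter < max_iter) {
  z <- runif(1)
  n_iter <- n_iter + 1
}

n_iter
```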

+ +

Challenge 2 +


Compare the objects output_vector and +output_vector2. Are they the same? If not, why not? How +would you change the last block of code to make +output_vector2 the same as output_vector?

+ +

We can check whether the two vectors are identical using the +all() function:


R +

+all(output_vector == output_vector2)

However, all the elements of output_vector can be found +in output_vector2:


R +

+all(output_vector %in% output_vector2)

and vice versa:


R +

+all(output_vector2 %in% output_vector)

Therefore, the elements in output_vector and output_vector2 are just sorted in a different order. This is because as.vector() outputs the elements of an input matrix column by column. Taking a look at output_matrix, we can see that we want its elements by rows. The solution is to transpose the output_matrix. We can do it either by calling the transpose function t() or by inputting the elements in the right order. The first solution requires changing the original line (first block below) into the transposed version (second block):


R +

+output_vector2 <- as.vector(output_matrix)



R +

+output_vector2 <- as.vector(t(output_matrix))

The second solution requires changing the assignment inside the loop from the first block below into the second:


R +

+output_matrix[i, j] <- temp_output



R +

+output_matrix[j, i] <- temp_output
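The effect of `t()` on the order produced by `as.vector()` is easy to see on a small made-up matrix (`small` is not part of the challenge code):

```r
small <- matrix(1:6, nrow = 2, ncol = 3)

# Column by column: the default, column-major order
by_column <- as.vector(small)

# Transposing first gives the elements row by row
by_row <- as.vector(t(small))
```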
+ +

Challenge 3 +


Write a script that loops through the gapminder data by +continent and prints out whether the mean life expectancy is smaller or +larger than 50 years.

+ +

Step 1: We want to make sure we can extract all the +unique values of the continent vector


R +

+gapminder <- read.csv("data/gapminder_data.csv")
+unique(gapminder$continent)

Step 2: We also need to loop over each of these +continents and calculate the average life expectancy for each +subset of data. We can do that as follows:

  1. Loop over each of the unique values of ‘continent’
  2. For each value of continent, create a temporary variable storing that subset
  3. Return the calculated life expectancy to the user by printing the output:

R +

+for (iContinent in unique(gapminder$continent)) {
+  tmp <- gapminder[gapminder$continent == iContinent, ]
+  cat(iContinent, mean(tmp$lifeExp, na.rm = TRUE), "\n")
+  rm(tmp)
+}

Step 3: The exercise only wants the output printed +if the average life expectancy is less than 50 or greater than 50. So we +need to add an if() condition before printing, which +evaluates whether the calculated average life expectancy is above or +below a threshold, and prints an output conditional on the result. We +need to amend (3) from above:


3a. If the calculated life expectancy is less than some threshold (50 +years), return the continent and a statement that life expectancy is +less than threshold, otherwise return the continent and a statement that +life expectancy is greater than threshold:


R +

+thresholdValue <- 50
+for (iContinent in unique(gapminder$continent)) {
+   tmp <- mean(gapminder[gapminder$continent == iContinent, "lifeExp"])
+   if (tmp < thresholdValue){
+       cat("Average Life Expectancy in", iContinent, "is less than", thresholdValue, "\n")
+   } else {
+       cat("Average Life Expectancy in", iContinent, "is greater than", thresholdValue, "\n")
+   } # end if else condition
+   rm(tmp)
+} # end for loop
+ +

Challenge 4 +


Modify the script from Challenge 3 to loop over each country. This +time print out whether the life expectancy is smaller than 50, between +50 and 70, or greater than 70.

+ +

We modify our solution to Challenge 3 by now adding two thresholds, +lowerThreshold and upperThreshold and +extending our if-else statements:


R +

+ lowerThreshold <- 50
+ upperThreshold <- 70
+for (iCountry in unique(gapminder$country)) {
+    tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"])
+    if(tmp < lowerThreshold) {
+        cat("Average Life Expectancy in", iCountry, "is less than", lowerThreshold, "\n")
+    } else if(tmp > lowerThreshold && tmp < upperThreshold) {
+        cat("Average Life Expectancy in", iCountry, "is between", lowerThreshold, "and", upperThreshold, "\n")
+    } else {
+        cat("Average Life Expectancy in", iCountry, "is greater than", upperThreshold, "\n")
+    }
+    rm(tmp)
+}
+ +

Challenge 5 - Advanced +


Write a script that loops over each country in the +gapminder dataset, tests whether the country starts with a +‘B’, and graphs life expectancy against time as a line graph if the mean +life expectancy is under 50 years.

+ +

We will use the grep() command that was introduced in the Unix Shell lesson to find countries that start with “B.” Let’s understand how to do this first. Following from the Unix shell section we may be tempted to try the following


R +

+grep("^B", unique(gapminder$country))

But when we evaluate this command it returns the indices of the +factor variable country that start with “B.” To get the +values, we must add the value=TRUE option to the +grep() command:


R +

+grep("^B", unique(gapminder$country), value = TRUE)
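The difference between the two calls can be seen on a short made-up character vector (`countries_demo` is invented for this sketch and stands in for the gapminder countries):

```r
countries_demo <- c("Belgium", "Chad", "Brazil", "Albania")

# Without value = TRUE: positions of the matching elements
match_index <- grep("^B", countries_demo)

# With value = TRUE: the matching elements themselves
match_value <- grep("^B", countries_demo, value = TRUE)
```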

We will now store these countries in a variable called +candidateCountries, and then loop over each entry in the variable. +Inside the loop, we evaluate the average life expectancy for each +country, and if the average life expectancy is less than 50 we use +base-plot to plot the evolution of average life expectancy using +with() and subset():


R +

+thresholdValue <- 50
+candidateCountries <- grep("^B", unique(gapminder$country), value = TRUE)
+for (iCountry in candidateCountries) {
+    tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"])
+    if (tmp < thresholdValue) {
+        cat("Average Life Expectancy in", iCountry, "is less than", thresholdValue, "plotting life expectancy graph... \n")
+        with(subset(gapminder, country == iCountry),
+                plot(year, lifeExp,
+                     type = "o",
+                     main = paste("Life Expectancy in", iCountry, "over time"),
+                     ylab = "Life Expectancy",
+                     xlab = "Year"
+                     ) # end plot
+             ) # end with
+    } # end if
+    rm(tmp)
+} # end for loop
+ +

Keypoints +

  • Use if and else to make choices.
  • +
  • Use for to repeat operations.
  • +

Content from Creating Publication-Quality Graphics with ggplot2


Last updated on 2023-10-26


Estimated time 80 minutes

+ +




  • How can I create publication-quality graphics in R?
  • +


  • To be able to use ggplot2 to generate publication-quality +graphics.
  • +
  • To apply geometry, aesthetic, and statistics layers to a ggplot +plot.
  • +
  • To manipulate the aesthetics of a plot using different colors, +shapes, and lines.
  • +
  • To improve data visualization through transforming scales and +paneling by group.
  • +
  • To save a plot created with ggplot to disk.
  • +

Plotting our data is one of the best ways to quickly explore it and +the various relationships between variables.


There are three main plotting systems in R, the base plotting +system, the lattice +package, and the ggplot2 +package.


Today we’ll be learning about the ggplot2 package, because it is the +most effective for creating publication-quality graphics.


ggplot2 is built on the grammar of graphics, the idea that any plot +can be built from the same set of components: a data +set, mapping aesthetics, and graphical +layers:

  • Data sets are the data that you, the user, +provide.

  • +
  • Mapping aesthetics are what connect the data to +the graphics. They tell ggplot2 how to use your data to affect how the +graph looks, such as changing what is plotted on the X or Y axis, or the +size or color of different data points.

  • +
  • Layers are the actual graphical output from +ggplot2. Layers determine what kinds of plot are shown (scatterplot, +histogram, etc.), the coordinate system used (rectangular, polar, +others), and other important aspects of the plot. The idea of layers of +graphics may be familiar to you if you have used image editing programs +like Photoshop, Illustrator, or Inkscape.

  • +

Let’s start off building an example using the gapminder data from +earlier. The most basic function is ggplot, which lets R +know that we’re creating a new plot. Any of the arguments we give the +ggplot function are the global options for the +plot: they apply to all layers on the plot.


R +

+ggplot(data = gapminder)
Blank plot, before adding any mapping aesthetics to ggplot().

Here we called ggplot and told it what data we want to +show on our figure. This is not enough information for +ggplot to actually draw anything. It only creates a blank +slate for other elements to be added to.


Now we’re going to add in the mapping aesthetics +using the aes function. aes tells +ggplot how variables in the data map to +aesthetic properties of the figure, such as which columns of +the data should be used for the x and +y locations.


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
Plotting area with axes for a scatter plot of life expectancy vs GDP, with no data points visible.

Here we told ggplot we want to plot the “gdpPercap” +column of the gapminder data frame on the x-axis, and the “lifeExp” +column on the y-axis. Notice that we didn’t need to explicitly pass +aes these columns +(e.g. x = gapminder[, "gdpPercap"]), this is because +ggplot is smart enough to know to look in the +data for that column!


The final part of making our plot is to tell ggplot how +we want to visually represent the data. We do this by adding a new +layer to the plot using one of the +geom functions.


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point()
Scatter plot of life expectancy vs GDP per capita, now showing the data points.

Here we used geom_point, which tells ggplot +we want to visually represent the relationship between +x and y as a scatterplot of +points.

+ +

Challenge 1 +


Modify the example so that the figure shows how life expectancy has +changed over time:


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point()

Hint: the gapminder dataset has a column called “year”, which should +appear on the x-axis.

+ +

Here is one possible solution:


R +

+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) + geom_point()
Binned scatterplot of life expectancy versus year showing how life expectancy has increased over time
+ +

Challenge 2 +


In the previous examples and challenge we’ve used the +aes function to tell the scatterplot geom +about the x and y locations of each +point. Another aesthetic property we can modify is the point +color. Modify the code from the previous challenge to +color the points by the “continent” column. What trends +do you see in the data? Are they what you expected?

+ +

The solution presented below adds color=continent to the +call of the aes function. The general trend seems to +indicate an increased life expectancy over the years. On continents with +stronger economies we find a longer life expectancy.


R +

+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_point()
Binned scatterplot of life expectancy vs year with color-coded continents showing value of 'aes' function

Layers +


Using a scatterplot probably isn’t the best for visualizing change +over time. Instead, let’s tell ggplot to visualize the data +as a line plot:


R +

+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, color=continent)) +
+  geom_line()

Instead of adding a geom_point layer, we’ve added a +geom_line layer.


However, the result doesn’t look quite as we might have expected: it +seems to be jumping around a lot in each continent. Let’s try to +separate the data by country, plotting one line for each country:


R +

+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
+  geom_line()

We’ve added the group aesthetic, which +tells ggplot to draw a line for each country.


But what if we want to visualize both lines and points on the plot? +We can add another layer to the plot:


R +

+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
+  geom_line() + geom_point()

It’s important to note that each layer is drawn on top of the +previous layer. In this example, the points have been drawn on top +of the lines. Here’s a demonstration:


R +

+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
+  geom_line(mapping = aes(color=continent)) + geom_point()

In this example, the aesthetic mapping of +color has been moved from the global plot options in +ggplot to the geom_line layer so it no longer +applies to the points. Now we can clearly see that the points are drawn +on top of the lines.

+ +

Tip: Setting an aesthetic to a value instead +of a mapping +


So far, we’ve seen how to use an aesthetic (such as +color) as a mapping to a variable in the data. +For example, when we use +geom_line(mapping = aes(color=continent)), ggplot will give +a different color to each continent. But what if we want to change the +color of all lines to blue? You may think that +geom_line(mapping = aes(color="blue")) should work, but it +doesn’t. Since we don’t want to create a mapping to a specific variable, +we can move the color specification outside of the aes() +function, like this: geom_line(color="blue").
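The two forms can be compared side by side. This is a sketch on a tiny made-up data frame (`toy` and its values are invented; it assumes the ggplot2 package is installed):

```r
library(ggplot2)

# Made-up data: two countries, two years each
toy <- data.frame(year    = c(1952, 1957, 1952, 1957),
                  lifeExp = c(30, 32, 60, 62),
                  country = c("A", "A", "B", "B"))

# Mapping: color depends on a variable, so it goes inside aes()
p_mapped <- ggplot(toy, aes(x = year, y = lifeExp, group = country)) +
  geom_line(mapping = aes(color = country))

# Setting: one fixed color for every line, so it goes outside aes()
p_fixed <- ggplot(toy, aes(x = year, y = lifeExp, group = country)) +
  geom_line(color = "blue")
```

Printing `p_mapped` should show one color per country with a legend, while `p_fixed` should show all lines in blue with no legend.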

+ +

Challenge 3 +


Switch the order of the point and line layers from the previous +example. What happened?

+ +

The lines now get drawn over the points!


R +

+ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
+ geom_point() + geom_line(mapping = aes(color=continent))
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency.

Transformations and statistics +


ggplot2 also makes it easy to overlay statistical models over the +data. To demonstrate we’ll go back to our first example:


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point()

Currently it’s hard to see the relationship between the points due to some strong outliers in GDP per capita. We can change the scale of units on the x axis using the scale functions. These control the mapping between the data values and visual values of an aesthetic. We can also modify the transparency of the points, using the alpha argument, which is especially helpful when you have a large amount of data which is very clustered.


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10()
Scatterplot of GDP vs life expectancy showing logarithmic x-axis data spread

The scale_x_log10 function applied a transformation to +the coordinate system of the plot, so that each multiple of 10 is evenly +spaced from left to right. For example, a GDP per capita of 1,000 is the +same horizontal distance away from a value of 10,000 as the 10,000 value +is from 100,000. This helps to visualize the spread of the data along +the x-axis.
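The equal spacing described above can be checked numerically in base R. This is only an illustrative sketch: the vector `gdp` holds made-up values chosen to be successive multiples of 10.

```r
# GDP per capita values that are each 10 times larger than the previous one
gdp <- c(1000, 10000, 100000)

# Positions on a log10 axis are the base-10 logarithms of the values
log_positions <- log10(gdp)
log_positions

# Distances between neighbouring positions are equal on the log scale...
diff(log_positions)

# ...but grow tenfold on the original linear scale
diff(gdp)
```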

+ +

Tip Reminder: Setting an aesthetic to a value +instead of a mapping +


Notice that we used geom_point(alpha = 0.5). As the +previous tip mentioned, using a setting outside of the +aes() function will cause this value to be used for all +points, which is what we want in this case. But just like any other +aesthetic setting, alpha can also be mapped to a variable in +the data. For example, we can give a different transparency to each +continent with +geom_point(mapping = aes(alpha = continent)).


We can fit a simple relationship to the data by adding another layer, +geom_smooth:


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm")


`geom_smooth()` using formula = 'y ~ x'
Scatter plot of life expectancy vs GDP per capita with a blue trend line summarising the relationship between variables, and gray shaded area indicating 95% confidence intervals for that trend line.

We can make the line thicker by setting the +size aesthetic in the geom_smooth +layer:


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm", size=1.5)


Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
+ℹ Please use `linewidth` instead.
+This warning is displayed once every 8 hours.
+Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
+generated.


`geom_smooth()` using formula = 'y ~ x'
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The blue trend line is slightly thicker than in the previous figure.

There are two ways an aesthetic can be specified. Here we +set the size aesthetic by passing it as an +argument to geom_smooth. Previously in the lesson we’ve +used the aes function to define a mapping between +data variables and their visual representation.

+ +

Challenge 4a +


Modify the color and size of the points on the point layer in the +previous example.


Hint: do not use the aes function.

+ +

Here is a possible solution: Notice that the color argument is supplied outside of the aes() function. This means that it applies to all data points on the graph and is not related to a specific variable.


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+ geom_point(size=3, color="orange") + scale_x_log10() +
+ geom_smooth(method="lm", size=1.5)


`geom_smooth()` using formula = 'y ~ x'
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency.
+ +

Challenge 4b +


Modify your solution to Challenge 4a so that the points are now a +different shape and are colored by continent with new trendlines. Hint: +The color argument can be used inside the aesthetic.

+ +

Here is a possible solution: Notice that supplying the color argument inside the aes() function enables you to connect it to a certain variable. The shape argument, as you can see, modifies all data points the same way (it is outside the aes() call), while the color argument, which is placed inside the aes() call, modifies a point’s color based on its continent value.


R +

+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
+ geom_point(size=3, shape=17) + scale_x_log10() +
+ geom_smooth(method="lm", size=1.5)


`geom_smooth()` using formula = 'y ~ x'

Multi-panel figures +


Earlier we visualized the change in life expectancy over time across +all countries in one plot. Alternatively, we can split this out over +multiple panels by adding a layer of facet panels.

+ +

Tip +


We start by making a subset of data including only countries located +in the Americas. This includes 25 countries, which will begin to clutter +the figure. Note that we apply a “theme” definition to rotate the x-axis +labels to maintain readability. Nearly everything in ggplot2 is +customizable.


R +

+americas <- gapminder[gapminder$continent == "Americas",]
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))

The facet_wrap layer took a “formula” as its argument, +denoted by the tilde (~). This tells R to draw a panel for each unique +value in the country column of the gapminder dataset.


Modifying text +


To clean this figure up for a publication we need to change some of +the text elements. The x-axis is too cluttered, and the y axis should +read “Life expectancy”, rather than the column name in the data +frame.


We can do this by adding a couple of different layers. The +theme layer controls the axis text, and overall text +size. Labels for the axes, plot title and any legend can be set using +the labs function. Legend titles are set using the same +names we used in the aes specification. Thus below the +color legend title is set using color = "Continent", while +the title of a fill legend would be set using +fill = "MyTitle".


R +

+ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_line() + facet_wrap( ~ country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Exporting the plot +


The ggsave() function allows you to export a plot +created with ggplot. You can specify the dimension and resolution of +your plot by adjusting the appropriate arguments (width, +height and dpi) to create high quality +graphics for publication. In order to save the plot from above, we first +assign it to a variable lifeExp_plot, then tell +ggsave to save that plot in png format to a +directory called results. (Make sure you have a +results/ folder in your working directory.)


R +

+lifeExp_plot <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
+  geom_line() + facet_wrap( ~ country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+ggsave(filename = "results/lifeExp.png", plot = lifeExp_plot, width = 12, height = 10, dpi = 300, units = "cm")

There are two nice things about ggsave. First, it +defaults to the last plot, so if you omit the plot argument +it will automatically save the last plot you created with +ggplot. Secondly, it tries to determine the format you want +to save your plot in from the file extension you provide for the +filename (for example .png or .pdf). If you +need to, you can specify the format explicitly in the +device argument.


This is a taste of what you can do with ggplot2. RStudio provides a +really useful cheat +sheet of the different layers available, and more extensive +documentation is available on the ggplot2 website. All +RStudio cheat sheets can be found here. Finally, +if you have no idea how to change something, a quick Google search will +usually send you to a relevant question and answer on Stack Overflow +with reusable code to modify!

+ +

Challenge 5 +


Generate boxplots to compare life expectancy between the different +continents during the available years.



  • Rename y axis as Life Expectancy.
  • Remove x axis labels.
+ +

Here is a possible solution: xlab() and ylab() set labels for the x and y axes, respectively. The axis title, text and ticks are attributes of the theme and must be modified within a theme() call.


R +

+ggplot(data = gapminder, mapping = aes(x = continent, y = lifeExp, fill = continent)) +
+ geom_boxplot() + facet_wrap(~year) +
+ ylab("Life Expectancy") +
+ theme(axis.title.x=element_blank(),
+       axis.text.x = element_blank(),
+       axis.ticks.x = element_blank())
+ +

Keypoints +

  • Use ggplot2 to create plots.
  • Think about graphics in layers: aesthetics, geometry, statistics, scale transformation, and grouping.

Content from Vectorization


Last updated on 2023-10-26


Estimated time 25 minutes

+ +




  • How can I operate on all the elements of a vector at once?


  • To understand vectorized operations in R.

Most of R’s functions are vectorized, meaning that the function will +operate on all elements of a vector without needing to loop through and +act on each element one at a time. This makes writing code more concise, +easy to read, and less error prone.


R +

+x <- 1:4
+x * 2


[1] 2 4 6 8

The multiplication happened to each element of the vector.


We can also add two vectors together:


R +

+y <- 6:9
+x + y


[1]  7  9 11 13

Each element of x was added to its corresponding element +of y:


R +

x:  1  2  3  4
+    +  +  +  +
+y:  6  7  8  9
+    7  9 11 13

Here is how we would add two vectors together using a for loop:


R +

+output_vector <- c()
+for (i in 1:4) {
+  output_vector[i] <- x[i] + y[i]
+}
+output_vector


[1]  7  9 11 13

Compare this to the output using vectorised operations.


R +

+sum_xy <- x + y
+sum_xy


[1]  7  9 11 13
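To make the comparison concrete, here is a small self-contained sketch that runs both approaches and checks that they agree. The names `loop_sum` and `vector_sum` are only illustrative.

```r
x <- 1:4
y <- 6:9

# Loop version: add the elements one pair at a time
loop_sum <- integer(length(x))
for (i in seq_along(x)) {
  loop_sum[i] <- x[i] + y[i]
}

# Vectorized version: add all pairs in a single expression
vector_sum <- x + y

# Both approaches give exactly the same result
identical(loop_sum, vector_sum)
```

The vectorized form needs no index variable and no pre-allocated output, which is where most of the gain in readability comes from.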
+ +

Challenge 1 +


Let’s try this on the pop column of the +gapminder dataset.


Make a new column in the gapminder data frame that +contains population in units of millions of people. Check the head or +tail of the data frame to make sure it worked.

+ +

R +

+gapminder$pop_millions <- gapminder$pop / 1e6
+head(gapminder)


      country year      pop continent lifeExp gdpPercap pop_millions
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453     8.425333
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530     9.240934
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007    10.267083
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971    11.537966
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811    13.079460
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134    14.880372
+ +

Challenge 2 +


On a single graph, plot population, in millions, against year, for +all countries. Do not worry about identifying which country is +which.


Repeat the exercise, graphing only for China, India, and Indonesia. +Again, do not worry about which is which.

+ +

Refresh your plotting skills by plotting population in millions +against year.


R +

+ggplot(gapminder, aes(x = year, y = pop_millions)) +
+ geom_point()
Scatter plot showing populations in the millions against the year for all countries, countries are not labeled.

R +

+countryset <- c("China","India","Indonesia")
+ggplot(gapminder[gapminder$country %in% countryset,],
+       aes(x = year, y = pop_millions)) +
+  geom_point()
Scatter plot showing populations in the millions against the year for China, India, and Indonesia, countries are not labeled.

Comparison operators, logical operators, and many functions are also +vectorized:


Comparison operators


R +

+x > 2


[1] FALSE FALSE  TRUE  TRUE

Logical operators


R +

+a <- x > 3  # or, for clarity, a <- (x > 3)
+a


[1] FALSE FALSE FALSE  TRUE

+ +

Tip: some useful functions for logical +vectors +


any() will return TRUE if any +element of a vector is TRUE.
all() will return TRUE if all +elements of a vector are TRUE.
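A short sketch of how these two functions behave on a logical vector, using the same `x` as earlier in this episode (the name `is_big` is only illustrative):

```r
x <- 1:4

# Element-wise comparison gives one logical value per element
is_big <- x > 2
is_big

# any(): is at least one element TRUE?
any(is_big)

# all(): is every element TRUE?
all(is_big)

# Logical operators are vectorized too: & (and), | (or), ! (not)
x > 1 & x < 4
```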


Most functions also operate element-wise on vectors:




R +

+x <- 1:4
+log(x)


[1] 0.0000000 0.6931472 1.0986123 1.3862944

Vectorized operations work element-wise on matrices:


R +

+m <- matrix(1:12, nrow=3, ncol=4)
+m * -1


     [,1] [,2] [,3] [,4]
+[1,]   -1   -4   -7  -10
+[2,]   -2   -5   -8  -11
+[3,]   -3   -6   -9  -12
+ +

Tip: element-wise vs. matrix +multiplication +


Very important: the operator * gives you element-wise +multiplication! To do matrix multiplication, we need to use the +%*% operator:


R +

+m %*% matrix(1, nrow=4, ncol=1)


     [,1]
+[1,]   22
+[2,]   26
+[3,]   30

R +

+matrix(1:4, nrow=1) %*% matrix(1:4, ncol=1)


     [,1]
+[1,]   30

For more on matrix algebra, see the Quick-R reference guide.
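The two operators can be compared directly on a pair of small matrices. This is a minimal sketch; the matrices `a` and `b` are made up for illustration.

```r
a <- matrix(1:4, nrow = 2)   # 2 x 2 matrix, filled column by column
b <- matrix(5:8, nrow = 2)

# Element-wise product: multiplies entries in matching positions
elementwise <- a * b
elementwise

# Matrix product: rows of `a` combined with columns of `b`
matrix_product <- a %*% b
matrix_product
```

For `%*%` the number of columns of the left matrix must equal the number of rows of the right matrix, whereas `*` expects both operands to have the same shape (or to be recyclable).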

+ +

Challenge 3 +


Given the following matrix:


R +

+m <- matrix(1:12, nrow=3, ncol=4)


     [,1] [,2] [,3] [,4]
+[1,]    1    4    7   10
+[2,]    2    5    8   11
+[3,]    3    6    9   12

Write down what you think will happen when you run:

  1. m ^ -1
  2. m * c(1, 0, -1)
  3. m > c(0, 20)
  4. m * c(1, 0, -1, 2)

Did you get the output you expected? If not, ask a helper!

+ +

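  1. m ^ -1


          [,1]      [,2]      [,3]       [,4]
+[1,] 1.0000000 0.2500000 0.1428571 0.10000000
+[2,] 0.5000000 0.2000000 0.1250000 0.09090909
+[3,] 0.3333333 0.1666667 0.1111111 0.08333333

  2. m * c(1, 0, -1)


     [,1] [,2] [,3] [,4]
+[1,]    1    4    7   10
+[2,]    0    0    0    0
+[3,]   -3   -6   -9  -12

  3. m > c(0, 20)


      [,1]  [,2]  [,3]  [,4]
+[1,]  TRUE FALSE  TRUE FALSE
+[2,] FALSE  TRUE FALSE  TRUE
+[3,]  TRUE FALSE  TRUE FALSE

  4. m * c(1, 0, -1, 2)


     [,1] [,2] [,3] [,4]
+[1,]    1    8   -7    0
+[2,]    0    5   16  -11
+[3,]   -3    0    9   24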
+ +

Challenge 4 +


We’re interested in looking at the sum of the following sequence of +fractions:


R +

+ x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2)

This would be tedious to type out, and impossible for high values of +n. Use vectorisation to compute x when n=100. What is the sum when +n=10,000?

+ +

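R +

+sum(1/(1:100)^2)


[1] 1.634984

R +

+sum(1/(1:10000)^2)


[1] 1.644834

R +

+n <- 10000
+sum(1/(1:n)^2)


[1] 1.644834

We can also obtain the same results using a function:


R +

+inverse_sum_of_squares <- function(n) {
+  sum(1/(1:n)^2)
+}
+inverse_sum_of_squares(100)


[1] 1.634984

R +

+inverse_sum_of_squares(10000)


[1] 1.644834

R +

+n <- 10000
+inverse_sum_of_squares(n)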


[1] 1.644834
+ +

Tip: Operations on vectors of unequal +length +


Operations can also be performed on vectors of unequal length, through a process known as recycling. This process automatically repeats the shorter vector until it matches the length of the longer vector. R will provide a warning if the length of the longer vector is not a multiple of the length of the shorter vector.


R +

+x <- c(1, 2, 3)
+y <- c(1, 2, 3, 4, 5, 6, 7)
+x + y


Warning in x + y: longer object length is not a multiple of shorter object
+length


[1] 2 4 6 5 7 9 8

Vector x was recycled to match the length of vector y:


R +

x:  1  2  3  1  2  3  1
+    +  +  +  +  +  +  +
+y:  1  2  3  4  5  6  7
+    2  4  6  5  7  9  8
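Recycling can be imitated explicitly with `rep_len()`, which may make the diagram above easier to follow. This is a sketch for illustration only; `x_recycled` is a hypothetical helper variable, and R does not need you to do this step yourself.

```r
x <- c(1, 2, 3)
y <- c(1, 2, 3, 4, 5, 6, 7)

# rep_len() repeats x until it has the requested length,
# which mirrors what recycling does behind the scenes
x_recycled <- rep_len(x, length.out = length(y))
x_recycled

# Adding the explicitly recycled vector gives the same values as x + y,
# but without the warning because the lengths now match
x_recycled + y
```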
+ +

Keypoints +

  • Use vectorized operations instead of loops.

Content from Functions Explained


Last updated on 2023-10-26


Estimated time 60 minutes

+ +




  • How can I write a new function in R?


  • Define a function that takes arguments.
  • Return a value from a function.
  • Check argument conditions with stopifnot() in functions.
  • Test a function.
  • Set default values for function arguments.
  • Explain why we should divide programs into small, single-purpose functions.

If we only had one data set to analyze, it would probably be faster +to load the file into a spreadsheet and use that to plot simple +statistics. However, the gapminder data is updated periodically, and we +may want to pull in that new information later and re-run our analysis +again. We may also obtain similar data from a different source in the +future.


In this lesson, we’ll learn how to write a function so that we can +repeat several operations with a single command.

+ +

What is a function? +


Functions gather a sequence of operations into a whole, preserving it +for ongoing use. Functions provide:

  • a name we can remember and invoke it by
  • relief from the need to remember the individual operations
  • a defined set of inputs and expected outputs
  • rich connections to the larger programming environment

As the basic building block of most programming languages, +user-defined functions constitute “programming” as much as any single +abstraction can. If you have written a function, you are a computer +programmer.


Defining a function +


Let’s open a new R script file in the functions/ +directory and call it functions-lesson.R.


The general structure of a function is:


R +

+my_function <- function(parameters) {
+  # perform action
+  # return value
+}

Let’s define a function fahr_to_kelvin() that converts +temperatures from Fahrenheit to Kelvin:


R +

+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}

We define fahr_to_kelvin() by assigning it to the output of function. The list of argument names is contained within parentheses. Next, the body of the function (the statements that are executed when it runs) is contained within curly braces ({}). The statements in the body are indented by two spaces. This makes the code easier to read but does not affect how the code operates.


It is useful to think of creating functions like writing a cookbook. First you define the “ingredients” that your function needs. In this case, we only need one ingredient to use our function: “temp”. After we list our ingredients, we then say what we will do with them; in this case, we take our ingredient and apply a set of mathematical operators to it.


When we call the function, the values we pass to it as arguments are +assigned to those variables so that we can use them inside the function. +Inside the function, we use a return statement to send a +result back to whoever asked for it.

+ +

Tip +


In R, an explicit return statement is not required: a function automatically returns the value of the last expression evaluated in its body. But for clarity, we will explicitly define the return statement.
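As a quick illustration of this tip, the sketch below defines the same conversion twice, once with and once without `return()`. The two function names are hypothetical and exist only for this comparison.

```r
# Explicit return(): the style used in this lesson
fahr_to_kelvin_explicit <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

# Implicit return: the value of the last evaluated expression is returned
fahr_to_kelvin_implicit <- function(temp) {
  ((temp - 32) * (5 / 9)) + 273.15
}

# Both versions give the same answer for the freezing point of water
fahr_to_kelvin_explicit(32)
fahr_to_kelvin_implicit(32)
```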


Let’s try running our function. Calling our own function is no +different from calling any other function:


R +

+# freezing point of water
+fahr_to_kelvin(32)


[1] 273.15

R +

+# boiling point of water
+fahr_to_kelvin(212)


[1] 373.15
+ +

Challenge 1 +


Write a function called kelvin_to_celsius() that takes a +temperature in Kelvin and returns that temperature in Celsius.


Hint: To convert from Kelvin to Celsius you subtract 273.15

+ +

R +

+kelvin_to_celsius <- function(temp) {
+ celsius <- temp - 273.15
+ return(celsius)
+}

Combining functions +


The real power of functions comes from mixing, matching and combining +them into ever-larger chunks to get the effect we want.


Let’s define two functions that will convert temperature from +Fahrenheit to Kelvin, and Kelvin to Celsius:


R +

+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+kelvin_to_celsius <- function(temp) {
+  celsius <- temp - 273.15
+  return(celsius)
+}
+ +

Challenge 2 +


Define the function to convert directly from Fahrenheit to Celsius, +by reusing the two functions above (or using your own functions if you +prefer).

+ +

R +

+fahr_to_celsius <- function(temp) {
+  temp_k <- fahr_to_kelvin(temp)
+  result <- kelvin_to_celsius(temp_k)
+  return(result)
+}

Interlude: Defensive Programming +


Now that we’ve begun to appreciate how writing functions provides an +efficient way to make R code re-usable and modular, we should note that +it is important to ensure that functions only work in their intended +use-cases. Checking function parameters is related to the concept of +defensive programming. Defensive programming encourages us to +frequently check conditions and throw an error if something is wrong. +These checks are referred to as assertion statements because we want to +assert some condition is TRUE before proceeding. They make +it easier to debug because they give us a better idea of where the +errors originate.


Checking conditions with stopifnot() + +


Let’s start by re-examining fahr_to_kelvin(), our +function for converting temperatures from Fahrenheit to Kelvin. It was +defined like so:


R +

+fahr_to_kelvin <- function(temp) {
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}

For this function to work as intended, the argument temp +must be a numeric value; otherwise, the mathematical +procedure for converting between the two temperature scales will not +work. To create an error, we can use the function stop(). +For example, since the argument temp must be a +numeric vector, we could check for this condition with an +if statement and throw an error if the condition was +violated. We could augment our function above like so:


R +

+fahr_to_kelvin <- function(temp) {
+  if (!is.numeric(temp)) {
+    stop("temp must be a numeric vector.")
+  }
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}

If we had multiple conditions or arguments to check, it would take many lines of code to check all of them. Luckily R provides the convenience function stopifnot(). We can list as many requirements as we like, each of which should evaluate to TRUE; stopifnot() throws an error if it finds one that is FALSE. Listing these conditions also serves a secondary purpose as extra documentation for the function.


Let’s try out defensive programming with stopifnot() by +adding assertions to check the input to our function +fahr_to_kelvin().


We want to assert the following: temp is a numeric +vector. We may do that like so:


R +

+fahr_to_kelvin <- function(temp) {
+  stopifnot(is.numeric(temp))
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}

It still works when given proper input.


R +

+# freezing point of water
+fahr_to_kelvin(temp = 32)


[1] 273.15

But fails instantly if given improper input.


R +

+# Metric is a factor instead of numeric
+fahr_to_kelvin(temp = as.factor(32))


Error in fahr_to_kelvin(temp = as.factor(32)): is.numeric(temp) is not TRUE
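stopifnot() also accepts several conditions in one call. The sketch below is a hypothetical variant of a Kelvin-to-Celsius converter (the name `kelvin_to_celsius_checked` and the specific checks are assumptions made for illustration, not part of the lesson's functions). It also uses `tryCatch()` so the error message can be inspected without halting the script.

```r
# Hypothetical variant of kelvin_to_celsius() with several assertions
kelvin_to_celsius_checked <- function(temp) {
  stopifnot(
    is.numeric(temp),      # must be numbers
    length(temp) > 0,      # must not be empty
    all(temp >= 0)         # Kelvin temperatures cannot be negative
  )
  temp - 273.15
}

# Valid input passes every check
kelvin_to_celsius_checked(c(273.15, 373.15))

# tryCatch() captures the error message instead of stopping the script
msg <- tryCatch(
  kelvin_to_celsius_checked(-5),
  error = function(e) conditionMessage(e)
)
msg
```

The message names the first condition that failed, which is what makes these assertions useful when debugging.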
+ +

Challenge 3 +


Use defensive programming to ensure that our +fahr_to_celsius() function throws an error immediately if +the argument temp is specified inappropriately.

+ +

Extend our previous definition of the function by adding in an +explicit call to stopifnot(). Since +fahr_to_celsius() is a composition of two other functions, +checking inside here makes adding checks to the two component functions +redundant.


R +

+fahr_to_celsius <- function(temp) {
+  stopifnot(is.numeric(temp))
+  temp_k <- fahr_to_kelvin(temp)
+  result <- kelvin_to_celsius(temp_k)
+  return(result)
+}

More on combining functions +


Now, we’re going to define a function that calculates the Gross +Domestic Product of a nation from the data available in our dataset:


R +

+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat) {
+  gdp <- dat$pop * dat$gdpPercap
+  return(gdp)
+}

We define calcGDP() by assigning it to the output of function. The list of argument names is contained within parentheses. Next, the body of the function (the statements executed when you call the function) is contained within curly braces ({}).


We’ve indented the statements in the body by two spaces. This makes +the code easier to read but does not affect how it operates.


When we call the function, the values we pass to it are assigned to +the arguments, which become variables inside the body of the +function.


Inside the function, we use the return() function to +send back the result. This return() function is optional: R +will automatically return the results of whatever command is executed on +the last line of the function.


R +

+calcGDP(head(gapminder))


[1]  6567086330  7585448670  8758855797  9648014150  9678553274 11697659231

That’s not very informative. Let’s add some more arguments so we can +extract that per year and country.


R +

+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year=NULL, country=NULL) {
+  if(!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country,]
+  }
+  gdp <- dat$pop * dat$gdpPercap
+  new <- cbind(dat, gdp=gdp)
+  return(new)
+}

If you’ve been writing these functions down into a separate R script (a good idea!), you can load the functions into your R session by using the source() function:


R +

+source("functions/functions-lesson.R")

Ok, so there’s a lot going on in this function now. In plain English, +the function now subsets the provided data by year if the year argument +isn’t empty, then subsets the result by country if the country argument +isn’t empty. Then it calculates the GDP for whatever subset emerges from +the previous two steps. The function then adds the GDP as a new column +to the subsetted data and returns this as the final result. You can see +that the output is much more informative than a vector of numbers.


Let’s take a look at what happens when we specify the year:


R +

+head(calcGDP(gapminder, year=2007))


       country year      pop continent lifeExp  gdpPercap          gdp
+12 Afghanistan 2007 31889923      Asia  43.828   974.5803  31079291949
+24     Albania 2007  3600523    Europe  76.423  5937.0295  21376411360
+36     Algeria 2007 33333216    Africa  72.301  6223.3675 207444851958
+48      Angola 2007 12420476    Africa  42.731  4797.2313  59583895818
+60   Argentina 2007 40301927  Americas  75.320 12779.3796 515033625357
+72   Australia 2007 20434176   Oceania  81.235 34435.3674 703658358894

Or for a specific country:


R +

+calcGDP(gapminder, country="Australia")


     country year      pop continent lifeExp gdpPercap          gdp
+61 Australia 1952  8691212   Oceania  69.120  10039.60  87256254102
+62 Australia 1957  9712569   Oceania  70.330  10949.65 106349227169
+63 Australia 1962 10794968   Oceania  70.930  12217.23 131884573002
+64 Australia 1967 11872264   Oceania  71.100  14526.12 172457986742
+65 Australia 1972 13177000   Oceania  71.930  16788.63 221223770658
+66 Australia 1977 14074100   Oceania  73.490  18334.20 258037329175
+67 Australia 1982 15184200   Oceania  74.740  19477.01 295742804309
+68 Australia 1987 16257249   Oceania  76.320  21888.89 355853119294
+69 Australia 1992 17481977   Oceania  77.560  23424.77 409511234952
+70 Australia 1997 18565243   Oceania  78.830  26997.94 501223252921
+71 Australia 2002 19546792   Oceania  80.370  30687.75 599847158654
+72 Australia 2007 20434176   Oceania  81.235  34435.37 703658358894

Or both:


R +

+calcGDP(gapminder, year=2007, country="Australia")


     country year      pop continent lifeExp gdpPercap          gdp
+72 Australia 2007 20434176   Oceania  81.235  34435.37 703658358894

Let’s walk through the body of the function:


R +

calcGDP <- function(dat, year=NULL, country=NULL) {

Here we’ve added two arguments, year, and +country. We’ve set default arguments for both as +NULL using the = operator in the function +definition. This means that those arguments will take on those values +unless the user specifies otherwise.
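The pattern of a `NULL` default that means "no filtering" can be seen in isolation in the following sketch. The function `filter_year()` and the data frame `toy` are hypothetical and only mimic one branch of `calcGDP()`.

```r
# Hypothetical helper: keep all rows unless `year` is supplied
filter_year <- function(dat, year = NULL) {
  if (!is.null(year)) {
    dat <- dat[dat$year %in% year, ]
  }
  return(dat)
}

# Made-up data for illustration
toy <- data.frame(year = c(1952, 1957, 1962), pop = c(10, 12, 15))

nrow(filter_year(toy))               # no year given: default NULL, all rows kept
nrow(filter_year(toy, year = 1957))  # default overridden: only matching rows kept
```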


R +

+  if(!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country,]
+  }

Here, we check whether each additional argument is set to NULL, and whenever it is not NULL we overwrite the dataset stored in dat with a subset given by the non-NULL argument.


Building these conditionals into the function makes it more flexible +for later. Now, we can use it to calculate the GDP for:

  • The whole dataset;
  • +
  • A single year;
  • +
  • A single country;
  • +
  • A single combination of year and country.
  • +

By using %in% instead of ==, we can also give multiple years or countries to those arguments.
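The difference between the two operators shows up as soon as more than one value is requested. A minimal sketch with a made-up `years` vector:

```r
years <- c(1952, 1957, 1962, 1967)

# %in% asks, for each element on the left, whether it appears
# anywhere in the set on the right
years %in% c(1957, 1952)

# == compares element by element and recycles the shorter vector,
# so here it silently misses both matches
years == c(1957, 1952)
```

With `==` the comparison pairs 1952 with 1957, 1957 with 1952, and so on, so the order of the requested values changes the answer; `%in%` does not depend on order.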

+ +

Tip: Pass by value +


Functions in R almost always make copies of the data to operate on +inside of a function body. When we modify dat inside the +function we are modifying the copy of the gapminder dataset stored in +dat, not the original variable we gave as the first +argument.


This is called “pass-by-value” and it makes writing code much safer: +you can always be sure that whatever changes you make within the body of +the function, stay inside the body of the function.
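A small sketch of this behaviour, using a hypothetical function `add_one()` that modifies its argument:

```r
# Hypothetical function that modifies its argument
add_one <- function(v) {
  v[1] <- v[1] + 1   # changes only the function's local copy
  return(v)
}

original <- c(10, 20, 30)
modified <- add_one(original)

original   # unchanged by the call
modified   # first element increased by one
```

To keep a change made inside a function, assign the returned value to a variable, as done here with `modified`.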

+ +

Tip: Function scope +


Another important concept is scoping: any variables (or functions!) +you create or modify inside the body of a function only exist for the +lifetime of the function’s execution. When we call +calcGDP(), the variables dat, gdp +and new only exist inside the body of the function. Even if +we have variables of the same name in our interactive R session, they +are not modified in any way when executing a function.
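The following sketch shows a global variable and a local variable sharing the name `gdp` without interfering with each other. The function `compute_gdp()` is hypothetical and much simpler than `calcGDP()`.

```r
gdp <- "defined in the global environment"

# Hypothetical function that creates its own local variable called gdp
compute_gdp <- function(pop, gdp_per_cap) {
  gdp <- pop * gdp_per_cap   # local to this call
  return(gdp)
}

result <- compute_gdp(pop = 1000, gdp_per_cap = 5)

result   # value computed inside the function
gdp      # still the original character string
```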


R +

  gdp <- dat$pop * dat$gdpPercap
+  new <- cbind(dat, gdp=gdp)
+  return(new)

Finally, we calculated the GDP on our new subset, and created a new +data frame with that column added. This means when we call the function +later we can see the context for the returned GDP values, which is much +better than in our first attempt where we got a vector of numbers.

+ +

Challenge 4 +


Test out your GDP function by calculating the GDP for New Zealand in +1987. How does this differ from New Zealand’s GDP in 1952?

+ +

R +

+  calcGDP(gapminder, year = c(1952, 1987), country = "New Zealand")

GDP for New Zealand in 1987: 65050008703


GDP for New Zealand in 1952: 21058193787

+ +

Challenge 5 +


The paste() function can be used to combine text together, e.g.:


R +

+best_practice <- c("Write", "programs", "for", "people", "not", "computers")
+paste(best_practice, collapse=" ")


[1] "Write programs for people not computers"

Write a function called fence() that takes two vectors +as arguments, called text and wrapper, and +prints out the text wrapped with the wrapper:


R +

+fence(text=best_practice, wrapper="***")

Note: the paste() function has an argument called sep, which specifies the separator between text. The default is a space: " ". The default for paste0() is no space: "".

+ +

R +

+fence <- function(text, wrapper){
+  text <- c(wrapper, text, wrapper)
+  result <- paste(text, collapse = " ")
+  return(result)
+}
+best_practice <- c("Write", "programs", "for", "people", "not", "computers")
+fence(text=best_practice, wrapper="***")


[1] "*** Write programs for people not computers ***"
+ +

Tip +


R has some unique aspects that can be exploited when performing more +complicated operations. We will not be writing anything that requires +knowledge of these more advanced concepts. In the future when you are +comfortable writing functions in R, you can learn more by reading the R +Language Manual or this chapter from Advanced R Programming by Hadley +Wickham.

+ +

Tip: Testing and documenting +


It’s important to both test functions and document them: documentation helps you, and others, understand what the purpose of your function is and how to use it, and it’s important to make sure that your function actually does what you think.


When you first start out, your workflow will probably look a lot like +this:

  1. Write a function
  2. Comment parts of the function to document its behaviour
  3. Load in the source file
  4. Experiment with it in the console to make sure it behaves as you expect
  5. Make any necessary bug fixes
  6. Rinse and repeat.

Formal documentation for functions, written in separate .Rd files, gets turned into the documentation you see in help files. The roxygen2 package allows R coders to write documentation alongside the function code and then process it into the appropriate .Rd files. You will want to switch to this more formal method of writing documentation when you start writing more complicated R projects. In fact, packages are, in essence, bundles of functions with this formal documentation. Loading your own functions through source("functions.R") is similar to loading someone else’s functions (or your own one day!) through library("package").


Formal automated tests can be written using the testthat package.
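As a rough illustration of both ideas, the sketch below documents a function with roxygen2-style comments and checks it with base R's stopifnot(). It assumes nothing beyond base R: the `#'` lines are ordinary comments until roxygen2 processes them, and the checks are an informal stand-in for testthat, not its API. The fence() body here is one possible implementation, not necessarily the one you wrote.

```r
#' Wrap text in a delimiter
#'
#' @param text A character vector of words to join.
#' @param wrapper A single string placed before and after the text.
#' @return A single string.
fence <- function(text, wrapper) {
  # check the arguments before doing any work
  stopifnot(is.character(text), is.character(wrapper), length(wrapper) == 1)
  paste(c(wrapper, text, wrapper), collapse = " ")
}

# Informal tests: stopifnot() raises an error if an expression is not TRUE
stopifnot(fence(c("a", "b"), "***") == "*** a b ***")
stopifnot(fence(character(0), "|") == "| |")

# A bad argument should be rejected by the check inside the function
bad_call <- try(fence(1, "***"), silent = TRUE)
stopifnot(inherits(bad_call, "try-error"))
```

Keeping checks like these in a script next to your functions means you can rerun them every time you change the code.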

+ +

Keypoints +

  • Use function to define a new function in R.
  • Use parameters to pass values into functions.
  • Use stopifnot() to flexibly check function arguments in R.
  • Load functions into programs using source().

Content from Writing Data


Last updated on 2023-10-26


Estimated time 20 minutes

+ +




  • How can I save plots and data created in R?


  • To be able to write out plots and data from R.

Saving plots +


You have already seen how to save the most recent plot you create in +ggplot2, using the command ggsave. As a +refresher:


R +

+ggsave("My_most_recent_plot.pdf")


You can save a plot from within RStudio using the ‘Export’ button in +the ‘Plot’ window. This will give you the option of saving as a .pdf or +as .png, .jpg or other image formats.


Sometimes you will want to save plots without creating them in the +‘Plot’ window first. Perhaps you want to make a pdf document with +multiple pages: each one a different plot, for example. Or perhaps +you’re looping through multiple subsets of a file, plotting data from +each subset, and you want to save each plot, but obviously can’t stop +the loop to click ‘Export’ for each one.


In this case you can use a more flexible approach. The function +pdf creates a new pdf device. You can control the size and +resolution using the arguments to this function.


R +

+pdf("Life_Exp_vs_time.pdf", width=12, height=4)
+ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=country)) +
+  geom_line() +
+  theme(legend.position = "none")
+# You then have to make sure to turn off the pdf device!
+dev.off()

Open up this document and have a look.
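The looping case mentioned earlier can be sketched with base graphics alone. This is an illustration under stated assumptions: the `groups` list holds made-up numbers standing in for subsets of a real file, and the output file name is hypothetical and written to a temporary directory.

```r
# Toy data standing in for subsets of a larger file (hypothetical values)
groups <- list(a = c(1, 3, 2, 5), b = c(4, 4, 6, 7), c = c(9, 8, 7, 7))

# Hypothetical output file in a temporary directory
out_file <- file.path(tempdir(), "one_page_per_group.pdf")

pdf(out_file, width = 6, height = 4)  # open the device: plots now go to the file
for (name in names(groups)) {
  # each new high-level plot call starts a new page in the pdf
  plot(groups[[name]], type = "l", main = paste("Group", name),
       xlab = "index", ylab = "value")
}
dev.off()                             # close the device so the file is complete

file.exists(out_file)
```

If you use ggplot2 inside a loop instead, wrap each plot in print(), because plots are only drawn automatically at the top level of the console.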

+ +

Challenge 1 +


Rewrite your ‘pdf’ command to print a second page in the pdf, showing +a facet plot (hint: use facet_grid) of the same data with +one panel per continent.

+ +

R +

+pdf("Life_Exp_vs_time.pdf", width = 12, height = 4)
+p <- ggplot(data = gapminder, aes(x = year, y = lifeExp, colour = country)) +
+  geom_line() +
+  theme(legend.position = "none")
+p
+p + facet_grid(~continent)
+dev.off()

The commands jpeg, png etc. are used +similarly to produce documents in different formats.


Writing data +


At some point, you’ll also want to write out data from R.


We can use the write.table function for this, which is +very similar to read.table from before.


Let’s create a data-cleaning script. For this analysis, we only want to focus on the gapminder data for Australia:


R +

+aust_subset <- gapminder[gapminder$country == "Australia",]
+write.table(aust_subset,
+  file="cleaned-data/gapminder-aus.csv",
+  sep=","
+)

Let’s switch back to the shell to take a look at the data to make +sure it looks OK:



head cleaned-data/gapminder-aus.csv



Hmm, that’s not quite what we wanted. Where did all these quotation +marks come from? Also the row numbers are meaningless.


Let’s look at the help file to work out how to change this +behaviour.


R +

+?write.table


By default R will wrap character vectors with quotation marks when +writing out to file. It will also write out the row and column +names.


Let’s fix this:


R +

+write.table(
+  gapminder[gapminder$country == "Australia",],
+  file="cleaned-data/gapminder-aus.csv",
+  sep=",", quote=FALSE, row.names=FALSE
+)

Now let’s look at the data again using our shell skills:



head cleaned-data/gapminder-aus.csv



That looks better!
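If you would like to see the effect of quote and row.names without leaving R, the following sketch writes a tiny made-up data frame twice and reads the raw lines back. The data frame `df` and the temporary files are assumptions for illustration only.

```r
# A tiny stand-in for the Australian subset (made-up rows)
df <- data.frame(country = c("Australia", "Australia"),
                 year = c(2002L, 2007L),
                 stringsAsFactors = FALSE)

default_file <- tempfile(fileext = ".csv")
clean_file <- tempfile(fileext = ".csv")

# Defaults: character values are quoted and row names are written
write.table(df, file = default_file, sep = ",")

# Quoting and row names switched off
write.table(df, file = clean_file, sep = ",", quote = FALSE, row.names = FALSE)

# Read the files back as plain text to compare them
default_lines <- readLines(default_file)
clean_lines <- readLines(clean_file)
print(default_lines)
print(clean_lines)
```

readLines() shows the file exactly as a shell tool such as head would, which makes it handy for checking output formats.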

+ +

Challenge 2 +


Write a data-cleaning script file that subsets the gapminder data to +include only data points collected since 1990.


Use this script to write out the new subset to a file in the +cleaned-data/ directory.

+ +

R +

+write.table(
+  gapminder[gapminder$year > 1990, ],
+  file = "cleaned-data/gapminder-after1990.csv",
+  sep = ",", quote = FALSE, row.names = FALSE
+)
+ +

Keypoints +

  • Save plots from RStudio using the ‘Export’ button.
  • Use write.table to save tabular data.

Content from Splitting and Combining Data Frames with plyr


Last updated on 2023-10-26


Estimated time 60 minutes

+ +




  • How can I do different calculations on different sets of data?


  • To be able to use the split-apply-combine strategy for data analysis.

Previously we looked at how you can use functions to simplify your code. We defined the calcGDP function, which takes the gapminder dataset, and multiplies the population and GDP per capita columns. We also defined additional arguments so we could filter by year and country:


R +

+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year=NULL, country=NULL) {
+  if(!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country,]
+  }
+  gdp <- dat$pop * dat$gdpPercap
+  new <- cbind(dat, gdp=gdp)
+  return(new)
+}

A common task you’ll encounter when working with data is running calculations on different groups within the data. In the above, we were calculating the GDP by multiplying two columns together. But what if we wanted to calculate the mean GDP per continent?


We could run calcGDP and then take the mean of each +continent:


R +

+withGDP <- calcGDP(gapminder)
+mean(withGDP[withGDP$continent == "Africa", "gdp"])


[1] 20904782844

R +

+mean(withGDP[withGDP$continent == "Americas", "gdp"])


[1] 379262350210

R +

+mean(withGDP[withGDP$continent == "Asia", "gdp"])


[1] 227233738153

But this isn’t very nice. Yes, by using a function, you have +reduced a substantial amount of repetition. That is +nice. But there is still repetition. Repeating yourself will cost you +time, both now and later, and potentially introduce some nasty bugs.


We could write a new function that is flexible like +calcGDP, but this also takes a substantial amount of effort +and testing to get right.


The abstract problem we’re encountering here is known as “split-apply-combine”:

Split apply combine

We want to split our data into groups, in this case +continents, apply some calculations on that group, then +optionally combine the results together afterwards.
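To make the three steps concrete, here is a minimal base R sketch on a made-up data frame. The `toy` data and its values are assumptions for illustration; the same pattern is what the package functions introduced next wrap up for you.

```r
# Made-up data: two continents with a few GDP values each
toy <- data.frame(
  continent = c("Asia", "Asia", "Europe", "Europe", "Europe"),
  gdp = c(10, 30, 20, 40, 60)
)

# Split: one numeric vector per continent
pieces <- split(toy$gdp, toy$continent)

# Apply: compute the mean of each piece
means <- lapply(pieces, mean)

# Combine: put the per-group results back into one data frame
result <- data.frame(
  continent = names(means),
  mean_gdp = unname(unlist(means))
)
print(result)
```

Each step is a separate line here, which is useful for understanding but becomes tedious to repeat for every new calculation.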


The plyr package +


For those of you who have used R before, you might be familiar with +the apply family of functions. While R’s built in functions +do work, we’re going to introduce you to another method for solving the +“split-apply-combine” problem. The plyr package provides a set of +functions that we find more user friendly for solving this problem.


We installed this package in an earlier challenge. Let us load it +now:


R +

+library("plyr")


Plyr has functions for operating on lists, +data.frames and arrays (matrices, or +n-dimensional vectors). Each function performs:

  1. A splitting operation
  2. Apply a function on each split in turn.
  3. Recombine output data as a single data object.

The functions are named based on the data structure they expect as +input, and the data structure you want returned as output: [a]rray, +[l]ist, or [d]ata.frame. The first letter corresponds to the input data +structure, the second letter to the output data structure, and then the +rest of the function is named “ply”.


This gives us 9 core functions **ply. There are an additional three +functions which will only perform the split and apply steps, and not any +combine step. They’re named by their input data type and represent null +output by a _ (see table)


Note here that plyr’s use of “array” is different to R’s: an array in plyr can include a vector or matrix.

Full apply suite

Each of the xxply functions (daply, ddply, llply, laply, …) has the same structure, with four key features:


R +

+xxply(.data, .variables, .fun)
  • The first letter of the function name gives the input type and the second gives the output type.
  • .data - gives the data object to be processed
  • .variables - identifies the splitting variables
  • .fun - gives the function to be called on each piece

Now we can quickly calculate the mean GDP per continent:


R +

+ddply(
+ .data = calcGDP(gapminder),
+ .variables = "continent",
+ .fun = function(x) mean(x$gdp)
+)


  continent           V1
+1    Africa  20904782844
+2  Americas 379262350210
+3      Asia 227233738153
+4    Europe 269442085301
+5   Oceania 188187105354

Let us walk through the previous code:

  • The ddply function feeds in a data.frame (function starts with d) and returns another data.frame (2nd letter is a d)
  • The first argument we gave was the data.frame we wanted to operate on: in this case the gapminder data. We called calcGDP on it first so that it would have the additional gdp column added to it.
  • The second argument indicated our split criteria: in this case the “continent” column. Note that we gave the name of the column, not the values of the column like we had done previously with subsetting. Plyr takes care of these implementation details for you.
  • The third argument is the function we want to apply to each grouping of the data. We had to define our own short function here: each subset of the data gets stored in x, the first argument of our function. This is an anonymous function: we haven’t defined it elsewhere, and it has no name. It only exists in the scope of our call to ddply.
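The points in this walkthrough can be tried on a small scale. The sketch below assumes the plyr package is installed and uses a made-up data frame `toy`; the helper name `mean_gdp` is hypothetical. It shows that an anonymous function and a named function passed to .fun give the same result.

```r
library(plyr)

# Made-up data: two continents, two rows each
toy <- data.frame(
  continent = c("Asia", "Asia", "Europe", "Europe"),
  gdp = c(10, 30, 20, 40)
)

# A named function: it receives one piece of the data frame at a time
mean_gdp <- function(piece) mean(piece$gdp)

# Same calculation, once with the named function and once anonymously
by_name <- ddply(.data = toy, .variables = "continent", .fun = mean_gdp)
by_anon <- ddply(.data = toy, .variables = "continent",
                 .fun = function(x) mean(x$gdp))

print(by_name)
identical(by_name, by_anon)
```

A named function is worth the extra line when the calculation is long or when you want to test it on its own.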
+ +

Challenge 1 +


Calculate the average life expectancy per continent. Which has the +longest? Which has the shortest?

+ +

R +

+ddply(
+ .data = gapminder,
+ .variables = "continent",
+ .fun = function(x) mean(x$lifeExp)
+)

Oceania has the longest and Africa the shortest.


What if we want a different type of output data structure?:


R +

+dlply(
+ .data = calcGDP(gapminder),
+ .variables = "continent",
+ .fun = function(x) mean(x$gdp)
+)


$Africa
+[1] 20904782844
+$Americas
+[1] 379262350210
+$Asia
+[1] 227233738153
+$Europe
+[1] 269442085301
+$Oceania
+[1] 188187105354
+attr(,"split_type")
+[1] "data.frame"
+attr(,"split_labels")
+  continent
+1    Africa
+2  Americas
+3      Asia
+4    Europe
+5   Oceania

We called the same function again, but changed the second letter to +an l, so the output was returned as a list.


We can specify multiple columns to group by:


R +

+ddply(
+ .data = calcGDP(gapminder),
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$gdp)
+)


   continent year           V1
+1     Africa 1952   5992294608
+2     Africa 1957   7359188796
+3     Africa 1962   8784876958
+4     Africa 1967  11443994101
+5     Africa 1972  15072241974
+6     Africa 1977  18694898732
+7     Africa 1982  22040401045
+8     Africa 1987  24107264108
+9     Africa 1992  26256977719
+10    Africa 1997  30023173824
+11    Africa 2002  35303511424
+12    Africa 2007  45778570846
+13  Americas 1952 117738997171
+14  Americas 1957 140817061264
+15  Americas 1962 169153069442
+16  Americas 1967 217867530844
+17  Americas 1972 268159178814
+18  Americas 1977 324085389022
+19  Americas 1982 363314008350
+20  Americas 1987 439447790357
+21  Americas 1992 489899820623
+22  Americas 1997 582693307146
+23  Americas 2002 661248623419
+24  Americas 2007 776723426068
+25      Asia 1952  34095762661
+26      Asia 1957  47267432088
+27      Asia 1962  60136869012
+28      Asia 1967  84648519224
+29      Asia 1972 124385747313
+30      Asia 1977 159802590186
+31      Asia 1982 194429049919
+32      Asia 1987 241784763369
+33      Asia 1992 307100497486
+34      Asia 1997 387597655323
+35      Asia 2002 458042336179
+36      Asia 2007 627513635079
+37    Europe 1952  84971341466
+38    Europe 1957 109989505140
+39    Europe 1962 138984693095
+40    Europe 1967 173366641137
+41    Europe 1972 218691462733
+42    Europe 1977 255367522034
+43    Europe 1982 279484077072
+44    Europe 1987 316507473546
+45    Europe 1992 342703247405
+46    Europe 1997 383606933833
+47    Europe 2002 436448815097
+48    Europe 2007 493183311052
+49   Oceania 1952  54157223944
+50   Oceania 1957  66826828013
+51   Oceania 1962  82336453245
+52   Oceania 1967 105958863585
+53   Oceania 1972 134112109227
+54   Oceania 1977 154707711162
+55   Oceania 1982 176177151380
+56   Oceania 1987 209451563998
+57   Oceania 1992 236319179826
+58   Oceania 1997 289304255183
+59   Oceania 2002 345236880176
+60   Oceania 2007 403657044512

R +

+daply(
+ .data = calcGDP(gapminder),
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$gdp)
+)


          year
+continent          1952         1957         1962         1967         1972
+  Africa     5992294608   7359188796   8784876958  11443994101  15072241974
+  Americas 117738997171 140817061264 169153069442 217867530844 268159178814
+  Asia      34095762661  47267432088  60136869012  84648519224 124385747313
+  Europe    84971341466 109989505140 138984693095 173366641137 218691462733
+  Oceania   54157223944  66826828013  82336453245 105958863585 134112109227
+          year
+continent          1977         1982         1987         1992         1997
+  Africa    18694898732  22040401045  24107264108  26256977719  30023173824
+  Americas 324085389022 363314008350 439447790357 489899820623 582693307146
+  Asia     159802590186 194429049919 241784763369 307100497486 387597655323
+  Europe   255367522034 279484077072 316507473546 342703247405 383606933833
+  Oceania  154707711162 176177151380 209451563998 236319179826 289304255183
+          year
+continent          2002         2007
+  Africa    35303511424  45778570846
+  Americas 661248623419 776723426068
+  Asia     458042336179 627513635079
+  Europe   436448815097 493183311052
+  Oceania  345236880176 403657044512

You can use these functions in place of for loops (and +it is usually faster to do so). To replace a for loop, put the code that +was in the body of the for loop inside an anonymous +function.


R +

+d_ply(
+  .data=gapminder,
+  .variables = "continent",
+  .fun = function(x) {
+    meanGDPperCap <- mean(x$gdpPercap)
+    print(paste(
+      "The mean GDP per capita for", unique(x$continent),
+      "is", format(meanGDPperCap, big.mark=",")
+   ))
+  }
+)


[1] "The mean GDP per capita for Africa is 2,193.755"
+[1] "The mean GDP per capita for Americas is 7,136.11"
+[1] "The mean GDP per capita for Asia is 7,902.15"
+[1] "The mean GDP per capita for Europe is 14,469.48"
+[1] "The mean GDP per capita for Oceania is 18,621.61"
+ +

Tip: printing numbers +


The format function can be used to make numeric values +“pretty” for printing out in messages.

+ +

Challenge 2 +


Calculate the average life expectancy per continent and year. Which had the longest and shortest in 2007? Which had the greatest change between 1952 and 2007?

+ +

R +

+solution <- ddply(
+ .data = gapminder,
+ .variables = c("continent", "year"),
+ .fun = function(x) mean(x$lifeExp)
+)
+solution_2007 <- solution[solution$year == 2007, ]

Oceania had the longest average life expectancy in 2007 and Africa the shortest.


R +

+solution_1952_2007 <- cbind(solution[solution$year == 1952, ], solution_2007)
+difference_1952_2007 <- data.frame(continent = solution_1952_2007$continent,
+                                   year_1952 = solution_1952_2007[[3]],
+                                   year_2007 = solution_1952_2007[[6]],
+                                   difference = solution_1952_2007[[6]] - solution_1952_2007[[3]])

Asia had the greatest difference, and Oceania the least.

+ +

Alternate Challenge +


Without running them, which of the following will calculate the +average life expectancy per continent:

  1.

R +

+ddply(
+  .data = gapminder,
+  .variables = gapminder$continent,
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)
  2.

R +

+ddply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = mean(dataGroup$lifeExp)
+)
  3.

R +

+ddply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)
  4.

R +

+adply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = function(dataGroup) {
+     mean(dataGroup$lifeExp)
+  }
+)
+ +

Answer 3 will calculate the average life expectancy per +continent.

+ +

Keypoints +

  • Use the plyr package to split data, apply functions to subsets, and combine the results.

Content from Data Frame Manipulation with dplyr


Last updated on 2023-10-26


Estimated time 55 minutes

+ +




  • How can I manipulate data frames without repeating myself?


  • To be able to use the six main data frame manipulation ‘verbs’ with pipes in dplyr.
  • To understand how group_by() and summarize() can be combined to summarize datasets.
  • Be able to analyze a subset of data using logical filtering.

Manipulation of data frames means many things to many researchers: we +often select certain observations (rows) or variables (columns), we +often group the data by a certain variable(s), or we even calculate +summary statistics. We can do these operations using the normal base R +operations:


R +

+mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])


[1] 2193.755

R +

+mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])


[1] 7136.11

R +

+mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])


[1] 7902.15

But this isn’t very nice because there is a fair bit of +repetition. Repeating yourself will cost you time, both now and later, +and potentially introduce some nasty bugs.


The dplyr package +


Luckily, the dplyr +package provides a number of very useful functions for manipulating data +frames in a way that will reduce the above repetition, reduce the +probability of making errors, and probably even save you some typing. As +an added bonus, you might even find the dplyr grammar +easier to read.

+ +

Tip: Tidyverse +


The dplyr package belongs to a broader family of opinionated R packages designed for data science called the “Tidyverse”. These packages are specifically designed to work harmoniously together. Some of these packages will be covered in this course, but you can find more complete information here: https://www.tidyverse.org/.


Here we’re going to cover 5 of the most commonly used functions as +well as using pipes (%>%) to combine them.

  1. select()
  2. filter()
  3. group_by()
  4. summarize()
  5. mutate()

If you have not installed this package earlier, please do so:


R +

+install.packages('dplyr')


Now let’s load the package:


R +

+library("dplyr")


Using select() +


If, for example, we wanted to move forward with only a few of the +variables in our data frame we could use the select() +function. This will keep only the variables you select.


R +

+year_country_gdp <- select(gapminder, year, country, gdpPercap)

Diagram illustrating use of select function to select two columns of a data frame

If we want to remove only one column from the gapminder data, for example the continent column, we can put a minus sign in front of its name:


R +

+smaller_gapminder_data <- select(gapminder, -continent)

If we open up year_country_gdp we’ll see that it only +contains the year, country and gdpPercap. Above we used ‘normal’ +grammar, but the strengths of dplyr lie in combining +several functions using pipes. Since the pipes grammar is unlike +anything we’ve seen in R before, let’s repeat what we’ve done above +using pipes.


R +

+year_country_gdp <- gapminder %>% select(year, country, gdpPercap)

To help you understand why we wrote that in that way, let’s walk through it step by step. First we summon the gapminder data frame and pass it on, using the pipe symbol %>%, to the next step, which is the select() function. In this case we don’t specify which data object we use in the select() function since it gets that from the previous pipe. Fun Fact: There is a good chance you have encountered pipes before in the shell. In R, a pipe symbol is %>% while in the shell it is | but the concept is the same!
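One way to convince yourself that the pipe only changes how the code is written, not what it does, is to compare both forms on a small example. This sketch assumes dplyr is installed and uses a made-up data frame `toy`.

```r
library(dplyr)

# Made-up data with the same kind of columns as gapminder
toy <- data.frame(year = c(1952, 1957),
                  country = c("A", "B"),
                  pop = c(1, 2))

# 'Normal' grammar: the data frame is the first argument
nested <- select(toy, year, country)

# Pipe grammar: the left-hand side becomes the first argument
piped <- toy %>% select(year, country)

identical(nested, piped)

# With two steps the nested form reads inside-out, the piped form top-down
nested2 <- select(filter(toy, year > 1952), country)
piped2 <- toy %>%
  filter(year > 1952) %>%
  select(country)

identical(nested2, piped2)
```

The more steps you chain, the more the piped form helps, because each line is one operation in the order it happens.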

+ +

Tip: Renaming data frame columns in dplyr +


In Chapter 4 we covered how you can rename columns with base R by +assigning a value to the output of the names() function. +Just like select, this is a bit cumbersome, but thankfully dplyr has a +rename() function.


Within a pipeline, the syntax is +rename(new_name = old_name). For example, we may want to +rename the gdpPercap column name from our select() +statement above.


R +

+tidy_gdp <- year_country_gdp %>% rename(gdp_per_capita = gdpPercap)
+head(tidy_gdp)


  year     country gdp_per_capita
+1 1952 Afghanistan       779.4453
+2 1957 Afghanistan       820.8530
+3 1962 Afghanistan       853.1007
+4 1967 Afghanistan       836.1971
+5 1972 Afghanistan       739.9811
+6 1977 Afghanistan       786.1134

Using filter() +


If we now want to move forward with the above, but only with European countries, we can combine select and filter:


R +

+year_country_gdp_euro <- gapminder %>%
+    filter(continent == "Europe") %>%
+    select(year, country, gdpPercap)

If we now want to show life expectancy of European countries but only +for a specific year (e.g., 2007), we can do as below.


R +

+europe_lifeExp_2007 <- gapminder %>%
+  filter(continent == "Europe", year == 2007) %>%
+  select(country, lifeExp)
+ +

Challenge 1 +


Write a single command (which can span multiple lines and includes pipes) that will produce a data frame that has the African values for lifeExp, country and year, but not for other continents. How many rows does your data frame have and why?

+ +

R +

+year_country_lifeExp_Africa <- gapminder %>%
+                           filter(continent == "Africa") %>%
+                           select(year, country, lifeExp)

As with last time, first we pass the gapminder data frame to the +filter() function, then we pass the filtered version of the +gapminder data frame to the select() function. +Note: The order of operations is very important in this +case. If we used ‘select’ first, filter would not be able to find the +variable continent since we would have removed it in the previous +step.
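The note about ordering can be checked directly. The sketch below assumes dplyr is installed and uses a made-up data frame `toy`; tryCatch() is used so that the failing pipeline does not stop the script, and the replacement message is our own wording rather than the text of the real error.

```r
library(dplyr)

# Made-up data: one African and one Asian country
toy <- data.frame(continent = c("Africa", "Asia"),
                  country = c("Kenya", "Japan"),
                  lifeExp = c(54, 82))

# filter() first, then select(): continent is still available when needed
ok <- toy %>%
  filter(continent == "Africa") %>%
  select(country, lifeExp)

# select() first drops continent, so the later filter() cannot find it
bad <- tryCatch(
  toy %>%
    select(country, lifeExp) %>%
    filter(continent == "Africa"),
  error = function(e) "filter() failed: continent is no longer in the data"
)

print(ok)
print(bad)
```

A simple rule of thumb is to filter on a column before you drop it.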


Using group_by() +


Now, we were supposed to be reducing the error prone repetitiveness of what can be done with base R, but up to now we haven’t done that since we would have to repeat the above for each continent. Instead of filter(), which will only pass observations that meet your criteria (in the above: continent=="Europe"), we can use group_by(), which will essentially use every unique value of the grouping variable that you could have used as a criterion in filter.


R +

+str(gapminder)



'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...

R +

+str(gapminder %>% group_by(continent))


gropd_df [1,704 × 6] (S3: grouped_df/tbl_df/tbl/data.frame)
+ $ country  : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num [1:1704] 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr [1:1704] "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
+ - attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
+  ..$ continent: chr [1:5] "Africa" "Americas" "Asia" "Europe" ...
+  ..$ .rows    : list<int> [1:5] 
+  .. ..$ : int [1:624] 25 26 27 28 29 30 31 32 33 34 ...
+  .. ..$ : int [1:300] 49 50 51 52 53 54 55 56 57 58 ...
+  .. ..$ : int [1:396] 1 2 3 4 5 6 7 8 9 10 ...
+  .. ..$ : int [1:360] 13 14 15 16 17 18 19 20 21 22 ...
+  .. ..$ : int [1:24] 61 62 63 64 65 66 67 68 69 70 ...
+  .. ..@ ptype: int(0) 
+  ..- attr(*, ".drop")= logi TRUE

You will notice that the structure of the data frame where we used group_by() (grouped_df) is not the same as the original gapminder (data.frame). A grouped_df can be thought of as a list where each item in the list is a data.frame which contains only the rows that correspond to a particular value of continent (at least in the example above).

Diagram illustrating how the group by function organizes a data frame into groups
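Rather than reading the str() output, you can ask a grouped data frame about its groups directly. This is a small sketch assuming dplyr is installed; the data frame `toy` is made up for illustration.

```r
library(dplyr)

# Made-up data: three continents with one or two rows each
toy <- data.frame(
  continent = c("Asia", "Europe", "Asia", "Europe", "Africa"),
  lifeExp = c(60, 70, 62, 72, 50)
)

grouped <- toy %>% group_by(continent)

n_groups(grouped)     # how many groups there are
group_keys(grouped)   # one row per group, showing the grouping values
group_size(grouped)   # number of rows in each group, in group_keys() order

# The rows themselves are unchanged; only the grouping information is added
nrow(grouped) == nrow(toy)

# ungroup() removes the grouping again
ungrouped <- ungroup(grouped)
is_grouped_df(ungrouped)
```

Grouping on its own does not change any values, which is why it only becomes interesting once another verb acts on the groups.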

Using summarize() +


The above was a bit on the uneventful side but +group_by() is much more exciting in conjunction with +summarize(). This will allow us to create new variable(s) +by using functions that repeat for each of the continent-specific data +frames. That is to say, using the group_by() function, we +split our original data frame into multiple pieces, then we can run +functions (e.g. mean() or sd()) within +summarize().


R +

+gdp_bycontinents <- gapminder %>%
+    group_by(continent) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap))
Diagram illustrating the use of group by and summarize together to create a new variable

R +

continent mean_gdpPercap
+     <fctr>          <dbl>
+1    Africa       2193.755
+2  Americas       7136.110
+3      Asia       7902.150
+4    Europe      14469.476
+5   Oceania      18621.609

That allowed us to calculate the mean gdpPercap for each continent, +but it gets even better.

+ +

Challenge 2 +


Calculate the average life expectancy per country. Which has the +longest average life expectancy and which has the shortest average life +expectancy?

+ +

R +

+lifeExp_bycountry <- gapminder %>%
+   group_by(country) %>%
+   summarize(mean_lifeExp = mean(lifeExp))
+lifeExp_bycountry %>%
+   filter(mean_lifeExp == min(mean_lifeExp) | mean_lifeExp == max(mean_lifeExp))


# A tibble: 2 × 2
+  country      mean_lifeExp
+  <chr>               <dbl>
+1 Iceland              76.5
+2 Sierra Leone         36.8

Another way to do this is to use the dplyr function +arrange(), which arranges the rows in a data frame +according to the order of one or more variables from the data frame. It +has similar syntax to other functions from the dplyr +package. You can use desc() inside arrange() +to sort in descending order.


R +

+lifeExp_bycountry %>%
+   arrange(mean_lifeExp) %>%
+   head(1)


# A tibble: 1 × 2
+  country      mean_lifeExp
+  <chr>               <dbl>
+1 Sierra Leone         36.8

R +

+lifeExp_bycountry %>%
+   arrange(desc(mean_lifeExp)) %>%
+   head(1)


# A tibble: 1 × 2
+  country mean_lifeExp
+  <chr>          <dbl>
+1 Iceland         76.5

Alphabetical order works too:


R +

+lifeExp_bycountry %>%
+   arrange(desc(country)) %>%
+   head(1)


# A tibble: 1 × 2
+  country  mean_lifeExp
+  <chr>           <dbl>
+1 Zimbabwe         52.7

The function group_by() allows us to group by multiple +variables. Let’s group by year and +continent.


R +

+gdp_bycontinents_byyear <- gapminder %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
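The message above is informational, not an error. As a sketch of what it refers to, assuming dplyr 1.0.0 or later (where the `.groups` argument named in the message is available) and a made-up data frame `toy`, you can compare the default result with one where the grouping is dropped explicitly:

```r
library(dplyr)

# Made-up data: two grouping columns and one value column
toy <- data.frame(
  continent = c("Asia", "Asia", "Asia", "Europe"),
  year = c(1952, 1952, 1957, 1952),
  gdpPercap = c(100, 300, 400, 1000)
)

# Default: the last grouping variable (year) is dropped, continent is kept
default_out <- toy %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap))

# Explicit: drop all grouping, which also silences the message
dropped <- toy %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap), .groups = "drop")

group_vars(default_out)
group_vars(dropped)
```

Whether you keep or drop the remaining grouping matters for later steps, because any further summarize() or mutate() on a grouped result still works group by group.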

That is already quite powerful, but it gets even better! You’re not +limited to defining 1 new variable in summarize().


R +

+gdp_pop_bycontinents_byyear <- gapminder %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.

count() and n() +


A very common operation is to count the number of observations for +each group. The dplyr package comes with two related +functions that help with this.


For instance, if we wanted to check the number of countries included +in the dataset for the year 2002, we can use the count() +function. It takes the name of one or more columns that contain the +groups we are interested in, and we can optionally sort the results in +descending order by adding sort=TRUE:


R +

+gapminder %>%
+    filter(year == 2002) %>%
+    count(continent, sort = TRUE)


  continent  n
+1    Africa 52
+2      Asia 33
+3    Europe 30
+4  Americas 25
+5   Oceania  2

If we need to use the number of observations in calculations, the n() function is useful. It will return the total number of observations in the current group rather than counting the number of observations in each group within a specific column. For instance, if we wanted to get the standard error of the life expectancy per continent:


R +

+gapminder %>%
+    group_by(continent) %>%
+    summarize(se_le = sd(lifeExp)/sqrt(n()))


# A tibble: 5 × 2
+  continent se_le
+  <chr>     <dbl>
+1 Africa    0.366
+2 Americas  0.540
+3 Asia      0.596
+4 Europe    0.286
+5 Oceania   0.775
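The calculation inside summarize() is the usual standard error of the mean, sd(x) / sqrt(n), where n is the group size. As a cross-check in base R, with made-up numbers and hypothetical group labels "A" and "B", the same quantity can be computed with tapply(), using length() in place of n():

```r
# Made-up data: group A has four values, group B has two
toy <- data.frame(
  continent = c("A", "A", "A", "A", "B", "B"),
  lifeExp = c(70, 72, 74, 76, 50, 54)
)

# Standard error per group: standard deviation divided by sqrt(group size)
se_by_group <- tapply(
  toy$lifeExp, toy$continent,
  function(x) sd(x) / sqrt(length(x))
)
print(se_by_group)
```

Working through group B by hand: the two values differ from their mean by 2 each, so the sample standard deviation is sqrt(8) and dividing by sqrt(2) gives a standard error of 2.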

You can also chain together several summary operations; in this case +calculating the minimum, maximum, +mean and se of each continent’s per-country +life-expectancy:


R +

+gapminder %>%
+    group_by(continent) %>%
+    summarize(
+      mean_le = mean(lifeExp),
+      min_le = min(lifeExp),
+      max_le = max(lifeExp),
+      se_le = sd(lifeExp)/sqrt(n()))


# A tibble: 5 × 5
+  continent mean_le min_le max_le se_le
+  <chr>       <dbl>  <dbl>  <dbl> <dbl>
+1 Africa       48.9   23.6   76.4 0.366
+2 Americas     64.7   37.6   80.7 0.540
+3 Asia         60.1   28.8   82.6 0.596
+4 Europe       71.9   43.6   81.8 0.286
+5 Oceania      74.3   69.1   81.2 0.775

Using mutate() +


We can also create new variables prior to (or even after) summarizing +information using mutate().


R +

+gdp_pop_bycontinents_byyear <- gapminder %>%
+    mutate(gdp_billion = gdpPercap*pop/10^9) %>%
+    group_by(continent,year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop),
+              mean_gdp_billion = mean(gdp_billion),
+              sd_gdp_billion = sd(gdp_billion))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.

Connect mutate with logical filtering: ifelse +


When creating new variables, we can combine this with a logical condition. A simple combination of mutate() and ifelse() facilitates filtering right where it is needed: at the moment of creating something new. This easy-to-read statement is a fast and powerful way of discarding certain data (even though the overall dimension of the data frame will not change) or of updating values depending on the given condition.


R +

+## keeping all data but "filtering" on a certain condition
+# calculate GDP only for countries with a life expectancy above 25
+gdp_pop_bycontinents_byyear_above25 <- gapminder %>%
+    mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              sd_gdpPercap = sd(gdpPercap),
+              mean_pop = mean(pop),
+              sd_pop = sd(pop),
+              mean_gdp_billion = mean(gdp_billion),
+              sd_gdp_billion = sd(gdp_billion))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
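One thing to keep in mind with this pattern is that ifelse() leaves NA in the rows that fail the condition, and mean() and sd() return NA when their input contains NA. The earlier summary table reports a minimum life expectancy of 23.6 for Africa, so at least one group is affected here. A minimal base R sketch with made-up vectors shows the behaviour and the na.rm = TRUE option:

```r
# Made-up values: the first life expectancy is below the threshold of 25
life_exp <- c(23.6, 45.0, 60.0)
gdp <- c(1, 2, 3)

# ifelse() works element by element and puts NA where the condition fails
gdp_masked <- ifelse(life_exp > 25, gdp, NA)
print(gdp_masked)

# A single NA makes the plain mean NA
mean_with_na <- mean(gdp_masked)

# na.rm = TRUE ignores the missing values instead
mean_without_na <- mean(gdp_masked, na.rm = TRUE)

print(mean_with_na)
print(mean_without_na)
```

Whether to pass na.rm = TRUE inside summarize() depends on your analysis: an NA result can be a useful signal that some rows were excluded.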

R +

+## updating only if a certain condition is fulfilled
+# for life expectancies above 40 years, the GDP to be expected in the future is scaled
+gdp_future_bycontinents_byyear_high_lifeExp <- gapminder %>%
+    mutate(gdp_futureExpectation = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>%
+    group_by(continent, year) %>%
+    summarize(mean_gdpPercap = mean(gdpPercap),
+              mean_gdpPercap_expected = mean(gdp_futureExpectation))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.

Combining dplyr and ggplot2 +


First install and load ggplot2:


R +

+install.packages("ggplot2")

R +

+library("ggplot2")

In the plotting lesson we looked at how to make a multi-panel figure +by adding a layer of facet panels using ggplot2. Here is +the code we used (with some extra comments):


R +

+# Filter countries located in the Americas
+americas <- gapminder[gapminder$continent == "Americas", ]
+# Make the plot
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))

This code makes the right plot but it also creates an intermediate +variable (americas) that we might not have any other uses +for. Just as we used %>% to pipe data along a chain of +dplyr functions we can use it to pass data to +ggplot(). Because %>% replaces the first +argument in a function we don’t need to specify the data = +argument in the ggplot() function. By combining +dplyr and ggplot2 functions we can make the +same figure without creating any new variables or modifying the +data.


R +

+gapminder %>%
+  # Filter countries located in the Americas
+  filter(continent == "Americas") %>%
+  # Make the plot
+  ggplot(mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap( ~ country) +
+  theme(axis.text.x = element_text(angle = 45))

More examples of using the function mutate() and the +ggplot2 package.


R +

+gapminder %>%
+  # extract first letter of country name into new column
+  mutate(startsWith = substr(country, 1, 1)) %>%
+  # only keep countries starting with A or Z
+  filter(startsWith %in% c("A", "Z")) %>%
+  # plot lifeExp into facets
+  ggplot(aes(x = year, y = lifeExp, colour = continent)) +
+  geom_line() +
+  facet_wrap(vars(country)) +
+  theme_minimal()
+ +

Advanced Challenge +


Calculate the average life expectancy in 2002 of 2 randomly selected countries for each continent. Then arrange the continent names in reverse order. Hint: Use the dplyr functions arrange() and sample_n(), which have similar syntax to other dplyr functions.

+ +

R +

+lifeExp_2countries_bycontinents <- gapminder %>%
+   filter(year==2002) %>%
+   group_by(continent) %>%
+   sample_n(2) %>%
+   summarize(mean_lifeExp=mean(lifeExp)) %>%
+   arrange(desc(mean_lifeExp))
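A variation on this solution, offered as a sketch rather than the lesson's own answer: since dplyr 1.0.0, sample_n() is superseded by slice_sample(), and calling set.seed() first makes the random draw repeatable. The challenge text asks for continent names in reverse order, so this version sorts on continent. The toy data frame below is a made-up stand-in for gapminder.

```r
library(dplyr)

set.seed(42)  # arbitrary seed so the random draw is repeatable

# Toy stand-in for gapminder (hypothetical values)
toy <- data.frame(
  continent = rep(c("Africa", "Asia", "Europe"), each = 3),
  country   = paste0("country_", 1:9),
  year      = 2002,
  lifeExp   = c(50, 52, 54, 60, 62, 64, 70, 72, 74)
)

lifeExp_2countries <- toy %>%
  filter(year == 2002) %>%
  group_by(continent) %>%
  slice_sample(n = 2) %>%                    # two random rows per continent
  summarize(mean_lifeExp = mean(lifeExp)) %>%
  arrange(desc(continent))                   # reverse alphabetical order

print(lifeExp_2countries)
```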

Other great resources +

+ +
+ +

Keypoints +

  • Use the dplyr package to manipulate data frames.
  • Use select() to choose variables from a data frame.
  • Use filter() to choose data based on values.
  • Use group_by() and summarize() to work with subsets of data.
  • Use mutate() to create new variables.

Content from Data Frame Manipulation with tidyr


Last updated on 2023-10-26


Estimated time 45 minutes

+ +




  • How can I change the layout of a data frame?
  • +


  • To understand the concepts of ‘longer’ and ‘wider’ data frame +formats and be able to convert between them with +tidyr.
  • +

Researchers often want to reshape their data frames from ‘wide’ to +‘longer’ layouts, or vice-versa. The ‘long’ layout or format is +where:

  • each column is a variable
  • each row is an observation

In the purely ‘long’ (or ‘longest’) format, you usually have 1 column +for the observed variable and the other columns are ID variables.


For the ‘wide’ format, each row is often a site/subject/patient and you have multiple observation variables containing the same type of data. These can be either repeated observations over time, or observations of multiple variables (or a mix of both). Data input may be simpler in the ‘wide’ format, and some other applications may prefer it. However, many of R’s functions have been designed assuming you have ‘longer’ formatted data. This tutorial will help you efficiently transform your data shape regardless of original format.

Diagram illustrating the difference between a wide versus long layout of a data frame

Long and wide data frame layouts mainly affect readability. For +humans, the wide format is often more intuitive since we can often see +more of the data on the screen due to its shape. However, the long +format is more machine readable and is closer to the formatting of +databases. The ID variables in our data frames are similar to the fields +in a database and observed variables are like the database values.
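To make the two layouts concrete, here is a small hand-built sketch in base R. The site names and temperature values are invented for illustration; the same four measurements are stored once in a wide layout and once in a long layout.

```r
# Wide: one row per site, one column per sampling year (hypothetical values)
wide <- data.frame(
  site      = c("s1", "s2"),
  temp_2001 = c(10.5, 12.1),
  temp_2002 = c(11.0, 12.4)
)

# Long: one row per observation; site and year are ID variables,
# temp is the single observed variable
long <- data.frame(
  site = c("s1", "s1", "s2", "s2"),
  year = c(2001L, 2002L, 2001L, 2002L),
  temp = c(10.5, 11.0, 12.1, 12.4)
)

dim(wide)  # fewer rows, more columns
dim(long)  # more rows, fewer observation columns
```

Both objects hold identical information; only the arrangement differs.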


Getting started +


First install the packages if you haven’t already done so (you +probably installed dplyr in the previous lesson):


R +

+install.packages("tidyr")
+install.packages("dplyr")

Load the packages


R +

+library("tidyr")
+library("dplyr")

First, let’s look at the structure of our original gapminder data frame:


R +

+str(gapminder)

'data.frame':	1704 obs. of  6 variables:
+ $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num  779 821 853 836 740 ...
+ +

Challenge 1 +


Is gapminder a purely long, purely wide, or some intermediate +format?

+ +

The original gapminder data.frame is in an intermediate format. It is not purely long since it has multiple observation variables (pop, lifeExp, gdpPercap).


Sometimes, as with the gapminder dataset, we have multiple types of +observed data. It is somewhere in between the purely ‘long’ and ‘wide’ +data formats. We have 3 “ID variables” (continent, +country, year) and 3 “Observation variables” +(pop,lifeExp,gdpPercap). This +intermediate format can be preferred despite not having ALL observations +in 1 column given that all 3 observation variables have different units. +There are few operations that would need us to make this data frame any +longer (i.e. 4 ID variables and 1 Observation variable).


While using many of the functions in R, which are often vector based, you usually do not want to do mathematical operations on values with different units. For example, using the purely long format, a single mean for all of the values of population, life expectancy, and GDP would not be meaningful since it would return the mean of values with 3 incompatible units. The solution is that we first manipulate the data either by grouping (see the lesson on dplyr), or we change the structure of the data frame. Note: Some plotting functions in R actually work better with wide-format data.
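The unit problem can be seen on a tiny example. This is a minimal sketch with invented numbers (the data frame and its values are assumptions): a single mean over a purely long column mixes people with years, while grouping by observation type keeps the units apart.

```r
library(dplyr)

# Purely long toy data: two populations and two life expectancies (hypothetical)
toy_long <- data.frame(
  obs_type   = rep(c("pop", "lifeExp"), each = 2),
  obs_values = c(1000, 3000, 40, 60)
)

# One mean over everything: a number with no meaningful unit
overall_mean <- mean(toy_long$obs_values)

# One mean per observation type: each result has a single unit
by_type <- toy_long %>%
  group_by(obs_type) %>%
  summarize(mean_value = mean(obs_values))

overall_mean
by_type
```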


From wide to long format with pivot_longer() +


Until now, we’ve been using the nicely formatted original gapminder +dataset, but ‘real’ data (i.e. our own research data) will never be so +well organized. Here let’s start with the wide formatted version of the +gapminder dataset.


Download the wide version of the gapminder data from here and save it in your data +folder.


We’ll load the data file and look at it. Note: we don’t want our continent and country columns to be factors, so we set the stringsAsFactors argument of read.csv() to FALSE. (Since R 4.0.0 this is already the default, so the argument only matters on older versions.)


R +

+gap_wide <- read.csv("data/gapminder_wide.csv", stringsAsFactors = FALSE)
+str(gap_wide)


'data.frame':	142 obs. of  38 variables:
+ $ continent     : chr  "Africa" "Africa" "Africa" "Africa" ...
+ $ country       : chr  "Algeria" "Angola" "Benin" "Botswana" ...
+ $ gdpPercap_1952: num  2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num  3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num  2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num  3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num  4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num  4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num  5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num  5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num  5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num  4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num  5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num  6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num  43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num  45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num  48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num  51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num  54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num  58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num  61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num  65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num  67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num  69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num  71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num  72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num  9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num  10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num  11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num  12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num  14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num  17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num  20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num  23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num  26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num  29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : int  31287142 10866106 7026113 1630347 12251209 7021078 15929988 4048013 8835739 614382 ...
+ $ pop_2007      : int  33333216 12420476 8078314 1639131 14326203 8390505 17696293 4369038 10238807 710960 ...
Diagram illustrating the wide format of the gapminder data frame

To change this very wide data frame layout back to our nice, +intermediate (or longer) layout, we will use one of the two available +pivot functions from the tidyr package. To +convert from wide to a longer format, we will use the +pivot_longer() function. pivot_longer() makes +datasets longer by increasing the number of rows and decreasing the +number of columns, or ‘lengthening’ your observation variables into a +single variable.

Diagram illustrating how pivot longer reorganizes a data frame from a wide to long format

R +

+gap_long <- gap_wide %>%
+  pivot_longer(
+    cols = c(starts_with('pop'), starts_with('lifeExp'), starts_with('gdpPercap')),
+    names_to = "obstype_year", values_to = "obs_values"
+  )
+str(gap_long)


tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ continent   : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
+ $ country     : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
+ $ obstype_year: chr [1:5112] "pop_1952" "pop_1957" "pop_1962" "pop_1967" ...
+ $ obs_values  : num [1:5112] 9279525 10270856 11000948 12760499 14760787 ...

Here we have used piping syntax which is similar to what we were +doing in the previous lesson with dplyr. In fact, these are compatible +and you can use a mix of tidyr and dplyr functions by piping them +together.


We first provide to pivot_longer() a vector of column names that will be pivoted into longer format. We could type out all the observation variables, but as in the select() function (see dplyr lesson), we can use the starts_with() helper to select all variables that start with the desired character string. pivot_longer() also allows the alternative syntax of using the - symbol to identify which variables are not to be pivoted (i.e. ID variables).


The next arguments to pivot_longer() are names_to for naming the column that will contain the new ID variable (obstype_year) and values_to for naming the new amalgamated observation variable (obs_values). We supply these new column names as strings.

Diagram illustrating the long format of the gapminder data

R +

+gap_long <- gap_wide %>%
+  pivot_longer(
+    cols = c(-continent, -country),
+    names_to = "obstype_year", values_to = "obs_values"
+  )
+str(gap_long)


tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ continent   : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
+ $ country     : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
+ $ obstype_year: chr [1:5112] "gdpPercap_1952" "gdpPercap_1957" "gdpPercap_1962" "gdpPercap_1967" ...
+ $ obs_values  : num [1:5112] 2449 3014 2551 3247 4183 ...

That may seem trivial with this particular data frame, but sometimes +you have 1 ID variable and 40 observation variables with irregular +variable names. The flexibility is a huge time saver!


Now obstype_year actually contains 2 pieces of information, the observation type (pop, lifeExp, or gdpPercap) and the year. We can use the separate() function to split the character strings into multiple variables:


R +

+gap_long <- gap_long %>% separate(obstype_year, into = c('obs_type', 'year'), sep = "_")
+gap_long$year <- as.integer(gap_long$year)
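As an alternative to the two-step pivot-then-separate approach, pivot_longer() can split the column names itself when names_to is given several names together with names_sep. The sketch below assumes tidyr 1.1.0 or later (for names_transform) and uses a made-up two-row table shaped like gap_wide; the values are invented.

```r
library(tidyr)

# Hypothetical two-country extract shaped like gap_wide
toy_wide <- data.frame(
  continent    = c("C1", "C2"),
  country      = c("A", "B"),
  pop_1952     = c(100, 200),
  pop_1957     = c(110, 220),
  lifeExp_1952 = c(40, 50),
  lifeExp_1957 = c(42, 52)
)

toy_long <- pivot_longer(
  toy_wide,
  cols            = -c(continent, country),    # pivot everything except the ID variables
  names_to        = c("obs_type", "year"),     # split each column name into two parts
  names_sep       = "_",                       # ... at the underscore
  names_transform = list(year = as.integer),   # store year as integer, not character
  values_to       = "obs_values"
)

str(toy_long)
```

This only works cleanly when every pivoted column name contains exactly one separator, which is the case for names such as pop_1952.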
+ +

Challenge 2 +


Using gap_long, calculate the mean life expectancy, population, and gdpPercap for each continent. Hint: use the group_by() and summarize() functions we learned in the dplyr lesson.

+ +

R +

+gap_long %>% group_by(continent, obs_type) %>%
+   summarize(means=mean(obs_values))


`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.


# A tibble: 15 × 3
+# Groups:   continent [5]
+   continent obs_type       means
+   <chr>     <chr>          <dbl>
+ 1 Africa    gdpPercap     2194. 
+ 2 Africa    lifeExp         48.9
+ 3 Africa    pop        9916003. 
+ 4 Americas  gdpPercap     7136. 
+ 5 Americas  lifeExp         64.7
+ 6 Americas  pop       24504795. 
+ 7 Asia      gdpPercap     7902. 
+ 8 Asia      lifeExp         60.1
+ 9 Asia      pop       77038722. 
+10 Europe    gdpPercap    14469. 
+11 Europe    lifeExp         71.9
+12 Europe    pop       17169765. 
+13 Oceania   gdpPercap    18622. 
+14 Oceania   lifeExp         74.3
+15 Oceania   pop        8874672. 

From long to intermediate format with pivot_wider() +


It is always good to check work. So, let’s use the second +pivot function, pivot_wider(), to ‘widen’ our +observation variables back out. pivot_wider() is the +opposite of pivot_longer(), making a dataset wider by +increasing the number of columns and decreasing the number of rows. We +can use pivot_wider() to pivot or reshape our +gap_long to the original intermediate format or the widest +format. Let’s start with the intermediate format.


The pivot_wider() function takes names_from +and values_from arguments.


To names_from we supply the column name whose contents +will be pivoted into new output columns in the widened data frame. The +corresponding values will be added from the column named in the +values_from argument.


R +

+gap_normal <- gap_long %>%
+  pivot_wider(names_from = obs_type, values_from = obs_values)
+dim(gap_normal)


[1] 1704    6

R +

+dim(gapminder)

[1] 1704    6

R +

+names(gap_normal)

[1] "continent" "country"   "year"      "gdpPercap" "lifeExp"   "pop"      

R +

+names(gapminder)

[1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"

Now we’ve got an intermediate data frame gap_normal with +the same dimensions as the original gapminder, but the +order of the variables is different. Let’s fix that before checking if +they are all.equal().


R +

+gap_normal <- gap_normal[, names(gapminder)]
+all.equal(gap_normal, gapminder)


[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                
+[3] "Component \"country\": 1704 string mismatches"                                         
+[4] "Component \"pop\": Mean relative difference: 1.634504"                                 
+[5] "Component \"continent\": 1212 string mismatches"                                       
+[6] "Component \"lifeExp\": Mean relative difference: 0.203822"                             
+[7] "Component \"gdpPercap\": Mean relative difference: 1.162302"                           

R +

+head(gap_normal)

# A tibble: 6 × 6
+  country  year      pop continent lifeExp gdpPercap
+  <chr>   <int>    <dbl> <chr>       <dbl>     <dbl>
+1 Algeria  1952  9279525 Africa       43.1     2449.
+2 Algeria  1957 10270856 Africa       45.7     3014.
+3 Algeria  1962 11000948 Africa       48.3     2551.
+4 Algeria  1967 12760499 Africa       51.4     3247.
+5 Algeria  1972 14760787 Africa       54.5     4183.
+6 Algeria  1977 17152804 Africa       58.0     4910.

R +

+head(gapminder)

      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+4 Afghanistan 1967 11537966      Asia  34.020  836.1971
+5 Afghanistan 1972 13079460      Asia  36.088  739.9811
+6 Afghanistan 1977 14880372      Asia  38.438  786.1134

We’re almost there, the original was sorted by country, +then year.


R +

+gap_normal <- gap_normal %>% arrange(country, year)
+all.equal(gap_normal, gapminder)


[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                

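That’s great! We’ve gone from the longest format back to the intermediate and we didn’t introduce any errors in our code. The two remaining messages concern only the class attribute: gap_normal is a tibble, while gapminder is a plain data.frame.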
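If a plain TRUE is preferred from the comparison, one option is to convert the tibble back to a data.frame first. This is a minimal sketch on invented data (the objects df and tb are assumptions for illustration), relying on the tibble package that tidyr and dplyr already depend on.

```r
library(tibble)

df <- data.frame(id = 1:3, value = c(2.5, 3.5, 4.5))
tb <- as_tibble(df)

class(df)  # a plain data.frame
class(tb)  # a tibble carries additional classes

# Dropping the tibble classes leaves two objects with the same contents
same_data <- all.equal(as.data.frame(tb), df)
same_data
```

In the lesson's setting, the analogous call would be all.equal(as.data.frame(gap_normal), gapminder).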


Now let’s convert the long all the way back to the wide. In the wide +format, we will keep country and continent as ID variables and pivot the +observations across the 3 metrics +(pop,lifeExp,gdpPercap) and time +(year). First we need to create appropriate labels for all +our new variables (time*metric combinations) and we also need to unify +our ID variables to simplify the process of defining +gap_wide.


R +

+gap_temp <- gap_long %>% unite(var_ID, continent, country, sep = "_")
+str(gap_temp)


tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
+ $ var_ID    : chr [1:5112] "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" ...
+ $ obs_type  : chr [1:5112] "gdpPercap" "gdpPercap" "gdpPercap" "gdpPercap" ...
+ $ year      : int [1:5112] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ obs_values: num [1:5112] 2449 3014 2551 3247 4183 ...

R +

+gap_temp <- gap_long %>%
+    unite(ID_var, continent, country, sep = "_") %>%
+    unite(var_names, obs_type, year, sep = "_")
+str(gap_temp)


tibble [5,112 × 3] (S3: tbl_df/tbl/data.frame)
+ $ ID_var    : chr [1:5112] "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" ...
+ $ var_names : chr [1:5112] "gdpPercap_1952" "gdpPercap_1957" "gdpPercap_1962" "gdpPercap_1967" ...
+ $ obs_values: num [1:5112] 2449 3014 2551 3247 4183 ...

Using unite() we now have a single ID variable which is a combination of continent and country, and we have defined variable names. We’re now ready to pipe in pivot_wider():


R +

+gap_wide_new <- gap_long %>%
+  unite(ID_var, continent, country, sep = "_") %>%
+  unite(var_names, obs_type, year, sep = "_") %>%
+  pivot_wider(names_from = var_names, values_from = obs_values)
+str(gap_wide_new)


tibble [142 × 37] (S3: tbl_df/tbl/data.frame)
+ $ ID_var        : chr [1:142] "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
+ $ gdpPercap_1952: num [1:142] 2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num [1:142] 3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num [1:142] 2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num [1:142] 3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num [1:142] 4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num [1:142] 4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num [1:142] 5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num [1:142] 5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num [1:142] 5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num [1:142] 4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num [1:142] 5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num [1:142] 6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num [1:142] 43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num [1:142] 45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num [1:142] 48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num [1:142] 51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num [1:142] 54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num [1:142] 58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num [1:142] 61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num [1:142] 65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num [1:142] 67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num [1:142] 69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num [1:142] 71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num [1:142] 72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num [1:142] 9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num [1:142] 10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num [1:142] 11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num [1:142] 12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num [1:142] 14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num [1:142] 17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num [1:142] 20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num [1:142] 23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num [1:142] 26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num [1:142] 29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : num [1:142] 31287142 10866106 7026113 1630347 12251209 ...
+ $ pop_2007      : num [1:142] 33333216 12420476 8078314 1639131 14326203 ...
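The unite() step for the variable names can also be skipped: pivot_wider() accepts several columns in names_from and joins their values with an underscore by default. The following sketch uses a small invented long table (names and values are assumptions) to show the idea.

```r
library(tidyr)

# Hypothetical long table with two countries, two metrics and two years
toy_long <- data.frame(
  country    = rep(c("A", "B"), each = 4),
  obs_type   = rep(c("pop", "pop", "lifeExp", "lifeExp"), times = 2),
  year       = rep(c(1952L, 1957L), times = 4),
  obs_values = c(100, 110, 40, 42, 200, 220, 50, 52)
)

toy_wide <- pivot_wider(
  toy_long,
  names_from  = c(obs_type, year),  # new column names combine both, e.g. pop_1952
  values_from = obs_values
)

names(toy_wide)
```

Columns not named in names_from or values_from (here country) are kept as ID variables, so no united ID column is needed either.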
+ +

Challenge 3 +


Take this 1 step further and create a gap_ludicrously_wide format data frame by pivoting over countries, year and the 3 metrics. Hint: this new data frame should only have 5 rows.

+ +

R +

+gap_ludicrously_wide <- gap_long %>%
+   unite(var_names, obs_type, year, country, sep = "_") %>%
+   pivot_wider(names_from = var_names, values_from = obs_values)

Now we have a great ‘wide’ format data frame, but the ID_var could be more usable. Let’s separate it into 2 variables with separate():


R +

+gap_wide_betterID <- separate(gap_wide_new, ID_var, c("continent", "country"), sep="_")
+gap_wide_betterID <- gap_long %>%
+    unite(ID_var, continent, country, sep = "_") %>%
+    unite(var_names, obs_type, year, sep = "_") %>%
+    pivot_wider(names_from = var_names, values_from = obs_values) %>%
+    separate(ID_var, c("continent","country"), sep = "_")
+str(gap_wide_betterID)


tibble [142 × 38] (S3: tbl_df/tbl/data.frame)
+ $ continent     : chr [1:142] "Africa" "Africa" "Africa" "Africa" ...
+ $ country       : chr [1:142] "Algeria" "Angola" "Benin" "Botswana" ...
+ $ gdpPercap_1952: num [1:142] 2449 3521 1063 851 543 ...
+ $ gdpPercap_1957: num [1:142] 3014 3828 960 918 617 ...
+ $ gdpPercap_1962: num [1:142] 2551 4269 949 984 723 ...
+ $ gdpPercap_1967: num [1:142] 3247 5523 1036 1215 795 ...
+ $ gdpPercap_1972: num [1:142] 4183 5473 1086 2264 855 ...
+ $ gdpPercap_1977: num [1:142] 4910 3009 1029 3215 743 ...
+ $ gdpPercap_1982: num [1:142] 5745 2757 1278 4551 807 ...
+ $ gdpPercap_1987: num [1:142] 5681 2430 1226 6206 912 ...
+ $ gdpPercap_1992: num [1:142] 5023 2628 1191 7954 932 ...
+ $ gdpPercap_1997: num [1:142] 4797 2277 1233 8647 946 ...
+ $ gdpPercap_2002: num [1:142] 5288 2773 1373 11004 1038 ...
+ $ gdpPercap_2007: num [1:142] 6223 4797 1441 12570 1217 ...
+ $ lifeExp_1952  : num [1:142] 43.1 30 38.2 47.6 32 ...
+ $ lifeExp_1957  : num [1:142] 45.7 32 40.4 49.6 34.9 ...
+ $ lifeExp_1962  : num [1:142] 48.3 34 42.6 51.5 37.8 ...
+ $ lifeExp_1967  : num [1:142] 51.4 36 44.9 53.3 40.7 ...
+ $ lifeExp_1972  : num [1:142] 54.5 37.9 47 56 43.6 ...
+ $ lifeExp_1977  : num [1:142] 58 39.5 49.2 59.3 46.1 ...
+ $ lifeExp_1982  : num [1:142] 61.4 39.9 50.9 61.5 48.1 ...
+ $ lifeExp_1987  : num [1:142] 65.8 39.9 52.3 63.6 49.6 ...
+ $ lifeExp_1992  : num [1:142] 67.7 40.6 53.9 62.7 50.3 ...
+ $ lifeExp_1997  : num [1:142] 69.2 41 54.8 52.6 50.3 ...
+ $ lifeExp_2002  : num [1:142] 71 41 54.4 46.6 50.6 ...
+ $ lifeExp_2007  : num [1:142] 72.3 42.7 56.7 50.7 52.3 ...
+ $ pop_1952      : num [1:142] 9279525 4232095 1738315 442308 4469979 ...
+ $ pop_1957      : num [1:142] 10270856 4561361 1925173 474639 4713416 ...
+ $ pop_1962      : num [1:142] 11000948 4826015 2151895 512764 4919632 ...
+ $ pop_1967      : num [1:142] 12760499 5247469 2427334 553541 5127935 ...
+ $ pop_1972      : num [1:142] 14760787 5894858 2761407 619351 5433886 ...
+ $ pop_1977      : num [1:142] 17152804 6162675 3168267 781472 5889574 ...
+ $ pop_1982      : num [1:142] 20033753 7016384 3641603 970347 6634596 ...
+ $ pop_1987      : num [1:142] 23254956 7874230 4243788 1151184 7586551 ...
+ $ pop_1992      : num [1:142] 26298373 8735988 4981671 1342614 8878303 ...
+ $ pop_1997      : num [1:142] 29072015 9875024 6066080 1536536 10352843 ...
+ $ pop_2002      : num [1:142] 31287142 10866106 7026113 1630347 12251209 ...
+ $ pop_2007      : num [1:142] 33333216 12420476 8078314 1639131 14326203 ...
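A caveat when splitting a united ID on "_": if a name itself contains the separator, separate() would find more pieces than requested. Its extra argument controls what happens then. The sketch below uses an invented ID ("Europe_Some_Country" is hypothetical, not a gapminder entry) and extra = "merge", which splits only at the first underscore.

```r
library(tidyr)

# Hypothetical IDs; the second one contains an extra underscore
ids <- data.frame(ID_var = c("Africa_Algeria", "Europe_Some_Country"))

split_ids <- separate(
  ids, ID_var,
  into  = c("continent", "country"),
  sep   = "_",
  extra = "merge"   # keep everything after the first split in the last column
)

split_ids
```

Choosing a separator that cannot occur in the data is another way to avoid the issue.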

R +

+all.equal(gap_wide, gap_wide_betterID)


[1] "Attributes: < Component \"class\": Lengths (1, 3) differ (string compare on first 1) >"
+[2] "Attributes: < Component \"class\": 1 string mismatch >"                                

There and back again!


Other great resources +

+ +
+ +

Keypoints +

  • Use the tidyr package to change the layout of data frames.
  • Use pivot_longer() to go from wide to longer layout.
  • Use pivot_wider() to go from long to wider layout.

Content from Producing Reports With knitr


Last updated on 2023-10-26


Estimated time 75 minutes

+ +




  • How can I integrate software and reports?
  • +


  • Understand the value of writing reproducible reports
  • +
  • Learn how to recognise and compile the basic components of an R +Markdown file
  • +
  • Become familiar with R code chunks, and understand their purpose, +structure and options
  • +
  • Demonstrate the use of inline chunks for weaving R outputs into text +blocks, for example when discussing the results of some +calculations
  • +
  • Be aware of alternative output formats to which an R Markdown file +can be exported
  • +

Data analysis reports +


Data analysts tend to write a lot of reports, describing their +analyses and results, for their collaborators or to document their work +for future reference.


Many new users begin by first writing a single R script containing +all of their work, and then share the analysis by emailing the script +and various graphs as attachments. But this can be cumbersome, requiring +a lengthy discussion to explain which attachment was which result.


Writing formal reports with Word or LaTeX can simplify this +process by incorporating both the analysis report and output graphs into +a single document. But tweaking formatting to make figures look correct +and fixing obnoxious page breaks can be tedious and lead to a lengthy +“whack-a-mole” game of fixing new mistakes resulting from a single +formatting change.


Creating a report as a web page (which is an html file) using R Markdown makes things easier. The report can be one long stream, so tall figures that wouldn’t ordinarily fit on one page can be kept at full size and are easier to read, since the reader can simply keep scrolling. Additionally, the formatting of an R Markdown document is simple and easy to modify, allowing you to spend more time on your analyses instead of writing reports.


Literate programming +


Ideally, such analysis reports are reproducible documents: +If an error is discovered, or if some additional subjects are added to +the data, you can just re-compile the report and get the new or +corrected results rather than having to reconstruct figures, paste them +into a Word document, and hand-edit various detailed results.


The key R package here is knitr. It allows you +to create a document that is a mixture of text and chunks of code. When +the document is processed by knitr, chunks of code will be +executed, and graphs or other results will be inserted into the final +document.


This sort of idea has been called “literate programming”.


knitr allows you to mix basically any type of text with +code from different programming languages, but we recommend that you use +R Markdown, which mixes Markdown with R. Markdown is a light-weight +mark-up language for creating web pages.


Creating an R Markdown file +


Within RStudio, click File → New File → R Markdown and you’ll get a +dialog box like this:

Screenshot of the New R Markdown file dialogue box in RStudio

You can stick with the default (HTML output), but give it a +title.


Basic components of R Markdown +


The initial chunk of text (header) contains instructions for R to +specify what kind of document will be created, and the options chosen. +You can use the header to give your document a title, author, date, and +tell it what type of output you want to produce. In this case, we’re +creating an html document.

---
+title: "Initial R Markdown document"
+author: "Karl Broman"
+date: "April 23, 2015"
+output: html_document
+---

You can delete any of those fields if you don’t want them included. +The double-quotes aren’t strictly necessary in this case. +They’re mostly needed if you want to include a colon in the title.


RStudio creates the document with some example text to get you +started. Note below that there are chunks like


These are chunks of R code that will be executed by +knitr and replaced by their results. More on this +later.


Markdown +


Markdown is a system for writing web pages by marking up the text +much as you would in an email rather than writing html code. The +marked-up text gets converted to html, replacing the marks with +the proper html code.


For now, let’s delete all of the stuff that’s there and write a bit +of markdown.


You make things bold using two asterisks, like this: **bold**, and you make things italic by using underscores, like this: _italics_.


You can make a bulleted list by writing a list with hyphens or +asterisks with a space between the list and other text, like this:

A list:
+* bold with double-asterisks
+* italics with underscores
+* code-type font with backticks

or like this:

A second list:
+- bold with double-asterisks
+- italics with underscores
+- code-type font with backticks

Each will appear as:

  • bold with double-asterisks
  • italics with underscores
  • code-type font with backticks

You can use whatever method you prefer, but be consistent. +This maintains the readability of your code.


You can make a numbered list by just using numbers. You can even use +the same number over and over if you want:

1. bold with double-asterisks
+1. italics with underscores
+1. code-type font with backticks

This will appear as:

  1. bold with double-asterisks
  2. italics with underscores
  3. code-type font with backticks

You can make section headers of different sizes by initiating a line +with some number of # symbols:

# Title
+## Main section
+### Sub-section
+#### Sub-sub section

You compile the R Markdown document to an html webpage by +clicking the “Knit” button in the upper-left.

+ +

Challenge 1 +


Create a new R Markdown document. Delete all of the R code chunks and +write a bit of Markdown (some sections, some italicized text, and an +itemized list).


Convert the document to a webpage.

+ +

In RStudio, select File > New file > R Markdown…


Delete the placeholder text and add the following:

# Introduction
+## Background on Data
+This report uses the *gapminder* dataset, which has columns that include:
+* country
+* continent
+* year
+* lifeExp
+* pop
+* gdpPercap
+## Background on Methods

Then click the ‘Knit’ button on the toolbar to generate an html +document (webpage).


A bit more Markdown +


You can make a hyperlink like this: +[Carpentries Home Page](https://carpentries.org/).


You can include an image file like this: +![The Carpentries Logo](https://carpentries.org/assets/img/TheCarpentries.svg)


You can do subscripts (e.g., F₂) with F~2~ and superscripts (e.g., F²) with F^2^.


If you know how to write equations in LaTeX, you can use +$ $ and $$ $$ to insert math equations, like +$E = mc^2$ and

$$y = \mu + \sum_{i=1}^p \beta_i x_i + \epsilon$$

You can review Markdown syntax by navigating to the “Markdown Quick +Reference” under the “Help” field in the toolbar at the top of +RStudio.


R code chunks +


The real power of Markdown comes from mixing Markdown with chunks of code. This is R Markdown. When the document is processed, the R code is executed; if it produces figures, they are inserted in the final document.


The main code chunks look like this:

+```{r load_data}

That is, you place a chunk of R code between ```{r chunk_name} and ```. You should give each chunk a unique name: names help you to locate errors and, if any graphs are produced, their file names are based on the name of the code chunk that produced them. You can create code chunks quickly in RStudio using the shortcuts Ctrl+Alt+I on Windows and Linux, or Cmd+Option+I on Mac.

+ +

Challenge 2 +


Add code chunks to:

  • Load the ggplot2 package
  • Read the gapminder data
  • Create a plot
+ +
+```{r load-ggplot2}
+```{r read-gapminder-data}
+```{r make-plot}
+plot(lifeExp ~ year, data = gapminder)

How things get compiled +


When you press the “Knit” button, the R Markdown document is +processed by knitr +and a plain Markdown document is produced (as well as, potentially, a +set of figure files): the R code is executed and replaced by both the +input and the output; if figures are produced, links to those figures +are included.


The Markdown and figure documents are then processed by the tool pandoc, which converts the +Markdown file into an html file, with the figures embedded.


Chunk options +


There are a variety of options to affect how the code chunks are +treated. Here are some examples:

  • Use echo=FALSE to avoid having the code itself shown.
  • Use results="hide" to avoid having any results printed.
  • Use eval=FALSE to have the code shown but not evaluated.
  • Use warning=FALSE and message=FALSE to hide any warnings or messages produced.
  • Use fig.height and fig.width to control the size of the figures produced (in inches).

So you might write:

+```{r load_libraries, echo=FALSE, message=FALSE}

Often there will be particular options that you’ll want to use +repeatedly; for this, you can set global chunk options, like +so:

+```{r global_options, echo=FALSE}
+knitr::opts_chunk$set(fig.path="Figs/", message=FALSE, warning=FALSE,
+                      echo=FALSE, results="hide", fig.width=11)

The fig.path option defines where the figures will be +saved. The / here is really important; without it, the +figures would be saved in the standard place but just with names that +begin with Figs.


If you have multiple R Markdown files in a common directory, you +might want to use fig.path to define separate prefixes for +the figure file names, like fig.path="Figs/cleaning-" and +fig.path="Figs/analysis-".
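As a rough illustration of why the trailing slash matters, the sketch below mimics how a figure file name might be assembled. It assumes the name is built by pasting together fig.path, the chunk label, and a counter; figure_file is a hypothetical helper written for this example, not a knitr function.

```r
# Hypothetical helper (not part of knitr): mimics how a figure file name
# is assumed to be built from fig.path, the chunk label, and a counter.
figure_file <- function(fig_path, label, number = 1, ext = "png") {
  paste0(fig_path, label, "-", number, ".", ext)
}

with_slash    <- figure_file("Figs/", "make-plot")           # lands inside Figs/
without_slash <- figure_file("Figs", "make-plot")            # "Figs" is only a name prefix
with_prefix   <- figure_file("Figs/cleaning-", "make-plot")  # per-document prefix

print(c(with_slash, without_slash, with_prefix))
```

With the slash, the figure goes into a Figs folder; without it, "Figs" simply becomes the start of the file name.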

+ +

Challenge 3 +


Use chunk options to control the size of a figure and to hide the +code.

+ +
+```{r echo = FALSE, fig.width = 3}

You can review all of the R chunk options by navigating +to the “R Markdown Cheat Sheet” under the “Cheatsheets” section of the +“Help” field in the toolbar at the top of RStudio.


Inline R code +


You can make every number in your report reproducible. Use +`r and ` for an in-line code chunk, like so: +`r round(some_value, 2)`. The code will be executed and +replaced with the value of the result.


Don’t let these in-line chunks get split across lines.


Perhaps precede the paragraph with a larger code chunk that does calculations and defines variables, with include=FALSE for that larger chunk (which is similar to echo=FALSE and results="hide", but also suppresses figures, messages, and warnings).


Rounding can produce differences in output in such situations. You +may want 2.0, but round(2.03, 1) will give +just 2.


The myround +function in the R/broman +package handles this.
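If you would rather stay in base R, you can request a fixed number of decimal places when formatting the value as text. The sketch below shows one possible approach using only sprintf() and format(); the variable names are illustrative.

```r
x <- 2.03

# round() returns a number, so the trailing zero is lost when it is printed.
rounded <- round(x, 1)
as_text <- as.character(rounded)

# Formatting as text keeps exactly one decimal place.
fixed_sprintf <- sprintf("%.1f", x)
fixed_format  <- format(round(x, 1), nsmall = 1)

print(c(as_text, fixed_sprintf, fixed_format))
```

In a report, this could be used in-line as `r sprintf("%.1f", some_value)`.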

+ +

Challenge 4 +


Try out a bit of in-line R code.

+ +

Here’s some inline code to determine that 2 + 2 = 4.


Other output options +


You can also convert R Markdown to a PDF or a Word document. Click +the little triangle next to the “Knit” button to get a drop-down menu. +Or you could put pdf_document or word_document +in the initial header of the file.

+ +

Tip: Creating PDF documents +


Creating .pdf documents may require installation of some extra +software. The R package tinytex provides some tools to help +make this process easier for R users. With tinytex +installed, run tinytex::install_tinytex() to install the +required software (you’ll only need to do this once) and then when you +knit to pdf tinytex will automatically detect and install +any additional LaTeX packages that are needed to produce the pdf +document. Visit the tinytex +website for more information.

+ +

Tip: Visual markdown editing in RStudio +


RStudio versions 1.4 and later include visual markdown editing mode. +In visual editing mode, markdown expressions (like +**bold words**) are transformed to the formatted appearance +(bold words) as you type. This mode also includes a +toolbar at the top with basic formatting buttons, similar to what you +might see in common word processing software programs. You can turn +visual editing on and off by pressing the button in the top right corner of your +R Markdown document.


Resources +

+ +
+ +

Keypoints +

  • Mix reporting written in R Markdown with software written in R.
  • Specify chunk options to control formatting.
  • Use knitr to convert these documents into PDF and other formats.

Content from Writing Good Software


Last updated on 2023-10-26 | + + Edit this page


Estimated time 15 minutes

+ +




  • How can I write software that other people can use?
  • +


  • Describe best practices for writing R and explain the justification +for each.
  • +

Structure your project folder +


Keep your project folder structured, organized and tidy, by creating +subfolders for your code files, manuals, data, binaries, output plots, +etc. It can be done completely manually, or with the help of RStudio’s +New Project functionality, or a designated package, such as +ProjectTemplate.

+ +

Tip: ProjectTemplate - a possible +solution +


One way to automate the management of projects is to install the +third-party package, ProjectTemplate. This package will set +up an ideal directory structure for project management. This is very +useful as it enables you to have your analysis pipeline/workflow +organised and structured. Together with the default RStudio project +functionality and Git you will be able to keep track of your work as +well as be able to share your work with collaborators.

  1. Install ProjectTemplate.
  2. Load the library
  3. Initialise the project:

R +

+create.project("../my_project_2", merge.strategy = "allow.non.conflict")

For more information on ProjectTemplate and its functionality visit +the home page ProjectTemplate


Make code readable +


The most important part of writing code is making it readable and understandable. You want someone else to be able to pick up your code and understand what it does: more often than not this someone will be you, six months down the line, who will otherwise be cursing your past self.


Documentation: tell us what and why, not how +


When you first start out, your comments will often describe what a +command does, since you’re still learning yourself and it can help to +clarify concepts and remind you later. However, these comments aren’t +particularly useful later on when you don’t remember what problem your +code is trying to solve. Try to also include comments that tell you +why you’re solving a problem, and what problem that +is. The how can come after that: it’s an implementation detail +you ideally shouldn’t have to worry about.


Keep your code modular +


Our recommendation is that you should separate your functions from +your analysis scripts, and store them in a separate file that you +source when you open the R session in your project. This +approach is nice because it leaves you with an uncluttered analysis +script, and a repository of useful functions that can be loaded into any +analysis script in your project. It also lets you group related +functions together easily.
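A minimal sketch of this layout is shown below. To keep the example self-contained it writes the function file to a temporary location first; in a real project that file would be one you maintain by hand, and both the file and the function name fahr_to_celsius are assumptions made for illustration.

```r
# Stand-in for a hand-written functions file kept separately in the project.
functions_file <- tempfile(fileext = ".R")
writeLines(c(
  "# Convert a temperature from Fahrenheit to Celsius.",
  "fahr_to_celsius <- function(temp) {",
  "  (temp - 32) * 5 / 9",
  "}"
), functions_file)

# The analysis script stays short: load the functions, then use them.
source(functions_file)
boiling <- fahr_to_celsius(212)
print(boiling)
```

The analysis script contains only the source() call and the analysis itself, while the function definitions live in their own file.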


Break down problem into bite size pieces +


When you first start out, problem solving and function writing can be daunting tasks, and hard to separate from inexperience with coding. Try to break down your problem into digestible chunks and worry about the implementation details later: keep breaking down the problem into smaller and smaller functions until you reach a point where you can code a solution, and build back up from there.


Know that your code is doing the right thing +


Make sure to test your functions!
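One lightweight way to do this, sketched below, is to keep a small check function that compares results against values you already know to be correct, using only base R. The function fahr_to_kelvin and the check function are hypothetical examples; dedicated testing packages such as testthat offer a fuller framework.

```r
# Hypothetical function under test.
fahr_to_kelvin <- function(temp) {
  ((temp - 32) * (5 / 9)) + 273.15
}

# A re-runnable check: stops with an error if any expectation fails.
check_fahr_to_kelvin <- function() {
  stopifnot(
    isTRUE(all.equal(fahr_to_kelvin(32), 273.15)),   # freezing point of water
    isTRUE(all.equal(fahr_to_kelvin(212), 373.15))   # boiling point of water
  )
  TRUE
}

tests_passed <- check_fahr_to_kelvin()
print(tests_passed)
```

Re-running the check after every change to the function gives quick feedback that it still behaves as expected.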


Don’t repeat yourself +


Functions enable easy reuse within a project. If you see blocks of +similar lines of code through your project, those are usually candidates +for being moved into functions.


If your calculations are performed through a series of functions, then the project becomes more modular and easier to change. This is especially true of functions for which a particular input always gives a particular output.
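As a small illustration, the sketch below first repeats the same calculation for two vectors and then moves it into a function. The data and the name rescale01 are made up for this example.

```r
a <- c(1, 2, 3)
b <- c(10, 20, 30)

# Repetitive version: the same formula is copied for each vector.
a_scaled <- (a - min(a)) / (max(a) - min(a))
b_scaled <- (b - min(b)) / (max(b) - min(b))

# Reusable version: the formula lives in one place.
rescale01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

a_from_function <- rescale01(a)
b_from_function <- rescale01(b)
print(a_from_function)
```

If the formula ever needs to change, only the function body has to be edited.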


Remember to be stylish +


Apply consistent style to your code.

+ +

Keypoints +

  • Keep your project folder structured, organized and tidy.
  • Document what and why, not how.
  • Break programs into short single-purpose functions.
  • Write re-runnable tests.
  • Don’t repeat yourself.
  • Be consistent in naming, indentation, and other aspects of style.
+ + +
+ + +
+ + + + + diff --git a/instructor/discuss.html b/instructor/discuss.html new file mode 100644 index 000000000..657bdfa46 --- /dev/null +++ b/instructor/discuss.html @@ -0,0 +1,445 @@ + +R for Reproducible Scientific Analysis: Discussion +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +



Last updated on 2023-10-26 | + + Edit this page

+ + + + + +
+ +
+ + +

Please see our other R +lesson for a different presentation of these concepts.

+ + +
+ + +
+ + + diff --git a/instructor/images.html b/instructor/images.html new file mode 100644 index 000000000..0608da5f0 --- /dev/null +++ b/instructor/images.html @@ -0,0 +1,643 @@ + + + + + +R for Reproducible Scientific Analysis: All Images + + + + + + + + + + + +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + + +
+ + +

Introduction to R and RStudio


Figure 1

+ +
RStudio layout


Figure 2

+ +
RStudio layout with .R file open

Project Management With RStudio


Figure 1

+ +
Screenshot of file manager demonstrating bad project organisation

Seeking Help


Data Structures


Exploring Data Frames


Subsetting Data


Figure 1

+ +
Inequality testing


Figure 2

+ +
Inequality testing: results of recycling

Control Flow


Creating Publication-Quality Graphics with ggplot2


Figure 1

+ +
Blank plot, before adding any mapping aesthetics to ggplot().


Figure 2

+ +
Plotting area with axes for a scatter plot of life expectancy vs GDP, with no data points visible.


Figure 3

+ +
Scatter plot of life expectancy vs GDP per capita, now showing the data points.


Figure 4

+ +
Binned scatterplot of life expectancy versus year showing how life expectancy has increased over time


Figure 5

+ +
Binned scatterplot of life expectancy vs year with color-coded continents showing value of 'aes' function


Figure 6



Figure 7



Figure 8



Figure 9



Figure 10

+ +
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency.


Figure 11



Figure 12

+ +
Scatterplot of GDP vs life expectancy showing logarithmic x-axis data spread


Figure 13

+ +
Scatter plot of life expectancy vs GDP per capita with a blue trend line summarising the relationship between variables, and gray shaded area indicating 95% confidence intervals for that trend line.


Figure 14

+ +
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The blue trend line is slightly thicker than in the previous figure.


Figure 15

+ +
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency.


Figure 16



Figure 17



Figure 18



Figure 19




Figure 1

+ +
Scatter plot showing populations in the millions against the year for China, India, and Indonesia, countries are not labeled.


Figure 2

+ +
Scatter plot showing populations in the millions against the year for China, India, and Indonesia, countries are not labeled.

Functions Explained


Writing Data


Splitting and Combining Data Frames with plyr


Figure 1

+ +
Split apply combine


Figure 2

+ +
Full apply suite

Data Frame Manipulation with dplyr


Figure 1

+ +

Diagram illustrating use of select function to select two columns of a data frame +If we want to remove one column only from the gapminder +data, for example, removing the continent column.


Figure 2

+ +
Diagram illustrating how the group by function organizes a data frame into groups


Figure 3

+ +
Diagram illustrating the use of group by and summarize together to create a new variable


Figure 4



Figure 5



Figure 6


Data Frame Manipulation with tidyr


Figure 1

+ +
Diagram illustrating the difference between a wide versus long layout of a data frame


Figure 2

+ +
Diagram illustrating the wide format of the gapminder data frame


Figure 3

+ +
Diagram illustrating how pivot longer reorganizes a data frame from a wide to long format


Figure 4

+ +
Diagram illustrating the long format of the gapminder data

Producing Reports With knitr


Figure 1

+ +
Screenshot of the New R Markdown file dialogue box in RStudio


Figure 2



Figure 3


RStudio versions 1.4 and later include visual markdown editing mode. +In visual editing mode, markdown expressions (like +**bold words**) are transformed to the formatted appearance +(bold words) as you type. This mode also includes a +toolbar at the top with basic formatting buttons, similar to what you +might see in common word processing software programs. You can turn +visual editing on and off by pressing the button in the top right corner of your +R Markdown document.


Writing Good Software

+ + +
+ + +
+ + + + + diff --git a/instructor/index.html b/instructor/index.html new file mode 100644 index 000000000..bff8efe7c --- /dev/null +++ b/instructor/index.html @@ -0,0 +1,624 @@ + +R for Reproducible Scientific Analysis: Summary and Schedule +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Summary and Schedule

+ + +

an introduction to R for non-programmers using gapminder +data


The goal of this lesson is to teach novice programmers to write modular code and to follow best practices when using R for data analysis. R is widely used across scientific disciplines for statistical analysis, and is valued for its array of third-party packages. We find that many scientists who come to Software Carpentry workshops use R and want to learn more. The emphasis of these materials is to give attendees a strong foundation in the fundamentals of R, and to teach best practices for scientific computing: breaking down analyses into modular units, task automation, and encapsulation.


Note that this workshop will focus on teaching the fundamentals of +the programming language R, and will not teach statistical analysis.


The lesson contains more material than can be taught in a day. The instructor notes page has some +suggested lesson plans suitable for a one or half day workshop.


A variety of third party packages are used throughout this workshop. +These are not necessarily the best, nor are they comprehensive, but they +are packages we find useful, and have been chosen primarily for their +usability.

+ +

Prerequisites +


Understand that computers store data and instructions (programs, +scripts etc.) in files. Files are organised in directories (folders). +Know how to access files not in the working directory by specifying the +path.

+ + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

+ The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor. +


This lesson assumes you have R and RStudio installed on your +computer.

+ + +
+ + + diff --git a/instructor/instructor-notes.html b/instructor/instructor-notes.html new file mode 100644 index 000000000..ee9c835ad --- /dev/null +++ b/instructor/instructor-notes.html @@ -0,0 +1,641 @@ + + + + + +R for Reproducible Scientific Analysis: Instructor Notes + + + + + + + + + + + +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + + +

Instructor Notes

+ + +

Timing +


Leave about 30 minutes at the start of each workshop and another 15 +mins at the start of each session for technical difficulties like WiFi +and installing things (even if you asked students to install in advance, +longer if not).


Lesson Plans +


The lesson contains much more material than can be taught in a day. +Instructors will need to pick an appropriate subset of episodes to use +in a standard one day course.


Some suggested paths through the material are:


(suggested by @liz-is)

  • 01 Introduction to R and RStudio
  • 04 Data Structures
  • 05 Exploring Data Frames (“Realistic example” section onwards)
  • 08 Creating Publication-Quality Graphics with ggplot2
  • 10 Functions Explained
  • 13 Dataframe Manipulation with dplyr
  • 15 Producing Reports With knitr

(suggested by @naupaka)

  • 01 Introduction to R and RStudio
  • 02 Project Management With RStudio
  • 03 Seeking Help
  • 04 Data Structures
  • 05 Exploring Data Frames
  • 06 Subsetting Data
  • 09 Vectorization
  • 08 Creating Publication-Quality Graphics with ggplot2 OR 13 Dataframe Manipulation with dplyr
  • 15 Producing Reports With knitr

A half day course could consist of (suggested by @karawoo):

  • 01 Introduction to R and RStudio
  • 04 Data Structures (only creating vectors with c())
  • 05 Exploring Data Frames (“Realistic example” section onwards)
  • 06 Subsetting Data (excluding factor, matrix and list subsetting)
  • 08 Creating Publication-Quality Graphics with ggplot2

Setting up git in RStudio +


There can be difficulties linking git to RStudio depending on the +operating system and the version of the operating system. To make sure +Git is properly installed and configured, the learners should go to the +Options window in the RStudio application.

  • Mac OS X:
    • Go to RStudio -> Preferences… -> Git/SVN
    • Check whether there is a path to a file in the “Git executable” window. If not, the next challenge is figuring out where Git is located.
    • In the terminal enter which git and you will get a path to the git executable. In the “Git executable” window you may have difficulties finding the directory since OS X hides many of the operating system files. While the file selection window is open, pressing “Command-Shift-G” will pop up a text entry box where you will be able to type or paste in the full path to your git executable: e.g. /usr/bin/git or whatever else it might be.
  • Windows:
    • Go to Tools -> Global options… -> Git/SVN
    • If you use the Software Carpentry Installer, then ‘git.exe’ should be installed at C:/Program Files/Git/bin/git.exe.

To prevent the learners from having to re-enter their password each +time they push a commit to GitHub, this command (which can be run from a +bash prompt) will make it so they only have to enter their password +once:



$ git config --global credential.helper 'cache --timeout=10000000'

Pulling in Data +


The easiest way to get the data used in this lesson during a workshop +is to have attendees download the raw data from gapminder-data and gapminder-data-wide.


Attendees can use the File - Save As dialog in their +browser to save the file.


Overall +


Make sure to emphasize good practices: put code in scripts, and make +sure they’re version controlled. Encourage students to create script +files for challenges.


If you’re working in a cloud environment, get them to upload the +gapminder data after the second lesson.


Make sure to emphasize that matrices are vectors underneath the hood +and data frames are lists underneath the hood: this will explain a lot +of the esoteric behaviour encountered in basic operations.


Vector recycling and function stacks are probably best explained with +diagrams on a whiteboard.


Be sure to actually go through examples of an R help page: help files +can be intimidating at first, but knowing how to read them is +tremendously useful.


Be sure to show the CRAN task views, look at one of the topics.


There’s a lot of content: move quickly through the earlier lessons. +Their extensiveness is mostly for purposes of learning by osmosis: so +that their memory will trigger later when they encounter a problem or +some esoteric behaviour.


Key lessons to take time on:

  • Data subsetting - conceptually difficult for novices
  • +
  • Functions - learners especially struggle with this
  • +
  • Data structures - worth being thorough, but you can go through it +quickly.
  • +

Don’t worry about being correct or knowing the material +back-to-front. Use mistakes as teaching moments: the most vital skill +you can impart is how to debug and recover from unexpected errors.


Introduction to R and RStudio


Project Management With RStudio


Seeking Help


Data Structures


Exploring Data Frames


Subsetting Data


Control Flow


Creating Publication-Quality Graphics with ggplot2




Functions Explained


Writing Data


Splitting and Combining Data Frames with plyr


Data Frame Manipulation with dplyr


Data Frame Manipulation with tidyr


Producing Reports With knitr


Writing Good Software

+ + +
+ + +
+ + + + + diff --git a/instructor/key-points.html b/instructor/key-points.html new file mode 100644 index 000000000..fb94941a2 --- /dev/null +++ b/instructor/key-points.html @@ -0,0 +1,619 @@ + + + + + +R for Reproducible Scientific Analysis: Key Points + + + + + + + + + + + +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + + +
+ + +

Introduction to R and RStudio

  • Use RStudio to write and run R programs.
  • R has the usual arithmetic operators and mathematical functions.
  • Use <- to assign values to variables.
  • Use ls() to list the variables in a program.
  • Use rm() to delete objects in a program.
  • Use install.packages() to install packages (libraries).

Project Management With RStudio

  • Use RStudio to create and manage projects with consistent layout.
  • Treat raw data as read-only.
  • Treat generated output as disposable.
  • Separate function definition and application.

Seeking Help

  • Use help() to get online help in R.

Data Structures

  • Use read.csv to read tabular data in R.
  • The basic data types in R are double, integer, complex, logical, and character.
  • Data structures such as data frames or matrices are built on top of lists and vectors, with some added attributes.

Exploring Data Frames

  • Use cbind() to add a new column to a data frame.
  • Use rbind() to add a new row to a data frame.
  • Remove rows from a data frame.
  • Use str(), summary(), nrow(), ncol(), dim(), colnames(), rownames(), head(), and typeof() to understand the structure of a data frame.
  • Read in a csv file using read.csv().
  • Understand what length() of a data frame represents.

Subsetting Data

  • Indexing in R starts at 1, not 0.
  • Access individual values by location using [].
  • Access slices of data using [low:high].
  • Access arbitrary sets of data using [c(...)].
  • Use logical operations and logical vectors to access subsets of data.

Control Flow

  • Use if and else to make choices.
  • Use for to repeat operations.

Creating Publication-Quality Graphics with ggplot2

  • Use ggplot2 to create plots.
  • Think about graphics in layers: aesthetics, geometry, statistics, scale transformation, and grouping.


  • Use vectorized operations instead of loops.

Functions Explained

  • Use function to define a new function in R.
  • Use parameters to pass values into functions.
  • Use stopifnot() to flexibly check function arguments in R.
  • Load functions into programs using source().

Writing Data

  • Save plots from RStudio using the ‘Export’ button.
  • Use write.table to save tabular data.

Splitting and Combining Data Frames with plyr

  • Use the plyr package to split data, apply functions to subsets, and combine the results.

Data Frame Manipulation with dplyr

  • Use the dplyr package to manipulate data frames.
  • Use select() to choose variables from a data frame.
  • Use filter() to choose data based on values.
  • Use group_by() and summarize() to work with subsets of data.
  • Use mutate() to create new variables.

Data Frame Manipulation with tidyr

  • Use the tidyr package to change the layout of data frames.
  • Use pivot_longer() to go from wide to longer layout.
  • Use pivot_wider() to go from long to wider layout.

Producing Reports With knitr

  • Mix reporting written in R Markdown with software written in R.
  • Specify chunk options to control formatting.
  • Use knitr to convert these documents into PDF and other formats.

Writing Good Software

  • Keep your project folder structured, organized and tidy.
  • Document what and why, not how.
  • Break programs into short single-purpose functions.
  • Write re-runnable tests.
  • Don’t repeat yourself.
  • Be consistent in naming, indentation, and other aspects of style.
+ + +
+ + +
+ + + + + diff --git a/instructor/profiles.html b/instructor/profiles.html new file mode 100644 index 000000000..82d9b3080 --- /dev/null +++ b/instructor/profiles.html @@ -0,0 +1,401 @@ + +R for Reproducible Scientific Analysis: Learner Profiles +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +

Learner Profiles

+ +

This is a placeholder file. Please add content here.

+ +
+ + +
+ + + diff --git a/instructor/reference.html b/instructor/reference.html new file mode 100644 index 000000000..d7417cfee --- /dev/null +++ b/instructor/reference.html @@ -0,0 +1,963 @@ + +R for Reproducible Scientific Analysis: Reference +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + +



Last updated on 2023-10-26 | + + Edit this page

+ + + + + +
+ +
+ + + +

Reference +


+Introduction to R and +RStudio +

  • Use the escape key to cancel incomplete commands or running code (or Ctrl+C if you’re using R from the shell).
  • Basic arithmetic operations follow standard order of precedence:
    • Brackets: (, )
    • Exponents: ^ or **
    • Divide: /
    • Multiply: *
    • Add: +
    • Subtract: -
  • Scientific notation is available, e.g: 2e-3
  • Anything to the right of a # is a comment, R will ignore this!
  • Functions are denoted by function_name(). Expressions inside the brackets are evaluated before being passed to the function, and functions can be nested.
  • Mathematical functions: exp, sin, log, log10, log2 etc.
  • Comparison operators: <, <=, >, >=, ==, !=
  • Use all.equal to compare numbers!
  • <- is the assignment operator. Anything to the right is evaluated, then stored in a variable named to the left.
  • ls lists all variables and functions you’ve created
  • rm can be used to remove them
  • When assigning values to function arguments, you must use =.
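The advice to use all.equal can be illustrated with a short sketch. It uses only base R, and the variable names are illustrative.

```r
# Decimal fractions are stored approximately, so exact comparison can surprise.
sum_of_tenths <- 0.1 + 0.2

exact_match  <- sum_of_tenths == 0.3                    # strict equality
close_enough <- isTRUE(all.equal(sum_of_tenths, 0.3))   # equality within a small tolerance

print(c(exact_match = exact_match, close_enough = close_enough))
```

Wrapping all.equal() in isTRUE() is a common habit, because all.equal() returns a description of the difference rather than FALSE when the values differ.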

+Project management with +RStudio +

  • To create a new project, go to File -> New Project
  • Install the packrat package to create self-contained projects
  • install.packages to install packages from CRAN
  • library to load a package into R
  • packrat::status to check whether all packages referenced in your scripts have been installed.

+Seeking help +

  • To access help for a function type ?function_name or help(function_name)
  • Use quotes for special operators e.g. ?"+"
  • Use fuzzy search if you can’t remember a name: ??search_term
  • CRAN task views are a good starting point.
  • Stack Overflow is a good place to get help with your code.
    • ?dput will dump data you are working from so others can load it easily.
    • sessionInfo() will give details of your setup that others may need for debugging.

+Data structures +


Individual values in R must be one of 5 data types; multiple values can be grouped in data structures.


Data types

  • typeof(object) gives information about an item’s data type.

  • There are 5 main data types:

    • ?numeric real (decimal) numbers
    • ?integer whole numbers only
    • ?character text
    • ?complex complex numbers
    • ?logical TRUE or FALSE values

    Special types:

    • ?NA missing values
    • ?NaN “not a number” for undefined values (e.g. 0/0).
    • ?Inf, -Inf infinity.
    • ?NULL a data structure that doesn’t exist

    NA can occur in any atomic vector. NaN and Inf can only occur in numeric (double) or complex vectors. Atomic vectors are the building blocks for all other data structures. A NULL value will occur in place of an entire data structure (but can occur as list elements).
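The types and special values listed above can be explored with a short sketch. It uses only base R functions, and the variable names are illustrative.

```r
# The five main data types, as reported by typeof().
main_types <- c(
  typeof(1.5),     # real (decimal) number, reported as "double"
  typeof(2L),      # whole number
  typeof("text"),  # text
  typeof(1 + 2i),  # complex number
  typeof(TRUE)     # logical
)
print(main_types)

# Special values.
missing_in_integer <- c(1L, NA)   # NA can sit in an integer vector
undefined <- 0 / 0                # NaN
infinite  <- c(1 / 0, -1 / 0)     # Inf and -Inf

# Combining an integer with NaN is stored as double.
mixed <- c(1L, NaN)
print(c(typeof(missing_in_integer), typeof(mixed)))

# NULL stands for a missing structure, but can be stored as a list element.
holder <- list(NULL)
```

Here typeof() is used as the reference for what R actually stores, which can differ from how a value is printed.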

Basic data structures in R:

  • atomic ?vector (can only contain one type)
  • ?list (containers for other objects)
  • ?data.frame two dimensional objects whose columns can contain different types of data
  • ?matrix two dimensional objects that can contain only one type of data.
  • ?factor vectors that contain predefined categorical data.
  • ?array multi-dimensional objects that can only contain one type of data

Remember that matrices are really atomic vectors underneath the hood, +and that data.frames are really lists underneath the hood (this explains +some of the weirder behaviour of R).
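A quick way to see this for yourself is sketched below, using small made-up objects and base R only.

```r
# A matrix is an atomic vector with a dim attribute.
m <- matrix(1:6, nrow = 2)
matrix_is_atomic <- is.atomic(m)
matrix_length    <- length(m)   # counts every element, not rows or columns
fourth_element   <- m[4]        # vector-style indexing, column by column

# A data frame is a list of equal-length columns.
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
df_is_list    <- is.list(df)
df_length     <- length(df)     # number of columns, as for any list
second_column <- df[[2]]        # list-style extraction returns the column vector

print(c(matrix_length = matrix_length, df_length = df_length))
```

This is why length() of a data frame gives the number of columns, and why a single index into a matrix walks down the columns.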



  • ?vector() All items in a vector must be the same type.
  • Items can be converted from one type to another using coercion.
  • The concatenate function ‘c()’ will append items to a vector.
  • seq(from=0, to=1, by=1) will create a sequence of numbers.
  • Items in a vector can be named using the names() function.


  • ?factor() Factors are a data structure designed to store categorical data.
  • levels() shows the valid values that can be stored in a vector of type factor.


  • ?list() Lists are a data structure designed to store data of different types.


  • +?matrix() Matrices are a data structure designed to +store 2-dimensional data.
  • +

Data +Frames

  • +?data.frame is a key data structure. It is a +list of vectors.
  • +
  • +cbind() will add a column (vector) to a +data.frame.
  • +
  • +rbind() will add a row (list) to a data.frame.
  • +
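Adding a column and a row might look like the sketch below. The data frame contents are invented, and `stringsAsFactors = FALSE` is set explicitly so the example behaves the same on R versions before 4.0.

```r
df <- data.frame(id = 1:2, name = c("ana", "ben"),
                 stringsAsFactors = FALSE)

# cbind() adds a column; the vector needs one value per row
df <- cbind(df, age = c(31, 45))

# rbind() adds a row; a list lets each value keep its own type
df <- rbind(df, list(3L, "cleo", 28))

dim(df)   # 3 3
```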

Useful functions for querying data structures:

  • +?str structure, prints out a summary of the whole data +structure
  • +
  • +?typeof tells you the type inside an atomic vector
  • +
  • +?class what is the data structure?
  • +
  • +?head print the first n elements (rows for +two-dimensional objects)
  • +
  • +?tail print the last n elements (rows for +two-dimensional objects)
  • +
  • +?rownames, ?colnames, +?dimnames retrieve or modify the row names and column names +of an object.
  • +
  • +?names retrieve or modify the names of an atomic vector +or list (or columns of a data.frame).
  • +
  • +?length get the number of elements in an atomic +vector
  • +
  • ?nrow, ?ncol, ?dim get the dimensions of an n-dimensional object (these return NULL for atomic vectors and lists).
  • +
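The query functions above can be tried on a small data frame. This is a sketch with made-up values; `df` is a hypothetical name.

```r
df <- data.frame(id = 1:3, score = c(9.5, 7.2, 8.8))

str(df)            # compact summary of the whole structure
class(df)          # "data.frame"
typeof(df$score)   # "double"
head(df, n = 2)    # first two rows
tail(df, n = 1)    # last row
names(df)          # "id" "score"
length(df$id)      # 3
dim(df)            # 3 2
nrow(df)           # 3
ncol(df)           # 2
dim(df$id)         # NULL, because an atomic vector has no dimensions
```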

+Exploring Data +Frames +

  • +read.csv to read in data in a regular structure +
    • +sep argument to specify the separator +
      • “,” for comma separated
      • +
      • “\t” for tab separated
      • +
    • +
    • Other arguments: +
      • +header=TRUE if there is a header row
      • +
    • +
  • +
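A self-contained sketch of the `sep` and `header` arguments. To avoid depending on a file on disk, it passes invented text through the `text` argument of `read.csv`; with a real file you would give the file path as the first argument instead.

```r
# Comma-separated input with a header row
csv_text <- "fruit,count\napple,3\npear,5"
fruit <- read.csv(text = csv_text, header = TRUE, sep = ",")

# The same data, tab-separated
tsv_text <- "fruit\tcount\napple\t3\npear\t5"
fruit_tab <- read.csv(text = tsv_text, header = TRUE, sep = "\t")

str(fruit)
```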

+Subsetting data +

  • +

    Elements can be accessed by:

    • Index
    • +
    • Name
    • +
    • Logical vectors
    • +
  • +
  • +

    [ single square brackets:

    • +extract single elements or subset vectors
    • +
    • e.g. x[1] extracts the first item from vector x.
    • +
    • +extract single elements of a list. The returned value will +be another list().
    • +
    • +extract columns from a data.frame
    • +
  • +
  • +

    [ with two arguments to:

    • +extract rows and/or columns of +
      • matrices
      • +
      • data.frames
      • +
      • e.g. x[1,2] will extract the value in row 1, column +2.
      • +
      • e.g. x[2, ] will extract the entire second row of values, and x[, 2] the entire second column.
      • +
    • +
  • +
  • [[ double square brackets to extract items from +lists.

  • +
  • $ to access columns or list elements by +name

  • +
  • negative indices skip elements

  • +
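The subsetting forms in this section can be sketched with a named vector, a list and a matrix. All names and values are illustrative.

```r
x <- c(a = 10, b = 20, c = 30)
x[1]         # by index: first item
x["b"]       # by name
x[x > 15]    # by logical vector: items greater than 15
x[-1]        # negative index: everything except the first item

lst <- list(n = 1:3, s = "text")
lst[1]       # single brackets return a list of length 1
lst[[1]]     # double brackets return the item itself: 1 2 3
lst$s        # access by name: "text"

m <- matrix(1:6, nrow = 2)
m[1, 2]      # row 1, column 2: 3
m[2, ]       # entire second row: 2 4 6
m[, 2]       # entire second column: 3 4
```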

+Control flow +

  • Use if condition to start a conditional statement, +else if condition to provide additional tests, and +else to provide a default
  • +
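  • The bodies of the branches of conditional statements must be enclosed in curly braces when they span more than one statement; indentation is not required by R but makes them easier to read.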
  • +
  • Use == to test for equality.
  • +
  • +%in% will return a TRUE/FALSE +indicating if there is a match between an element and a vector.
  • +
  • +X && Y is only true if both X and Y are +TRUE.
  • +
  • +X || Y is true if either X or Y, or both, are +TRUE.
  • +
  • Zero is considered FALSE; all other numbers are +considered TRUE +
  • +
  • Nest loops to operate on multi-dimensional data.
  • +
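A small sketch of the control-flow constructs above. The function name `classify` and the test values are hypothetical.

```r
# if / else if / else
classify <- function(x) {
  if (x > 0) {
    "positive"
  } else if (x < 0) {
    "negative"
  } else {
    "zero"
  }
}
classify(5)    # "positive"

# %in% tests membership; && and || combine single logical values
3 %in% c(1, 2, 3)     # TRUE
TRUE && FALSE         # FALSE
TRUE || FALSE         # TRUE

# Zero is treated as FALSE in a condition
zero_test <- if (0) "yes" else "no"   # "no"

# Nested loops visit every cell of a matrix
m <- matrix(1:4, nrow = 2)
total <- 0
for (i in seq_len(nrow(m))) {
  for (j in seq_len(ncol(m))) {
    total <- total + m[i, j]
  }
}
total          # 10
```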

+Creating publication quality +graphics +

  • figures can be created with the grammar of graphics: +
    • library(ggplot2)
    • +
    • +ggplot to create the base figure
    • +
    • +aesthetics specify the data axes, shape, color, and +data size
    • +
    • +geometry functions specify the type of plot, +e.g. point, line, density, +box +
    • +
    • +geometry functions also add statistical transforms, +e.g. geom_smooth +
    • +
    • +scale functions change the mapping from data to +aesthetics
    • +
    • +facet functions stratify the figure into panels
    • +
    • +aesthetics apply to individual layers, or can be set +for the whole plot inside ggplot.
    • +
    • +theme functions change the overall look of the +plot
    • +
    • order of layers matters!
    • +
    • +ggsave to save a figure.
    • +
  • +

+Vectorization +

  • Most functions and operations apply to each element of a vector
  • +
  • +* applies element-wise to matrices
  • +
  • +%*% for true matrix multiplication
  • +
  • +any() will return TRUE if any element of a +vector is TRUE +
  • +
  • +all() will return TRUE if all +elements of a vector are TRUE +
  • +
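The difference between element-wise and matrix multiplication, plus `any()` and `all()`, in a minimal sketch with arbitrary values:

```r
x <- 1:4
x * 2          # applied to each element: 2 4 6 8

m <- matrix(1:4, nrow = 2)
elementwise <- m * m     # each cell squared: 1 4 / 9 16 by column
product <- m %*% m       # true matrix product

any(x > 3)     # TRUE, at least one element is greater than 3
all(x > 0)     # TRUE, every element is greater than 0
```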

+Functions explained +

  • ?"function"
  • +
  • Put code whose parameters change frequently in a function, then call +it with different parameter values to customize its behavior.
  • +
  • The last line of a function is returned, or you can use +return explicitly
  • +
  • Any code written in the body of the function will first look for variables defined inside the function, and only then in the environment where the function was defined.
  • +
  • Document Why, then What, then lastly How (if the code isn’t self +explanatory)
  • +
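The points above can be illustrated with a short function. `fahr_to_kelvin` is an illustrative name chosen for this sketch, and the `stopifnot()` check is one possible way to validate the argument.

```r
# Why: temperatures arrive in Fahrenheit but the analysis needs Kelvin.
# What: convert a numeric vector from Fahrenheit to Kelvin.
fahr_to_kelvin <- function(temp) {
  stopifnot(is.numeric(temp))                  # fail fast on bad input
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15   # local variable
  kelvin                                       # the last line is returned
}

freezing <- fahr_to_kelvin(32)    # about 273.15
boiling <- fahr_to_kelvin(212)    # about 373.15

# kelvin was defined inside the function, so it is not visible here
exists("kelvin", inherits = FALSE)
```

Writing `return(kelvin)` instead of the bare last line would behave the same way.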

+Writing data +

  • +write.table to write out objects in regular format
  • +
  • set quote=FALSE so that text isn’t wrapped in +" marks
  • +
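A sketch of `write.table` with `quote = FALSE`, writing to a temporary file so nothing is left behind. The data are invented, and `row.names = FALSE` is an extra argument added here to keep row numbers out of the file.

```r
df <- data.frame(fruit = c("apple", "pear"), count = c(3, 5))

out <- tempfile(fileext = ".csv")
write.table(df, file = out, sep = ",", quote = FALSE, row.names = FALSE)

# Inspect what was written
written <- readLines(out)
written
```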

+Split-apply-combine +

  • Use the xxply family of functions to apply functions to +groups within some data.
  • +
  • The first letter (a for array, d for data.frame, l for list) corresponds to the input data structure.
  • +
  • the second letter denotes the output data structure
  • +
  • Anonymous functions (those not assigned a name) are used inside the +plyr family of functions on groups within data.
  • +

+Dataframe manipulation with dplyr +

  • library(dplyr)
  • +
  • +?select to extract variables by name.
  • +
  • +?filter return rows with matching conditions.
  • +
  • ?group_by group data by one or more variables.
  • +
  • +?summarize summarize multiple values to a single +value.
  • +
  • +?mutate add new variables to a data.frame.
  • +
  • Combine operations using the ?"%>%" pipe +operator.
  • +

+Dataframe manipulation with tidyr +

  • library(tidyr)
  • +
  • +?pivot_longer convert data from wide to +long format.
  • +
  • +?pivot_wider convert data from long to +wide format.
  • +
  • +?separate split a single value into multiple +values.
  • +
  • +?unite merge multiple values into a single value.
  • +

+Producing reports with +knitr +

  • Value of reproducible reports
  • +
  • Basics of Markdown
  • +
  • R code chunks
  • +
  • Chunk options
  • +
  • Inline R code
  • +
  • Other output formats
  • +

+Best practices for writing good +code +

  • Program defensively, i.e., assume that errors are going to arise, +and write code to detect them when they do.
  • +
  • Write tests before writing code in order to help determine exactly +what that code is supposed to do.
  • +
  • Know what code is supposed to do before trying to debug it.
  • +
  • Make it fail every time.
  • +
  • Make it fail fast.
  • +
  • Change one thing at a time, and for a reason.
  • +
  • Keep track of what you’ve done.
  • +
  • Be humble
  • +

Glossary +

argument
A value given to a function or program when it runs. The term is often used interchangeably (and inconsistently) with parameter.
assign
To give a value a name by associating a variable with it.
body
(of a function): the statements that are executed when a function runs.
comment
A remark in a program that is intended to help human readers understand what is going on, but is ignored by the computer. Comments in Python, R, and the Unix shell start with a # character and run to the end of the line; comments in SQL start with --, and other languages have other conventions.
comma-separated values
(CSV) A common textual representation for tables in which the values in each row are separated by commas.
delimiter
A character or characters used to separate individual values, such as the commas between columns in a CSV file.
documentation
Human-language text written to explain what software does, how it works, or how to use it.
floating-point number
A number containing a fractional part and an exponent. See also: integer.
for loop
A loop that is executed once for each value in some kind of set, list, or range. See also: while loop.
index
A subscript that specifies the location of a single value in a collection, such as a single pixel in an image.
integer
A whole number, such as -12343. See also: floating-point number.
library
In R, the directory(ies) where packages are stored.
package
A collection of R functions, data and compiled code in a well-defined format. Packages are stored in a library and loaded using the library() function.
parameter
A variable named in the function’s declaration that is used to hold a value passed into the call. The term is often used interchangeably (and inconsistently) with argument.
return statement
A statement that causes a function to stop executing and return a value to its caller immediately.
sequence
A collection of information that is presented in a specific order.
shape
An array’s dimensions, represented as a vector. For example, a 5×3 array’s shape is (5,3).
string
Short for “character string”, a sequence of zero or more characters.
syntax error
A programming error that occurs when statements are in an order or contain characters not expected by the programming language.
type
The classification of something in a program (for example, the contents of a variable) as a kind of number (e.g. floating-point, integer), string, or something else. In R the command typeof() is used to query a variable’s type.
while loop
A loop that keeps executing as long as some condition is true. See also: for loop.
+ + +
+ + + diff --git a/key-points.html b/key-points.html new file mode 100644 index 000000000..bcbd8388e --- /dev/null +++ b/key-points.html @@ -0,0 +1,623 @@ + + + + + +R for Reproducible Scientific Analysis: Key Points + + + + + + + + + + + +
+ R for Reproducible Scientific Analysis +
+ +
+ + + + + + +
+ + +

Introduction to R and RStudio

  • Use RStudio to write and run R programs.
  • +
  • R has the usual arithmetic operators and mathematical +functions.
  • +
  • Use <- to assign values to variables.
  • +
  • Use ls() to list the variables in a program.
  • +
  • Use rm() to delete objects in a program.
  • +
  • Use install.packages() to install packages +(libraries).
  • +

Project Management With RStudio

  • Use RStudio to create and manage projects with consistent +layout.
  • +
  • Treat raw data as read-only.
  • +
  • Treat generated output as disposable.
  • +
  • Separate function definition and application.
  • +

Seeking Help

  • Use help() to get online help in R.
  • +

Data Structures

  • Use read.csv to read tabular data in R.
  • +
  • The basic data types in R are double, integer, complex, logical, and +character.
  • +
  • Data structures such as data frames or matrices are built on top of +lists and vectors, with some added attributes.
  • +

Exploring Data Frames

  • Use cbind() to add a new column to a data frame.
  • +
  • Use rbind() to add a new row to a data frame.
  • +
  • Remove rows from a data frame.
  • +
  • Use str(), summary(), nrow(), +ncol(), dim(), colnames(), +rownames(), head(), and typeof() +to understand the structure of a data frame.
  • +
  • Read in a csv file using read.csv().
  • +
  • Understand what length() of a data frame +represents.
  • +

Subsetting Data

  • Indexing in R starts at 1, not 0.
  • +
  • Access individual values by location using [].
  • +
  • Access slices of data using [low:high].
  • +
  • Access arbitrary sets of data using [c(...)].
  • +
  • Use logical operations and logical vectors to access subsets of +data.
  • +

Control Flow

  • Use if and else to make choices.
  • +
  • Use for to repeat operations.
  • +

Creating Publication-Quality Graphics with ggplot2

  • Use ggplot2 to create plots.
  • +
  • Think about graphics in layers: aesthetics, geometry, statistics, +scale transformation, and grouping.
  • +


  • Use vectorized operations instead of loops.
  • +

Functions Explained

  • Use function to define a new function in R.
  • +
  • Use parameters to pass values into functions.
  • +
  • Use stopifnot() to flexibly check function arguments in +R.
  • +
  • Load functions into programs using source().
  • +

Writing Data

  • Save plots from RStudio using the ‘Export’ button.
  • +
  • Use write.table to save tabular data.
  • +

Splitting and Combining Data Frames with plyr

  • Use the plyr package to split data, apply functions to +subsets, and combine the results.
  • +

Data Frame Manipulation with dplyr

  • Use the dplyr package to manipulate data frames.
  • +
  • Use select() to choose variables from a data +frame.
  • +
  • Use filter() to choose data based on values.
  • +
  • Use group_by() and summarize() to work +with subsets of data.
  • +
  • Use mutate() to create new variables.
  • +

Data Frame Manipulation with tidyr

  • Use the tidyr package to change the layout of data +frames.
  • +
  • Use pivot_longer() to go from wide to longer +layout.
  • +
  • Use pivot_wider() to go from long to wider layout.
  • +

Producing Reports With knitr

  • Mix reporting written in R Markdown with software written in R.
  • +
  • Specify chunk options to control formatting.
  • +
  • Use knitr to convert these documents into PDF and other +formats.
  • +

Writing Good Software

  • Keep your project folder structured, organized and tidy.
  • +
  • Document what and why, not how.
  • +
  • Break programs into short single-purpose functions.
  • +
  • Write re-runnable tests.
  • +
  • Don’t repeat yourself.
  • +
  • Be consistent in naming, indentation, and other aspects of +style.
  • +
+ + +
+ + +
+ + + + + diff --git a/link.svg b/link.svg new file mode 100644 index 000000000..88ad82769 --- /dev/null +++ b/link.svg @@ -0,0 +1,12 @@ + + + + + + diff --git a/md5sum.txt b/md5sum.txt new file mode 100644 index 000000000..8f9f0c33a --- /dev/null +++ b/md5sum.txt @@ -0,0 +1,27 @@ +"file" "checksum" "built" "date" +"CODE_OF_CONDUCT.md" "c93c83c630db2fe2462240bf72552548" "site/built/CODE_OF_CONDUCT.md" "2023-10-26" +"LICENSE.md" "b24ebbb41b14ca25cf6b8216dda83e5f" "site/built/LICENSE.md" "2023-10-26" +"config.yaml" "810028d39c377c82aef9239cb1ec0dd3" "site/built/config.yaml" "2023-10-26" +"index.md" "86c8fb559b13d1695d55b52dd6cbf574" "site/built/index.md" "2023-10-26" +"episodes/01-rstudio-intro.Rmd" "5e73c9f0c60d736ea458abe379ecef68" "site/built/01-rstudio-intro.md" "2023-10-26" +"episodes/02-project-intro.Rmd" "94e7911ebdd59fbc30de86ed1d84d4df" "site/built/02-project-intro.md" "2023-10-26" +"episodes/03-seeking-help.Rmd" "d24c310b8f36930e70379458f3c93461" "site/built/03-seeking-help.md" "2023-10-26" +"episodes/04-data-structures-part1.Rmd" "5ec938f71a9cec633cef9329d214c3a0" "site/built/04-data-structures-part1.md" "2023-10-26" +"episodes/05-data-structures-part2.Rmd" "de6c6ee224fa7201674d87844c9ede02" "site/built/05-data-structures-part2.md" "2023-10-26" +"episodes/06-data-subsetting.Rmd" "5d4ce8731ab37ddea81874d63ae1ce86" "site/built/06-data-subsetting.md" "2023-10-26" +"episodes/07-control-flow.Rmd" "6a8691c8668737e4202f49b52aeb8ac6" "site/built/07-control-flow.md" "2023-10-26" +"episodes/08-plot-ggplot2.Rmd" "775bc2b258e11b4af447c7286bca2dd4" "site/built/08-plot-ggplot2.md" "2023-10-26" +"episodes/09-vectorization.Rmd" "e229eb061b3f072a132c4b31bbc2fdb0" "site/built/09-vectorization.md" "2023-10-26" +"episodes/10-functions.Rmd" "14edd4cf50edb8fefeb987a17d740e1a" "site/built/10-functions.md" "2023-10-26" +"episodes/11-writing-data.Rmd" "8b26e062dddd2394d00c6847ff0b7505" "site/built/11-writing-data.md" "2023-10-26" +"episodes/12-plyr.Rmd" 
"909597e71c188c682b5039036b4e95cf" "site/built/12-plyr.md" "2023-10-26" +"episodes/13-dplyr.Rmd" "3ad3687a1c860ddcf30ddcbb375153fb" "site/built/13-dplyr.md" "2023-10-26" +"episodes/14-tidyr.Rmd" "6ceb2a517a291c565cfbc0f76e2fb567" "site/built/14-tidyr.md" "2023-10-26" +"episodes/15-knitr-markdown.Rmd" "65188e4a8eaf3d04c6284db65c48c83e" "site/built/15-knitr-markdown.md" "2023-10-26" +"episodes/16-wrap-up.Rmd" "c5ce0d34a37b7a99624ad1d6ac482256" "site/built/16-wrap-up.md" "2023-10-26" +"instructors/instructor-notes.md" "5ce85301c3e8d78b4b8682ae8e6bb7ff" "site/built/instructor-notes.md" "2023-10-26" +"learners/discuss.md" "42ad66ab1907e030914dbb2a94376a47" "site/built/discuss.md" "2023-10-26" +"learners/reference.md" "b606f57847b81651e8102925ff3d19c1" "site/built/reference.md" "2023-10-26" +"learners/setup.md" "f888f8a54b071715c0cf56896e650c00" "site/built/setup.md" "2023-10-26" +"profiles/learner-profiles.md" "60b93493cf1da06dfd63255d73854461" "site/built/learner-profiles.md" "2023-10-26" +"renv/profiles/lesson-requirements/renv.lock" "d0863f3009013edce68caa0b832b8754" "site/built/renv.lock" "2023-10-26" diff --git a/mstile-150x150.png b/mstile-150x150.png new file mode 100644 index 000000000..8136f75e7 Binary files /dev/null and b/mstile-150x150.png differ diff --git a/pkgdown.css b/pkgdown.css new file mode 100644 index 000000000..80ea5b838 --- /dev/null +++ b/pkgdown.css @@ -0,0 +1,384 @@ +/* Sticky footer */ + +/** + * Basic idea: https://philipwalton.github.io/solved-by-flexbox/demos/sticky-footer/ + * Details: https://github.com/philipwalton/solved-by-flexbox/blob/master/assets/css/components/site.css + * + * .Site -> body > .container + * .Site-content -> body > .container .row + * .footer -> footer + * + * Key idea seems to be to ensure that .container and __all its parents__ + * have height set to 100% + * + */ + +html, body { + height: 100%; +} + +body { + position: relative; +} + +body > .container { + display: flex; + height: 100%; + flex-direction: column; 
+} + +body > .container .row { + flex: 1 0 auto; +} + +footer { + margin-top: 45px; + padding: 35px 0 36px; + border-top: 1px solid #e5e5e5; + color: #666; + display: flex; + flex-shrink: 0; +} +footer p { + margin-bottom: 0; +} +footer div { + flex: 1; +} +footer .pkgdown { + text-align: right; +} +footer p { + margin-bottom: 0; +} + +img.icon { + float: right; +} + +/* Ensure in-page images don't run outside their container */ +.contents img { + max-width: 100%; + height: auto; +} + +/* Fix bug in bootstrap (only seen in firefox) */ +summary { + display: list-item; +} + +/* Typographic tweaking ---------------------------------*/ + +.contents .page-header { + margin-top: calc(-60px + 1em); +} + +dd { + margin-left: 3em; +} + +/* Section anchors ---------------------------------*/ + +a.anchor { + display: none; + margin-left: 5px; + width: 20px; + height: 20px; + + background-image: url(./link.svg); + background-repeat: no-repeat; + background-size: 20px 20px; + background-position: center center; +} + +h1:hover .anchor, +h2:hover .anchor, +h3:hover .anchor, +h4:hover .anchor, +h5:hover .anchor, +h6:hover .anchor { + display: inline-block; +} + +/* Fixes for fixed navbar --------------------------*/ + +.contents h1, .contents h2, .contents h3, .contents h4 { + padding-top: 60px; + margin-top: -40px; +} + +/* Navbar submenu --------------------------*/ + +.dropdown-submenu { + position: relative; +} + +.dropdown-submenu>.dropdown-menu { + top: 0; + left: 100%; + margin-top: -6px; + margin-left: -1px; + border-radius: 0 6px 6px 6px; +} + +.dropdown-submenu:hover>.dropdown-menu { + display: block; +} + +.dropdown-submenu>a:after { + display: block; + content: " "; + float: right; + width: 0; + height: 0; + border-color: transparent; + border-style: solid; + border-width: 5px 0 5px 5px; + border-left-color: #cccccc; + margin-top: 5px; + margin-right: -10px; +} + +.dropdown-submenu:hover>a:after { + border-left-color: #ffffff; +} + +.dropdown-submenu.pull-left { + 
float: none; +} + +.dropdown-submenu.pull-left>.dropdown-menu { + left: -100%; + margin-left: 10px; + border-radius: 6px 0 6px 6px; +} + +/* Sidebar --------------------------*/ + +#pkgdown-sidebar { + margin-top: 30px; + position: -webkit-sticky; + position: sticky; + top: 70px; +} + +#pkgdown-sidebar h2 { + font-size: 1.5em; + margin-top: 1em; +} + +#pkgdown-sidebar h2:first-child { + margin-top: 0; +} + +#pkgdown-sidebar .list-unstyled li { + margin-bottom: 0.5em; +} + +/* bootstrap-toc tweaks ------------------------------------------------------*/ + +/* All levels of nav */ + +nav[data-toggle='toc'] .nav > li > a { + padding: 4px 20px 4px 6px; + font-size: 1.5rem; + font-weight: 400; + color: inherit; +} + +nav[data-toggle='toc'] .nav > li > a:hover, +nav[data-toggle='toc'] .nav > li > a:focus { + padding-left: 5px; + color: inherit; + border-left: 1px solid #878787; +} + +nav[data-toggle='toc'] .nav > .active > a, +nav[data-toggle='toc'] .nav > .active:hover > a, +nav[data-toggle='toc'] .nav > .active:focus > a { + padding-left: 5px; + font-size: 1.5rem; + font-weight: 400; + color: inherit; + border-left: 2px solid #878787; +} + +/* Nav: second level (shown on .active) */ + +nav[data-toggle='toc'] .nav .nav { + display: none; /* Hide by default, but at >768px, show it */ + padding-bottom: 10px; +} + +nav[data-toggle='toc'] .nav .nav > li > a { + padding-left: 16px; + font-size: 1.35rem; +} + +nav[data-toggle='toc'] .nav .nav > li > a:hover, +nav[data-toggle='toc'] .nav .nav > li > a:focus { + padding-left: 15px; +} + +nav[data-toggle='toc'] .nav .nav > .active > a, +nav[data-toggle='toc'] .nav .nav > .active:hover > a, +nav[data-toggle='toc'] .nav .nav > .active:focus > a { + padding-left: 15px; + font-weight: 500; + font-size: 1.35rem; +} + +/* orcid ------------------------------------------------------------------- */ + +.orcid { + font-size: 16px; + color: #A6CE39; + /* margins are required by official ORCID trademark and display guidelines */ + 
margin-left:4px; + margin-right:4px; + vertical-align: middle; +} + +/* Reference index & topics ----------------------------------------------- */ + +.ref-index th {font-weight: normal;} + +.ref-index td {vertical-align: top; min-width: 100px} +.ref-index .icon {width: 40px;} +.ref-index .alias {width: 40%;} +.ref-index-icons .alias {width: calc(40% - 40px);} +.ref-index .title {width: 60%;} + +.ref-arguments th {text-align: right; padding-right: 10px;} +.ref-arguments th, .ref-arguments td {vertical-align: top; min-width: 100px} +.ref-arguments .name {width: 20%;} +.ref-arguments .desc {width: 80%;} + +/* Nice scrolling for wide elements --------------------------------------- */ + +table { + display: block; + overflow: auto; +} + +/* Syntax highlighting ---------------------------------------------------- */ + +pre, code, pre code { + background-color: #f8f8f8; + color: #333; +} +pre, pre code { + white-space: pre-wrap; + word-break: break-all; + overflow-wrap: break-word; +} + +pre { + border: 1px solid #eee; +} + +pre .img, pre .r-plt { + margin: 5px 0; +} + +pre .img img, pre .r-plt img { + background-color: #fff; +} + +code a, pre a { + color: #375f84; +} + +a.sourceLine:hover { + text-decoration: none; +} + +.fl {color: #1514b5;} +.fu {color: #000000;} /* function */ +.ch,.st {color: #036a07;} /* string */ +.kw {color: #264D66;} /* keyword */ +.co {color: #888888;} /* comment */ + +.error {font-weight: bolder;} +.warning {font-weight: bolder;} + +/* Clipboard --------------------------*/ + +.hasCopyButton { + position: relative; +} + +.btn-copy-ex { + position: absolute; + right: 0; + top: 0; + visibility: hidden; +} + +.hasCopyButton:hover button.btn-copy-ex { + visibility: visible; +} + +/* headroom.js ------------------------ */ + +.headroom { + will-change: transform; + transition: transform 200ms linear; +} +.headroom--pinned { + transform: translateY(0%); +} +.headroom--unpinned { + transform: translateY(-100%); +} + +/* mark.js 
----------------------------*/ + +mark { + background-color: rgba(255, 255, 51, 0.5); + border-bottom: 2px solid rgba(255, 153, 51, 0.3); + padding: 1px; +} + +/* vertical spacing after htmlwidgets */ +.html-widget { + margin-bottom: 10px; +} + +/* fontawesome ------------------------ */ + +.fab { + font-family: "Font Awesome 5 Brands" !important; +} + +/* don't display links in code chunks when printing */ +/* source: https://stackoverflow.com/a/10781533 */ +@media print { + code a:link:after, code a:visited:after { + content: ""; + } +} + +/* Section anchors --------------------------------- + Added in pandoc 2.11: https://github.com/jgm/pandoc-templates/commit/9904bf71 +*/ + +div.csl-bib-body { } +div.csl-entry { + clear: both; +} +.hanging-indent div.csl-entry { + margin-left:2em; + text-indent:-2em; +} +div.csl-left-margin { + min-width:2em; + float:left; +} +div.csl-right-inline { + margin-left:2em; + padding-left:1em; +} +div.csl-indent { + margin-left: 2em; +} diff --git a/pkgdown.js b/pkgdown.js new file mode 100644 index 000000000..6f0eee40b --- /dev/null +++ b/pkgdown.js @@ -0,0 +1,108 @@ +/* http://gregfranko.com/blog/jquery-best-practices/ */ +(function($) { + $(function() { + + $('.navbar-fixed-top').headroom(); + + $('body').css('padding-top', $('.navbar').height() + 10); + $(window).resize(function(){ + $('body').css('padding-top', $('.navbar').height() + 10); + }); + + $('[data-toggle="tooltip"]').tooltip(); + + var cur_path = paths(location.pathname); + var links = $("#navbar ul li a"); + var max_length = -1; + var pos = -1; + for (var i = 0; i < links.length; i++) { + if (links[i].getAttribute("href") === "#") + continue; + // Ignore external links + if (links[i].host !== location.host) + continue; + + var nav_path = paths(links[i].pathname); + + var length = prefix_length(nav_path, cur_path); + if (length > max_length) { + max_length = length; + pos = i; + } + } + + // Add class to parent
  • , and enclosing
  • if in dropdown + if (pos >= 0) { + var menu_anchor = $(links[pos]); + menu_anchor.parent().addClass("active"); + menu_anchor.closest("li.dropdown").addClass("active"); + } + }); + + function paths(pathname) { + var pieces = pathname.split("/"); + pieces.shift(); // always starts with / + + var end = pieces[pieces.length - 1]; + if (end === "index.html" || end === "") + pieces.pop(); + return(pieces); + } + + // Returns -1 if not found + function prefix_length(needle, haystack) { + if (needle.length > haystack.length) + return(-1); + + // Special case for length-0 haystack, since for loop won't run + if (haystack.length === 0) { + return(needle.length === 0 ? 0 : -1); + } + + for (var i = 0; i < haystack.length; i++) { + if (needle[i] != haystack[i]) + return(i); + } + + return(haystack.length); + } + + /* Clipboard --------------------------*/ + + function changeTooltipMessage(element, msg) { + var tooltipOriginalTitle=element.getAttribute('data-original-title'); + element.setAttribute('data-original-title', msg); + $(element).tooltip('show'); + element.setAttribute('data-original-title', tooltipOriginalTitle); + } + + if(ClipboardJS.isSupported()) { + $(document).ready(function() { + var copyButton = ""; + + $("div.sourceCode").addClass("hasCopyButton"); + + // Insert copy buttons: + $(copyButton).prependTo(".hasCopyButton"); + + // Initialize tooltips: + $('.btn-copy-ex').tooltip({container: 'body'}); + + // Initialize clipboard: + var clipboardBtnCopies = new ClipboardJS('[data-clipboard-copy]', { + text: function(trigger) { + return trigger.parentNode.textContent.replace(/\n#>[^\n]*/g, ""); + } + }); + + clipboardBtnCopies.on('success', function(e) { + changeTooltipMessage(e.trigger, 'Copied!'); + e.clearSelection(); + }); + + clipboardBtnCopies.on('error', function() { + changeTooltipMessage(e.trigger,'Press Ctrl+C or Command+C to copy'); + }); + }); + } +})(window.jQuery || window.$) diff --git a/pkgdown.yml b/pkgdown.yml new file mode 100644 index 
000000000..7c519fc7c --- /dev/null +++ b/pkgdown.yml @@ -0,0 +1,6 @@ +pandoc: 2.19.2 +pkgdown: 2.0.7 +pkgdown_sha: ~ +articles: {} +last_built: 2023-10-26T09:56Z + diff --git a/profiles.html b/profiles.html new file mode 100644 index 000000000..0c36fda1b --- /dev/null +++ b/profiles.html @@ -0,0 +1,402 @@ + +R for Reproducible Scientific Analysis: Learner Profiles +
    + R for Reproducible Scientific Analysis +
    + +
    + + + + + +

    Learner Profiles

    + +

    This is a placeholder file. Please add content here.

    + +
    + + +
    + + + diff --git a/reference.html b/reference.html new file mode 100644 index 000000000..3dff36c9e --- /dev/null +++ b/reference.html @@ -0,0 +1,962 @@ + +R for Reproducible Scientific Analysis: Reference +
    + R for Reproducible Scientific Analysis +
    + +
    + + + + + +



    Last updated on 2023-10-26

    + + + +
    + +
    + + + +

    Reference +


    +Introduction to R and +RStudio +

    • Use the escape key to cancel incomplete commands or running code (or Ctrl+C if you’re using R from the shell).
    • +
    • Basic arithmetic operations follow standard order of precedence: +
      • Brackets: (, ) +
      • +
      • Exponents: ^ or ** +
      • +
      • Divide: / +
      • +
      • Multiply: * +
      • +
      • Add: + +
      • +
      • Subtract: - +
      • +
    • +
    • Scientific notation is available, e.g: 2e-3 +
    • +
    • Anything to the right of a # is a comment, R will +ignore this!
    • +
    • Functions are denoted by function_name(). Expressions +inside the brackets are evaluated before being passed to the function, +and functions can be nested.
    • +
    • Mathematical functions: exp, sin, +log, log10, log2 etc.
    • +
    • Comparison operators: <, <=, +>, >=, ==, +!= +
    • +
    • Use all.equal to compare numbers!
    • +
    • <- is the assignment operator. Anything to the right is evaluated, then stored in a variable named to the left.
    • +
    • +ls lists all variables and functions you’ve +created
    • +
    • +rm can be used to remove them
    • +
    • When assigning values to function arguments, you must use +=.
    • +
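A few of the points above in runnable form. This is a minimal sketch using base R only, and the variable name `x` is arbitrary.

```r
# Order of precedence: brackets, exponents, then * and /, then + and -
3 + 5 * 2 ^ 2     # 23
(3 + 5) * 2       # 16

# Scientific notation
2e-3              # 0.002

# Floating-point numbers rarely compare exactly with ==
0.1 + 0.2 == 0.3                    # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE, equal within a small tolerance

# Assignment, listing and removing variables
x <- 1 / 40
ls()              # includes "x"
rm(x)

# Function arguments are assigned with =
log(8, base = 2)  # 3
```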

    +Project management with +RStudio +

    • To create a new project, go to File -> New Project
    • +
    • Install the packrat package to create self-contained +projects
    • +
    • +install.packages to install packages from CRAN
    • +
    • +library to load a package into R
    • +
    • +packrat::status to check whether all packages +referenced in your scripts have been installed.
    • +

    +Seeking help +

    • To access help for a function type ?function_name or +help(function_name) +
    • +
    • Use quotes for special operators e.g. ?"+" +
    • +
    • Use fuzzy search if you can’t remember a name ‘??search_term’
    • +
    • +CRAN task +views are a good starting point.
    • +
    • +Stack Overflow is a good +place to get help with your code. +
      • +?dput will dump data you are working from so others can +load it easily.
      • +
      • +sessionInfo() will give details of your setup that +others may need for debugging.
      • +
    • +
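When asking for help, `dput()` and `sessionInfo()` can be used roughly as follows. The data frame is invented, and the round trip through `capture.output()` and `parse()` only stands in for someone pasting your output into their own session.

```r
df <- data.frame(id = 1:2, label = c("a", "b"))

# Print a text representation that others can paste into R
dput(df)

# Simulate the recipient rebuilding the object from that text
txt <- paste(capture.output(dput(df)), collapse = "")
df_copy <- eval(parse(text = txt))
all.equal(df, df_copy)

# R version, platform and loaded packages, useful context for debugging
info <- sessionInfo()
info$R.version$version.string
```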

    +Data structures +


    Individual values in R must be one of 5 data types, +multiple values can be grouped in data structures.


    Data types

    • typeof(object) gives information about an item’s data type.

    • +
    • +

      There are 5 main data types:

      • +?numeric real (decimal) numbers
      • +
      • +?integer whole numbers only
      • +
      • +?character text
      • +
      • +?complex complex numbers
      • +
      • +?logical TRUE or FALSE values
      • +

      Special types:

      • +?NA missing values
      • +
      • +?NaN “not a number” for undefined values +(e.g. 0/0).
      • +
      • +?Inf, -Inf infinity.
      • +
      • +?NULL a data structure that doesn’t exist
      • +

      NA can occur in any atomic vector. NaN and Inf can only occur in numeric (double) or complex vectors; integer vectors cannot store them. Atomic vectors are the building blocks for all other data structures. NULL stands in for an entire data structure that does not exist, although it can also appear as a list element.

    • +

    Basic data structures in R:

    • atomic ?vector (can only contain one type)
    • +
    • +?list (containers for other objects)
    • +
    • +?data.frame two dimensional objects whose columns can +contain different types of data
    • +
    • +?matrix two dimensional objects that can contain only +one type of data.
    • +
    • +?factor vectors that contain predefined categorical +data.
    • +
    • +?array multi-dimensional objects that can only contain +one type of data
    • +

    Remember that matrices are really atomic vectors underneath the hood, +and that data.frames are really lists underneath the hood (this explains +some of the weirder behaviour of R).



    • +?vector() All items in a vector must be the same +type.
    • +
    • Items can be converted from one type to another using +coercion.
    • +
    • The concatenate function ‘c()’ will append items to a vector.
    • +
    • +seq(from=0, to=1, by=1) will create a sequence of +numbers.
    • +
    • Items in a vector can be named using the names() +function.
    • +
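    The vector operations above, sketched with arbitrary values:

```r
# Build a vector and append to it with c()
x <- c(1, 2, 3)
x <- c(x, 4)                 # c() returns a new vector, so assign the result

# Mixing types triggers coercion to the most general type
mixed <- c(x, "five")
typeof(mixed)                # "character"

# Explicit coercion with the as.* functions
as.integer(c("1", "2"))      # 1 2

# Sequences
s <- seq(from = 0, to = 1, by = 0.25)

# Naming elements
names(x) <- c("a", "b", "c", "d")
x["b"]
```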


    • +?factor() Factors are a data structure designed to +store categorical data.
    • +
    • +levels() shows the valid values that can be stored in a +vector of type factor.
    • +


    • +?list() Lists are a data structure designed to store +data of different types.
    • +


    • +?matrix() Matrices are a data structure designed to +store 2-dimensional data.
    • +
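    A compact sketch of factors, lists and matrices side by side. All names and values here are illustrative only.

```r
# Factor: categorical data with a fixed set of levels
continents <- factor(c("Asia", "Africa", "Asia"))
levels(continents)           # "Africa" "Asia" (sorted alphabetically by default)
as.integer(continents)       # 2 1 2, the underlying integer codes

# List: elements may have different types
record <- list(name = "Senegal", pop = 8.5, landlocked = FALSE)
sapply(record, typeof)

# Matrix: two-dimensional, one type only
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)                       # 2 3
```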

    Data +Frames

    • ?data.frame is a key data structure. It is a list of vectors of equal length.
    • +
    • +cbind() will add a column (vector) to a +data.frame.
    • +
    • +rbind() will add a row (list) to a data.frame.
    • +
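    Growing a data.frame by column and by row might look like the sketch below. The data frame and its values are made up; the new row is passed as a list because its elements have different types.

```r
df <- data.frame(country = c("Senegal", "Japan"), pop = c(8.5, 127.1))

# Add a column: the new vector needs one value per existing row
df <- cbind(df, continent = c("Africa", "Asia"))

# Add a row: a list with one value per column, in column order
df <- rbind(df, list("Norway", 5.4, "Europe"))

dim(df)   # 3 rows, 3 columns
```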

    Useful functions for querying data structures:

    • +?str structure, prints out a summary of the whole data +structure
    • +
    • +?typeof tells you the type inside an atomic vector
    • +
    • +?class what is the data structure?
    • +
    • +?head print the first n elements (rows for +two-dimensional objects)
    • +
    • +?tail print the last n elements (rows for +two-dimensional objects)
    • +
    • +?rownames, ?colnames, +?dimnames retrieve or modify the row names and column names +of an object.
    • +
    • +?names retrieve or modify the names of an atomic vector +or list (or columns of a data.frame).
    • +
    • +?length get the number of elements in an atomic +vector
    • +
    • ?nrow, ?ncol, ?dim get the dimensions of an n-dimensional object (these return NULL for plain atomic vectors and lists, which have no dimensions).
    • +
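    The query functions above can be tried on any object. A minimal sketch on a hypothetical ten-row data frame:

```r
df <- data.frame(id = 1:10, value = seq(0.5, 5, by = 0.5))

str(df)              # compact summary of the whole structure
class(df)            # "data.frame"
typeof(df$id)        # "integer"
head(df, n = 3)      # first three rows
tail(df, n = 3)      # last three rows
names(df)            # column names
colnames(df)         # same as names() for a data.frame
nrow(df)             # 10
ncol(df)             # 2
dim(df)              # 10 2
length(df)           # 2: a data.frame is a list of columns

# Plain atomic vectors have no dimensions
dim(1:10)            # NULL
length(1:10)         # 10
```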

    +Exploring Data +Frames +

    • +read.csv to read in data in a regular structure +
      • +sep argument to specify the separator +
        • “,” for comma separated
        • +
        • “\t” for tab separated
        • +
      • +
      • Other arguments: +
        • +header=TRUE if there is a header row
        • +
      • +
    • +
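    To try read.csv without depending on a particular file, the sketch below writes a tiny table with made-up values to a temporary file first. The paths come from tempfile(), so nothing here refers to the lesson's actual data files.

```r
# Write a small comma-separated file with a header row (toy values)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("country,year,pop",
             "A,2007,100",
             "B,2007,200"), csv_path)

# Read it back: header = TRUE uses the first row as column names
gap <- read.csv(csv_path, header = TRUE, sep = ",")
str(gap)

# The same data, tab separated
tsv_path <- tempfile(fileext = ".tsv")
write.table(gap, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)
gap_tab <- read.csv(tsv_path, header = TRUE, sep = "\t")
```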

    +Subsetting data +

    • +

      Elements can be accessed by:

      • Index
      • +
      • Name
      • +
      • Logical vectors
      • +
    • +
    • +

      [ single square brackets:

      • +extract single elements or subset vectors
      • +
      • e.g. x[1] extracts the first item from vector x.
      • +
      • +extract single elements of a list. The returned value will +be another list().
      • +
      • +extract columns from a data.frame
      • +
    • +
    • +

      [ with two arguments to:

      • +extract rows and/or columns of +
        • matrices
        • +
        • data.frames
        • +
        • e.g. x[1,2] will extract the value in row 1, column +2.
        • +
        • e.g. x[2, ] will extract the entire second row of values; x[, 2] will extract the entire second column.
        • +
      • +
    • +
    • [[ double square brackets to extract items from +lists.

    • +
    • $ to access columns or list elements by +name

    • +
    • negative indices skip elements

    • +
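    The subsetting operators above, sketched on small made-up objects (a named vector, a list and a data.frame):

```r
x <- c(a = 5.4, b = 6.2, c = 7.1)
x[1]          # by index
x["b"]        # by name
x[x > 6]      # by logical vector
x[-1]         # negative index: everything except the first element

lst <- list(name = "Senegal", pop = 8.5)
lst[1]        # single brackets: still a list, of length 1
lst[[1]]      # double brackets: the element itself
lst$pop       # by name

df <- data.frame(id = 1:3, score = c(10, 20, 30))
df[1, 2]      # row 1, column 2
df[2, ]       # entire second row (a one-row data.frame)
df[, 2]       # entire second column (a vector)
df["score"]   # single brackets with one argument: a one-column data.frame
df$score      # the column as a vector
```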

    +Control flow +

    • Use if (condition) to start a conditional statement, else if (condition) to provide additional tests, and else to provide a default.
    • +
    • The bodies of the branches of conditional statements are enclosed in curly braces { }; indentation is a readability convention, not a requirement.
    • +
    • Use == to test for equality.
    • +
    • +%in% will return a TRUE/FALSE +indicating if there is a match between an element and a vector.
    • +
    • +X && Y is only true if both X and Y are +TRUE.
    • +
    • +X || Y is true if either X or Y, or both, are +TRUE.
    • +
    • Zero is considered FALSE; all other numbers are +considered TRUE +
    • +
    • Nest loops to operate on multi-dimensional data.
    • +
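    A minimal sketch of the control-flow points above, with arbitrary values. The variable names are illustrative only.

```r
x <- 8

# if / else if / else, with each branch body in curly braces
if (x > 10) {
  size <- "large"
} else if (x > 5) {
  size <- "medium"
} else {
  size <- "small"
}

# Equality, membership and logical operators
x == 8                               # TRUE
"Asia" %in% c("Africa", "Asia")      # TRUE
(x > 5) && (x < 10)                  # TRUE: both sides are TRUE
(x > 10) || (x == 8)                 # TRUE: at least one side is TRUE

# Zero is treated as FALSE, other numbers as TRUE
zero_branch <- if (0) "yes" else "no"

# Nested loops over a two-dimensional structure
m <- matrix(0, nrow = 2, ncol = 3)
for (i in seq_len(nrow(m))) {
  for (j in seq_len(ncol(m))) {
    m[i, j] <- i * j
  }
}
```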

    +Creating publication quality +graphics +

    • figures can be created with the grammar of graphics: +
      • library(ggplot2)
      • +
      • +ggplot to create the base figure
      • +
      • +aesthetics specify the data axes, shape, color, and +data size
      • +
      • +geometry functions specify the type of plot, +e.g. point, line, density, +box +
      • +
      • +geometry functions also add statistical transforms, +e.g. geom_smooth +
      • +
      • +scale functions change the mapping from data to +aesthetics
      • +
      • +facet functions stratify the figure into panels
      • +
      • +aesthetics apply to individual layers, or can be set +for the whole plot inside ggplot.
      • +
      • +theme functions change the overall look of the +plot
      • +
      • order of layers matters!
      • +
      • +ggsave to save a figure.
      • +
    • +
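    The layers listed above could be combined as in the sketch below. It assumes the ggplot2 package is installed, and the `toy` data frame is made up (it is not the gapminder data). The figure is written to a temporary PDF so the example does not touch your project files.

```r
library(ggplot2)

# Made-up data: two countries, three years each
toy <- data.frame(
  year    = rep(c(1990, 2000, 2010), times = 2),
  lifeExp = c(58, 61, 65, 79, 81, 83),
  country = rep(c("A", "B"), each = 3)
)

# Base figure with plot-wide aesthetics, then layers added with +
p <- ggplot(data = toy, mapping = aes(x = year, y = lifeExp, colour = country)) +
  geom_line() +                               # geometry layer
  geom_point() +                              # a second geometry, drawn on top
  scale_y_continuous(limits = c(50, 90)) +    # scale: data-to-aesthetic mapping
  facet_wrap(~ country) +                     # one panel per country
  theme_bw()                                  # overall look

# Save the figure; the file type is taken from the extension
out_file <- file.path(tempdir(), "toy_lifeExp.pdf")
ggsave(filename = out_file, plot = p, width = 6, height = 4)
```

    Because later layers are drawn on top of earlier ones, swapping geom_line() and geom_point() changes which marks end up in front.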

    +Vectorization +

    • Most functions and operations apply to each element of a vector
    • +
    • +* applies element-wise to matrices
    • +
    • +%*% for true matrix multiplication
    • +
    • +any() will return TRUE if any element of a +vector is TRUE +
    • +
    • +all() will return TRUE if all +elements of a vector are TRUE +
    • +
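    A short sketch of vectorized operations, including the difference between * and %*% on a small matrix:

```r
x <- 1:4
x * 2            # 2 4 6 8: applied to every element
x + x            # element-wise addition
x > 2            # FALSE FALSE TRUE TRUE
any(x > 3)       # TRUE
all(x > 0)       # TRUE

m <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)
elementwise <- m * m         # each entry squared
matprod <- m %*% m           # true matrix multiplication
```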

    +Functions explained +

    • ?"function"
    • +
    • Put code whose parameters change frequently in a function, then call +it with different parameter values to customize its behavior.
    • +
    • The last line of a function is returned, or you can use +return explicitly
    • +
    • Code in the body of a function looks first for variables defined inside the function, and only then in the environment where the function was defined.
    • +
    • Document Why, then What, then lastly How (if the code isn't self-explanatory).
    • +
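    The points above about return values and variable lookup can be sketched as follows. The function names are illustrative choices, not something this summary defines.

```r
# Why: temperatures are recorded in Fahrenheit but the analysis needs Kelvin.
# What: converts a numeric vector of Fahrenheit values to Kelvin.
fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)               # explicit return
}

# Without return(), the value of the last expression is returned
kelvin_to_celsius <- function(temp) {
  temp - 273.15
}

# Variables defined inside a function shadow those defined outside it
offset <- 10
add_offset <- function(x) {
  offset <- 1                  # local variable, found first
  x + offset
}

fahr_to_kelvin(32)
kelvin_to_celsius(fahr_to_kelvin(212))
add_offset(5)                  # uses the local offset
offset                         # the outer variable is unchanged
```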

    +Writing data +

    • +write.table to write out objects in regular format
    • +
    • set quote=FALSE so that text isn’t wrapped in +" marks
    • +
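    A minimal sketch of write.table with quote=FALSE, writing made-up data to a temporary file. The row.names=FALSE argument is an extra choice made here to keep row numbers out of the file.

```r
df <- data.frame(country = c("Senegal", "Japan"), score = c(1.5, 2.5))

out <- tempfile(fileext = ".csv")
write.table(df, file = out, sep = ",", quote = FALSE, row.names = FALSE)

# Inspect the raw text that was written
written <- readLines(out)
written
```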

    +Split-apply-combine +

    • Use the xxply family of functions to apply functions to +groups within some data.
    • +
    • The first letter (a for array, d for data.frame, l for list) corresponds to the input data structure.
    • +
    • the second letter denotes the output data structure
    • +
    • Anonymous functions (those not assigned a name) are used inside the +plyr family of functions on groups within data.
    • +
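    As a sketch of the naming scheme, assuming the plyr package is installed: ddply takes a data.frame and returns a data.frame, while dlply takes a data.frame and returns a list. The `toy` data and the anonymous functions are made up for illustration.

```r
library(plyr)

toy <- data.frame(
  continent = c("Asia", "Asia", "Africa"),
  lifeExp   = c(70, 80, 60)
)

# d (data.frame in) + d (data.frame out): mean lifeExp per continent,
# computed by an anonymous function applied to each group
by_continent <- ddply(toy, .(continent), function(x) mean(x$lifeExp))

# d (data.frame in) + l (list out): number of rows per continent
counts <- dlply(toy, .(continent), function(x) nrow(x))
```

    When the anonymous function returns a bare value, ddply stores it in a column named V1.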

    +Dataframe manipulation with dplyr +

    • library(dplyr)
    • +
    • +?select to extract variables by name.
    • +
    • +?filter return rows with matching conditions.
    • +
    • ?group_by group data by one or more variables.
    • +
    • +?summarize summarize multiple values to a single +value.
    • +
    • +?mutate add new variables to a data.frame.
    • +
    • Combine operations using the ?"%>%" pipe +operator.
    • +
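    The verbs above are typically chained with the pipe. A minimal sketch, assuming the dplyr package is installed and using a made-up data frame `toy`:

```r
library(dplyr)

toy <- data.frame(
  country   = c("A", "B", "C", "A"),
  continent = c("Asia", "Asia", "Africa", "Asia"),
  year      = c(2007, 2007, 2007, 2002),
  lifeExp   = c(70, 80, 60, 68)
)

# filter rows, select columns, group, then summarize each group
result <- toy %>%
  filter(year == 2007) %>%
  select(continent, lifeExp) %>%
  group_by(continent) %>%
  summarize(mean_lifeExp = mean(lifeExp))

# mutate adds a new column computed from existing ones
with_months <- toy %>%
  mutate(lifeExp_months = lifeExp * 12)
```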

    +Dataframe manipulation with tidyr +

    • library(tidyr)
    • +
    • +?pivot_longer convert data from wide to +long format.
    • +
    • +?pivot_wider convert data from long to +wide format.
    • +
    • ?separate split a single column into multiple columns.
    • +
    • ?unite merge multiple columns into a single column.
    • +
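    A round trip through the four functions above might look like this sketch. It assumes the tidyr package is installed; the `wide` data frame and the intermediate column names (`var_year`, `variable`, `year`, `value`) are made up for illustration.

```r
library(tidyr)

wide <- data.frame(
  country  = c("A", "B"),
  pop_1990 = c(1, 2),
  pop_2000 = c(3, 4)
)

# wide -> long: one row per country and column
long <- pivot_longer(wide,
                     cols = c(pop_1990, pop_2000),
                     names_to = "var_year",
                     values_to = "value")

# split "pop_1990" into "pop" and "1990"
parts <- separate(long, var_year, into = c("variable", "year"), sep = "_")

# merge the two pieces back into one column
united <- unite(parts, "var_year", variable, year, sep = "_")

# long -> wide again
back <- pivot_wider(united, names_from = var_year, values_from = value)
```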

    +Producing reports with +knitr +

    • Value of reproducible reports
    • +
    • Basics of Markdown
    • +
    • R code chunks
    • +
    • Chunk options
    • +
    • Inline R code
    • +
    • Other output formats
    • +

    +Best practices for writing good +code +

    • Program defensively, i.e., assume that errors are going to arise, +and write code to detect them when they do.
    • +
    • Write tests before writing code in order to help determine exactly +what that code is supposed to do.
    • +
    • Know what code is supposed to do before trying to debug it.
    • +
    • Make it fail every time.
    • +
    • Make it fail fast.
    • +
    • Change one thing at a time, and for a reason.
    • +
    • Keep track of what you’ve done.
    • +
    • Be humble
    • +
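    One way to "program defensively" and "make it fail fast" in base R is to check inputs with stopifnot() at the top of a function. The function below is a hypothetical example, not part of the lesson.

```r
# Mean of strictly positive numbers; fails immediately on bad input
mean_positive <- function(x) {
  stopifnot(is.numeric(x), length(x) > 0, all(x > 0))
  mean(x)
}

ok <- mean_positive(c(1, 2, 3))

# A small test written alongside the code: bad input must raise an error
bad <- tryCatch(
  mean_positive(c(1, -2)),
  error = function(e) "failed fast"
)
```

    Checking inputs at the start keeps the error close to its cause, which makes later debugging much shorter.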

    Glossary +

    argument
    A value given to a function or program when it runs. The term is often used interchangeably (and inconsistently) with parameter.
    assign
    To give a value a name by associating a variable with it.
    body
    (of a function): the statements that are executed when a function runs.
    comment
    A remark in a program that is intended to help human readers understand what is going on, but is ignored by the computer. Comments in Python, R, and the Unix shell start with a # character and run to the end of the line; comments in SQL start with --, and other languages have other conventions.
    comma-separated values
    +(CSV) A common textual representation for tables in which the values in +each row are separated by commas. +
    delimiter
    A character or characters used to separate individual values, such as the commas between columns in a CSV file.
    documentation
    Human-language text written to explain what software does, how it works, or how to use it.
    floating-point number
    +A number containing a fractional part and an exponent. See also: integer. +
    for loop
    +A loop that is executed once for each value in some kind of set, list, +or range. See also: while loop. +
    index
    A subscript that specifies the location of a single value in a collection, such as a single pixel in an image.
    integer
    A whole number, such as -12343. See also: floating-point number.
    library
    In R, the directory(ies) where packages are stored.
    package
    A collection of R functions, data and compiled code in a well-defined format. Packages are stored in a library and loaded using the library() function.
    parameter
    A variable named in the function's declaration that is used to hold a value passed into the call. The term is often used interchangeably (and inconsistently) with argument.
    return statement
    +A statement that causes a function to stop executing and return a value +to its caller immediately. +
    sequence
    A collection of information that is presented in a specific order.
    shape
    An array's dimensions, represented as a vector. For example, a 5×3 array's shape is (5,3).
    string
    Short for "character string", a sequence of zero or more characters.
    syntax error
    +A programming error that occurs when statements are in an order or +contain characters not expected by the programming language. +
    type
    The classification of something in a program (for example, the contents of a variable) as a kind of number (e.g. floating-point, integer), string, or something else. In R the command typeof() is used to query a variable's type.
    while loop
    +A loop that keeps executing as long as some condition is true. See also: +for loop. +
    + + +
    + + + diff --git a/renv.lock b/renv.lock new file mode 100644 index 000000000..e7d643c03 --- /dev/null +++ b/renv.lock @@ -0,0 +1,1085 @@ +{ + "R": { + "Version": "4.3.1", + "Repositories": [ + { + "Name": "carpentries", + "URL": "https://carpentries.r-universe.dev" + }, + { + "Name": "carpentries_archive", + "URL": "https://carpentries.github.io/drat" + }, + { + "Name": "CRAN", + "URL": "https://cran.rstudio.com" + } + ] + }, + "Packages": { + "DiagrammeR": { + "Package": "DiagrammeR", + "Version": "1.0.10", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "RColorBrewer", + "downloader", + "dplyr", + "glue", + "htmltools", + "htmlwidgets", + "igraph", + "magrittr", + "purrr", + "readr", + "rlang", + "rstudioapi", + "scales", + "stringr", + "tibble", + "tidyr", + "viridis", + "visNetwork" + ], + "Hash": "f3de4a4878163a4629a528bbcc6e655d" + }, + "MASS": { + "Package": "MASS", + "Version": "7.3-60", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "graphics", + "methods", + "stats", + "utils" + ], + "Hash": "a56a6365b3fa73293ea8d084be0d9bb0" + }, + "Matrix": { + "Package": "Matrix", + "Version": "1.6-1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "graphics", + "grid", + "lattice", + "methods", + "stats", + "utils" + ], + "Hash": "cb6855ac711958ca734b75e631b2035d" + }, + "R6": { + "Package": "R6", + "Version": "2.5.1", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R" + ], + "Hash": "470851b6d5d0ac559e9d01bb352b4021" + }, + "RColorBrewer": { + "Package": "RColorBrewer", + "Version": "1.1-3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "45f0398006e83a5b10b72a90663d8d8c" + }, + "Rcpp": { + "Package": "Rcpp", + "Version": "1.0.11", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "methods", + "utils" + ], + "Hash": "ae6cbbe1492f4de79c45fce06f967ce8" 
+ }, + "base64enc": { + "Package": "base64enc", + "Version": "0.1-3", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R" + ], + "Hash": "543776ae6848fde2f48ff3816d0628bc" + }, + "bit": { + "Package": "bit", + "Version": "4.0.5", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "d242abec29412ce988848d0294b208fd" + }, + "bit64": { + "Package": "bit64", + "Version": "4.0.5", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "bit", + "methods", + "stats", + "utils" + ], + "Hash": "9fe98599ca456d6552421db0d6772d8f" + }, + "bslib": { + "Package": "bslib", + "Version": "0.5.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "base64enc", + "cachem", + "grDevices", + "htmltools", + "jquerylib", + "jsonlite", + "memoise", + "mime", + "rlang", + "sass" + ], + "Hash": "283015ddfbb9d7bf15ea9f0b5698f0d9" + }, + "cachem": { + "Package": "cachem", + "Version": "1.0.8", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "fastmap", + "rlang" + ], + "Hash": "c35768291560ce302c0a6589f92e837d" + }, + "cli": { + "Package": "cli", + "Version": "3.6.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "utils" + ], + "Hash": "89e6d8219950eac806ae0c489052048a" + }, + "clipr": { + "Package": "clipr", + "Version": "0.8.0", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "utils" + ], + "Hash": "3f038e5ac7f41d4ac41ce658c85e3042" + }, + "colorspace": { + "Package": "colorspace", + "Version": "2.1-0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "graphics", + "methods", + "stats" + ], + "Hash": "f20c47fd52fae58b4e377c37bb8c335b" + }, + "cpp11": { + "Package": "cpp11", + "Version": "0.4.6", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "707fae4bbf73697ec8d85f9d7076c061" + }, + "crayon": { + "Package": 
"crayon", + "Version": "1.5.2", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "grDevices", + "methods", + "utils" + ], + "Hash": "e8a1e41acf02548751f45c718d55aa6a" + }, + "digest": { + "Package": "digest", + "Version": "0.6.33", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "utils" + ], + "Hash": "b18a9cf3c003977b0cc49d5e76ebe48d" + }, + "downloader": { + "Package": "downloader", + "Version": "0.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "digest", + "utils" + ], + "Hash": "f4f2a915e0dedbdf016a83b63477349f" + }, + "dplyr": { + "Package": "dplyr", + "Version": "1.1.3", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "R6", + "cli", + "generics", + "glue", + "lifecycle", + "magrittr", + "methods", + "pillar", + "rlang", + "tibble", + "tidyselect", + "utils", + "vctrs" + ], + "Hash": "e85ffbebaad5f70e1a2e2ef4302b4949" + }, + "ellipsis": { + "Package": "ellipsis", + "Version": "0.3.2", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "rlang" + ], + "Hash": "bb0eec2fe32e88d9e2836c2f73ea2077" + }, + "evaluate": { + "Package": "evaluate", + "Version": "0.21", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "methods" + ], + "Hash": "d59f3b464e8da1aef82dc04b588b8dfb" + }, + "fansi": { + "Package": "fansi", + "Version": "1.0.4", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "grDevices", + "utils" + ], + "Hash": "1d9e7ad3c8312a192dea7d3db0274fde" + }, + "farver": { + "Package": "farver", + "Version": "2.1.1", + "Source": "Repository", + "Repository": "CRAN", + "Hash": "8106d78941f34855c440ddb946b8f7a5" + }, + "fastmap": { + "Package": "fastmap", + "Version": "1.1.1", + "Source": "Repository", + "Repository": "CRAN", + "Hash": "f7736a18de97dea803bde0a2daaafb27" + }, + "fontawesome": { + "Package": "fontawesome", + "Version": "0.5.2", + "Source": "Repository", + 
"Repository": "CRAN", + "Requirements": [ + "R", + "htmltools", + "rlang" + ], + "Hash": "c2efdd5f0bcd1ea861c2d4e2a883a67d" + }, + "fs": { + "Package": "fs", + "Version": "1.6.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "methods" + ], + "Hash": "47b5f30c720c23999b913a1a635cf0bb" + }, + "generics": { + "Package": "generics", + "Version": "0.1.3", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "methods" + ], + "Hash": "15e9634c0fcd294799e9b2e929ed1b86" + }, + "ggplot2": { + "Package": "ggplot2", + "Version": "3.4.3", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "MASS", + "R", + "cli", + "glue", + "grDevices", + "grid", + "gtable", + "isoband", + "lifecycle", + "mgcv", + "rlang", + "scales", + "stats", + "tibble", + "vctrs", + "withr" + ], + "Hash": "85846544c596e71f8f46483ab165da33" + }, + "glue": { + "Package": "glue", + "Version": "1.6.2", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "methods" + ], + "Hash": "4f2596dfb05dac67b9dc558e5c6fba2e" + }, + "gridExtra": { + "Package": "gridExtra", + "Version": "2.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "grDevices", + "graphics", + "grid", + "gtable", + "utils" + ], + "Hash": "7d7f283939f563670a697165b2cf5560" + }, + "gtable": { + "Package": "gtable", + "Version": "0.3.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "glue", + "grid", + "lifecycle", + "rlang" + ], + "Hash": "b29cf3031f49b04ab9c852c912547eef" + }, + "highr": { + "Package": "highr", + "Version": "0.10", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "xfun" + ], + "Hash": "06230136b2d2b9ba5805e1963fa6e890" + }, + "hms": { + "Package": "hms", + "Version": "1.1.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "lifecycle", + "methods", + "pkgconfig", + "rlang", + "vctrs" + ], + "Hash": 
"b59377caa7ed00fa41808342002138f9" + }, + "htmltools": { + "Package": "htmltools", + "Version": "0.5.6", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "base64enc", + "digest", + "ellipsis", + "fastmap", + "grDevices", + "rlang", + "utils" + ], + "Hash": "a2326a66919a3311f7fbb1e3bf568283" + }, + "htmlwidgets": { + "Package": "htmlwidgets", + "Version": "1.6.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "grDevices", + "htmltools", + "jsonlite", + "knitr", + "rmarkdown", + "yaml" + ], + "Hash": "a865aa85bcb2697f47505bfd70422471" + }, + "igraph": { + "Package": "igraph", + "Version": "1.5.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "Matrix", + "R", + "cli", + "cpp11", + "grDevices", + "graphics", + "lifecycle", + "magrittr", + "methods", + "pkgconfig", + "rlang", + "stats", + "utils" + ], + "Hash": "80401cb5ec513e8ddc56764d03f63669" + }, + "isoband": { + "Package": "isoband", + "Version": "0.2.7", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "grid", + "utils" + ], + "Hash": "0080607b4a1a7b28979aecef976d8bc2" + }, + "jquerylib": { + "Package": "jquerylib", + "Version": "0.1.4", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "htmltools" + ], + "Hash": "5aab57a3bd297eee1c1d862735972182" + }, + "jsonlite": { + "Package": "jsonlite", + "Version": "1.8.7", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "methods" + ], + "Hash": "266a20443ca13c65688b2116d5220f76" + }, + "knitr": { + "Package": "knitr", + "Version": "1.43", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "evaluate", + "highr", + "methods", + "tools", + "xfun", + "yaml" + ], + "Hash": "9775eb076713f627c07ce41d8199d8f6" + }, + "labeling": { + "Package": "labeling", + "Version": "0.4.3", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "graphics", + "stats" + ], + "Hash": 
"b64ec208ac5bc1852b285f665d6368b3" + }, + "lattice": { + "Package": "lattice", + "Version": "0.21-8", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "graphics", + "grid", + "stats", + "utils" + ], + "Hash": "0b8a6d63c8770f02a8b5635f3c431e6b" + }, + "lifecycle": { + "Package": "lifecycle", + "Version": "1.0.3", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "cli", + "glue", + "rlang" + ], + "Hash": "001cecbeac1cff9301bdc3775ee46a86" + }, + "magrittr": { + "Package": "magrittr", + "Version": "2.0.3", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R" + ], + "Hash": "7ce2733a9826b3aeb1775d56fd305472" + }, + "memoise": { + "Package": "memoise", + "Version": "2.0.1", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "cachem", + "rlang" + ], + "Hash": "e2817ccf4a065c5d9d7f2cfbe7c1d78c" + }, + "mgcv": { + "Package": "mgcv", + "Version": "1.9-0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "Matrix", + "R", + "graphics", + "methods", + "nlme", + "splines", + "stats", + "utils" + ], + "Hash": "086028ca0460d0c368028d3bda58f31b" + }, + "mime": { + "Package": "mime", + "Version": "0.12", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "tools" + ], + "Hash": "18e9c28c1d3ca1560ce30658b22ce104" + }, + "munsell": { + "Package": "munsell", + "Version": "0.5.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "colorspace", + "methods" + ], + "Hash": "6dfe8bf774944bd5595785e3229d8771" + }, + "nlme": { + "Package": "nlme", + "Version": "3.1-163", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "graphics", + "lattice", + "stats", + "utils" + ], + "Hash": "8d1938040a05566f4f7a14af4feadd6b" + }, + "pillar": { + "Package": "pillar", + "Version": "1.9.0", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "cli", + "fansi", + "glue", + 
"lifecycle", + "rlang", + "utf8", + "utils", + "vctrs" + ], + "Hash": "15da5a8412f317beeee6175fbc76f4bb" + }, + "pkgconfig": { + "Package": "pkgconfig", + "Version": "2.0.3", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "utils" + ], + "Hash": "01f28d4278f15c76cddbea05899c5d6f" + }, + "plyr": { + "Package": "plyr", + "Version": "1.8.8", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "Rcpp" + ], + "Hash": "d744387aef9047b0b48be2933d78e862" + }, + "prettyunits": { + "Package": "prettyunits", + "Version": "1.1.1", + "Source": "Repository", + "Repository": "CRAN", + "Hash": "95ef9167b75dde9d2ccc3c7528393e7e" + }, + "progress": { + "Package": "progress", + "Version": "1.2.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R6", + "crayon", + "hms", + "prettyunits" + ], + "Hash": "14dc9f7a3c91ebb14ec5bb9208a07061" + }, + "purrr": { + "Package": "purrr", + "Version": "1.0.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "lifecycle", + "magrittr", + "rlang", + "vctrs" + ], + "Hash": "1cba04a4e9414bdefc9dcaa99649a8dc" + }, + "rappdirs": { + "Package": "rappdirs", + "Version": "0.3.3", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R" + ], + "Hash": "5e3c5dc0b071b21fa128676560dbe94d" + }, + "readr": { + "Package": "readr", + "Version": "2.1.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "R6", + "cli", + "clipr", + "cpp11", + "crayon", + "hms", + "lifecycle", + "methods", + "rlang", + "tibble", + "tzdb", + "utils", + "vroom" + ], + "Hash": "b5047343b3825f37ad9d3b5d89aa1078" + }, + "renv": { + "Package": "renv", + "Version": "1.0.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "utils" + ], + "Hash": "4b22ac016fe54028b88d0c68badbd061" + }, + "rlang": { + "Package": "rlang", + "Version": "1.1.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + 
"R", + "utils" + ], + "Hash": "a85c767b55f0bf9b7ad16c6d7baee5bb" + }, + "rmarkdown": { + "Package": "rmarkdown", + "Version": "2.24", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "bslib", + "evaluate", + "fontawesome", + "htmltools", + "jquerylib", + "jsonlite", + "knitr", + "methods", + "stringr", + "tinytex", + "tools", + "utils", + "xfun", + "yaml" + ], + "Hash": "3854c37590717c08c32ec8542a2e0a35" + }, + "rstudioapi": { + "Package": "rstudioapi", + "Version": "0.15.0", + "Source": "Repository", + "Repository": "CRAN", + "Hash": "5564500e25cffad9e22244ced1379887" + }, + "sass": { + "Package": "sass", + "Version": "0.4.7", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R6", + "fs", + "htmltools", + "rappdirs", + "rlang" + ], + "Hash": "6bd4d33b50ff927191ec9acbf52fd056" + }, + "scales": { + "Package": "scales", + "Version": "1.2.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "R6", + "RColorBrewer", + "farver", + "labeling", + "lifecycle", + "munsell", + "rlang", + "viridisLite" + ], + "Hash": "906cb23d2f1c5680b8ce439b44c6fa63" + }, + "stringi": { + "Package": "stringi", + "Version": "1.7.12", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "stats", + "tools", + "utils" + ], + "Hash": "ca8bd84263c77310739d2cf64d84d7c9" + }, + "stringr": { + "Package": "stringr", + "Version": "1.5.0", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "cli", + "glue", + "lifecycle", + "magrittr", + "rlang", + "stringi", + "vctrs" + ], + "Hash": "671a4d384ae9d32fc47a14e98bfa3dc8" + }, + "tibble": { + "Package": "tibble", + "Version": "3.2.1", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "fansi", + "lifecycle", + "magrittr", + "methods", + "pillar", + "pkgconfig", + "rlang", + "utils", + "vctrs" + ], + "Hash": "a84e2cc86d07289b3b6f5069df7a004c" + }, + "tidyr": { + "Package": "tidyr", + "Version": 
"1.3.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "cpp11", + "dplyr", + "glue", + "lifecycle", + "magrittr", + "purrr", + "rlang", + "stringr", + "tibble", + "tidyselect", + "utils", + "vctrs" + ], + "Hash": "e47debdc7ce599b070c8e78e8ac0cfcf" + }, + "tidyselect": { + "Package": "tidyselect", + "Version": "1.2.0", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "cli", + "glue", + "lifecycle", + "rlang", + "vctrs", + "withr" + ], + "Hash": "79540e5fcd9e0435af547d885f184fd5" + }, + "tinytex": { + "Package": "tinytex", + "Version": "0.46", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "xfun" + ], + "Hash": "0c41a73214d982f539c56a7773c7afa5" + }, + "tzdb": { + "Package": "tzdb", + "Version": "0.4.0", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "cpp11" + ], + "Hash": "f561504ec2897f4d46f0c7657e488ae1" + }, + "utf8": { + "Package": "utf8", + "Version": "1.2.3", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R" + ], + "Hash": "1fe17157424bb09c48a8b3b550c753bc" + }, + "vctrs": { + "Package": "vctrs", + "Version": "0.6.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "glue", + "lifecycle", + "rlang" + ], + "Hash": "d0ef2856b83dc33ea6e255caf6229ee2" + }, + "viridis": { + "Package": "viridis", + "Version": "0.6.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "ggplot2", + "gridExtra", + "viridisLite" + ], + "Hash": "80cd127bc8c9d3d9f0904ead9a9102f1" + }, + "viridisLite": { + "Package": "viridisLite", + "Version": "0.4.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "c826c7c4241b6fc89ff55aaea3fa7491" + }, + "visNetwork": { + "Package": "visNetwork", + "Version": "2.1.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "htmltools", + "htmlwidgets", + 
"jsonlite", + "magrittr", + "methods", + "stats", + "utils" + ], + "Hash": "3e48b097e8d9a91ecced2ed4817a678d" + }, + "vroom": { + "Package": "vroom", + "Version": "1.6.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "bit64", + "cli", + "cpp11", + "crayon", + "glue", + "hms", + "lifecycle", + "methods", + "progress", + "rlang", + "stats", + "tibble", + "tidyselect", + "tzdb", + "vctrs", + "withr" + ], + "Hash": "8318e64ffb3a70e652494017ec455561" + }, + "withr": { + "Package": "withr", + "Version": "2.5.0", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "grDevices", + "graphics", + "stats" + ], + "Hash": "c0e49a9760983e81e55cdd9be92e7182" + }, + "xfun": { + "Package": "xfun", + "Version": "0.40", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "stats", + "tools" + ], + "Hash": "be07d23211245fc7d4209f54c4e4ffc8" + }, + "yaml": { + "Package": "yaml", + "Version": "2.3.7", + "Source": "Repository", + "Repository": "CRAN", + "Hash": "0d0056cc5383fbc240ccd0cb584bf436" + } + } +} diff --git a/results/lifeExp.png b/results/lifeExp.png new file mode 100644 index 000000000..1be23f640 Binary files /dev/null and b/results/lifeExp.png differ diff --git a/safari-pinned-tab.svg b/safari-pinned-tab.svg new file mode 100644 index 000000000..8a74e60c8 --- /dev/null +++ b/safari-pinned-tab.svg @@ -0,0 +1,68 @@ + + + + +Created by potrace 1.14, written by Peter Selinger 2001-2017 + + + + + + + + diff --git a/site.webmanifest b/site.webmanifest new file mode 100644 index 000000000..f2302ffdd --- /dev/null +++ b/site.webmanifest @@ -0,0 +1,19 @@ +{ + "name": "The Carpentries", + "short_name": "The Carpentries", + "icons": [ + { + "src": "/android-chrome-192x192.png", + "sizes": "192x192", + "type": "image/png" + }, + { + "src": "/android-chrome-512x512.png", + "sizes": "512x512", + "type": "image/png" + } + ], + "theme_color": "#ffffff", + "background_color": "#ffffff", + "display": "standalone" +} 
diff --git a/sitemap.xml b/sitemap.xml new file mode 100644 index 000000000..1bbe1821e --- /dev/null +++ b/sitemap.xml @@ -0,0 +1,141 @@ + + + + https://swcarpentry.github.io/r-novice-gapminder/01-rstudio-intro.html + + + https://swcarpentry.github.io/r-novice-gapminder/02-project-intro.html + + + https://swcarpentry.github.io/r-novice-gapminder/03-seeking-help.html + + + https://swcarpentry.github.io/r-novice-gapminder/04-data-structures-part1.html + + + https://swcarpentry.github.io/r-novice-gapminder/05-data-structures-part2.html + + + https://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting.html + + + https://swcarpentry.github.io/r-novice-gapminder/07-control-flow.html + + + https://swcarpentry.github.io/r-novice-gapminder/08-plot-ggplot2.html + + + https://swcarpentry.github.io/r-novice-gapminder/09-vectorization.html + + + https://swcarpentry.github.io/r-novice-gapminder/10-functions.html + + + https://swcarpentry.github.io/r-novice-gapminder/11-writing-data.html + + + https://swcarpentry.github.io/r-novice-gapminder/12-plyr.html + + + https://swcarpentry.github.io/r-novice-gapminder/13-dplyr.html + + + https://swcarpentry.github.io/r-novice-gapminder/14-tidyr.html + + + https://swcarpentry.github.io/r-novice-gapminder/15-knitr-markdown.html + + + https://swcarpentry.github.io/r-novice-gapminder/16-wrap-up.html + + + https://swcarpentry.github.io/r-novice-gapminder/404.html + + + https://swcarpentry.github.io/r-novice-gapminder/CODE_OF_CONDUCT.html + + + https://swcarpentry.github.io/r-novice-gapminder/LICENSE.html + + + https://swcarpentry.github.io/r-novice-gapminder/discuss.html + + + https://swcarpentry.github.io/r-novice-gapminder/index.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/01-rstudio-intro.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/02-project-intro.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/03-seeking-help.html + + + 
https://swcarpentry.github.io/r-novice-gapminder/instructor/04-data-structures-part1.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/05-data-structures-part2.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/06-data-subsetting.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/07-control-flow.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/08-plot-ggplot2.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/09-vectorization.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/10-functions.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/11-writing-data.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/12-plyr.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/13-dplyr.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/14-tidyr.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/15-knitr-markdown.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/16-wrap-up.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/404.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/CODE_OF_CONDUCT.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/LICENSE.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/discuss.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/index.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/profiles.html + + + https://swcarpentry.github.io/r-novice-gapminder/instructor/reference.html + + + https://swcarpentry.github.io/r-novice-gapminder/profiles.html + + + https://swcarpentry.github.io/r-novice-gapminder/reference.html + +