Describe the purpose and use of each pane in the RStudio IDE
+
Locate buttons and options in the RStudio IDE
+
Define a variable
+
Assign data to a variable
+
Manage a workspace in an interactive R session
+
Use mathematical and comparison operators
+
Call functions
+
Manage packages
+
+
+
+
+
+
Motivation
+
+
Science is a multi-step process: once you’ve designed an experiment
+and collected data, the real fun begins! This lesson will teach you how
+to start this process using R and RStudio. We will begin with raw data,
+perform exploratory analyses, and learn how to plot results graphically.
+This example starts with a dataset from gapminder.org containing population
+information for many countries through time. Can you read the data into
+R? Can you plot the population for Senegal? Can you calculate the
+average income for countries on the continent of Asia? By the end of
+these lessons you will be able to do things like plot the populations
+for all of these countries in under a minute!
+
Before Starting The Workshop
+
+
Please ensure you have the latest version of R and RStudio installed
+on your machine. This is important, as some packages used in the
+workshop may not install correctly (or at all) if R is not up to
+date.
Welcome to the R portion of the Software Carpentry workshop.
+
Throughout this lesson, we’re going to teach you some of the
+fundamentals of the R language as well as some best practices for
+organizing code for scientific projects that will make your life
+easier.
+
We’ll be using RStudio: a free, open-source R Integrated Development
+Environment (IDE). It provides a built-in editor, works on all platforms
+(including on servers) and provides many advantages such as integration
+with version control and project management.
+
Basic layout
+
When you first open RStudio, you will be greeted by three panels:
+
The interactive R console/Terminal (entire left)
+
Environment/History/Connections (tabbed in upper right)
+
Files/Plots/Packages/Help/Viewer (tabbed in lower right)
+
Once you open files, such as R scripts, an editor panel will also
+open in the top left.
+
+
+
+
+
+
R scripts
+
+
+
Any commands that you write in the R console can be saved to a file
+and run again later. Files containing R code to be run in this way are
+called R scripts. R scripts have .R at the end of their
+names to let you know what they are.
+
+
+
+
Workflow within RStudio
+
+
There are two main ways one can work within RStudio:
+
Test and play within the interactive R console then copy code into a
+.R file to run later.
+
This works well when doing small tests and initially starting
+off.
+
It quickly becomes laborious.
+
Start writing in a .R file and use RStudio’s shortcut keys for the
+Run command to push the current line, selected lines or modified lines
+to the interactive R console.
+
This is a great way to start; all your code is saved for later.
+
You will be able to run the file you create from within RStudio or
+using R’s source() function.
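As a minimal sketch of the second workflow, the snippet below writes a two-line script to a temporary file and then runs it with source(). The names script_path, radius and area are made up for illustration; in practice the script would be a file you saved in your project.

```r
# Write a tiny script to a temporary file (a stand-in for a script you saved yourself)
script_path <- tempfile(fileext = ".R")
writeLines(c("radius <- 2", "area <- pi * radius^2"), script_path)

# Run every line of the script in the current session
source(script_path)
area
```

After source() finishes, the variables created by the script are available in the session, just as if you had typed the lines into the console.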
+
+
+
+
+
+
Tip: Running segments of your code
+
+
+
RStudio offers you great flexibility in running code from within the
+editor window. There are buttons, menu choices, and keyboard shortcuts.
+To run the current line, you can
+
click on the Run button above the editor panel, or
+
select “Run Lines” from the “Code” menu, or
+
hit Ctrl+Return in Windows or Linux or
+⌘+Return on OS X. (This shortcut can also be seen
+by hovering the mouse over the button.)
+
To run a block of code, select
+it and then Run. If you have modified a line of code within
+a block of code you have just run, there is no need to reselect the
+section and Run; you can use the next button along,
+Re-run the previous region. This will run the previous code
+block including the modifications you have made.
+
+
+
+
Introduction to R
+
+
Much of your time in R will be spent in the R interactive console.
+This is where you will run all of your code, and can be a useful
+environment to try out ideas before adding them to an R script file.
+This console in RStudio is the same as the one you would get if you
+typed in R in your command-line environment.
+
The first thing you will see in the R interactive session is a bunch
+of information, followed by a “>” and a blinking cursor. In many ways
+this is similar to the shell environment you learned about during the
+shell lessons: it operates on the same idea of a “Read, evaluate, print
+loop”: you type in commands, R tries to execute them, and then returns a
+result.
+
Using R as a calculator
+
+
The simplest thing you could do with R is to do arithmetic:
+
+
R
+
+
+1+100
+
+
+
OUTPUT
+
+
[1] 101
+
+
And R will print out the answer, with a preceding “[1]”. [1] is the
+index of the first element of the line being printed in the console. For
+more information on indexing vectors, see Episode
+6: Subsetting Data.
+
If you type in an incomplete command, R will wait for you to complete
+it. If you are familiar with Unix Shell’s bash, you may recognize
+this behavior from bash.
+
+
R
+
+
>1+
+
+
+
OUTPUT
+
+
+
+
+
Any time you hit return and the R session shows a “+” instead of a
+“>”, it means it’s waiting for you to complete the command. If you
+want to cancel a command you can hit Esc and RStudio will
+give you back the “>” prompt.
+
+
+
+
+
+
Tip: Canceling commands
+
+
+
If you’re using R from the command line instead of from within
+RStudio, you need to use Ctrl+C instead of
+Esc to cancel the command. This applies to Mac users as
+well!
+
Canceling a command isn’t only useful for killing incomplete
+commands: you can also use it to tell R to stop running code (for
+example if it’s taking much longer than you expect), or to get rid of
+the code you’re currently writing.
+
+
+
+
When using R as a calculator, the order of operations is the same as
+you would have learned back in school.
+
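From highest to lowest precedence (multiplication and division share one level, as do addition and subtraction, and each level is evaluated left to right):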
+
Parentheses: (, )
+
+
Exponents: ^ or **
+
+
Multiply: *
+
+
Divide: /
+
+
Add: +
+
+
Subtract: -
+
+
+
R
+
+
+3+5*2
+
+
+
OUTPUT
+
+
[1] 13
+
+
Use parentheses to group operations in order to force the order of
+evaluation if it differs from the default, or to make clear what you
+intend.
+
+
R
+
+
+(3+5)*2
+
+
+
OUTPUT
+
+
[1] 16
+
+
This can get unwieldy when not needed, but clarifies your intentions.
+Remember that others may later read your code.
+
+
R
+
+
+(3 + (5 * (2 ^ 2))) # hard to read
+3 + 5 * 2 ^ 2       # clear, if you remember the rules
+3 + 5 * (2 ^ 2)     # if you forget some rules, this might help
+
+
The text after each line of code is called a “comment”. Anything that
+follows after the hash (or octothorpe) symbol # is ignored
+by R when it executes code.
+
Really small or large numbers get a scientific notation:
+
+
R
+
+
+2/10000
+
+
+
OUTPUT
+
+
[1] 2e-04
+
+
Here the e is shorthand for “multiplied by 10^XX”. So
+2e-4 is shorthand for 2 * 10^(-4).
+
You can write numbers in scientific notation too:
+
+
R
+
+
+5e3 # Note the lack of minus here
+
+
+
OUTPUT
+
+
[1] 5000
+
+
Mathematical functions
+
+
R has many built-in mathematical functions. To call a function, we
+can type its name, followed by open and closing parentheses. Functions
+take arguments as inputs; anything we type inside the parentheses of a
+function is considered an argument. Depending on the function, the
+number of arguments can vary from none to multiple. For example:
+
+
R
+
+
+getwd() # returns an absolute filepath
+
+
doesn’t require an argument, whereas for the next set of mathematical
+functions we will need to supply the function a value in order to
+compute the result.
+
+
R
+
+
+sin(1) # trigonometry functions
+
+
+
OUTPUT
+
+
[1] 0.841471
+
+
+
R
+
+
+log(1) # natural logarithm
+
+
+
OUTPUT
+
+
[1] 0
+
+
+
R
+
+
+log10(10) # base-10 logarithm
+
+
+
OUTPUT
+
+
[1] 1
+
+
+
R
+
+
+exp(0.5) # e^(1/2)
+
+
+
OUTPUT
+
+
[1] 1.648721
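Some functions take more than one argument, and arguments can be passed either by position or by name. The calls below are a small illustration using two base R functions, log() and round(); the input values are arbitrary.

```r
log(8, base = 2)            # two arguments: the value and the base of the logarithm
round(3.14159, digits = 2)  # the second argument is passed by name
round(3.14159, 2)           # the same call, with the argument passed by position
```

Passing arguments by name takes a little more typing, but it makes the call easier to read later.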
+
+
Don’t worry about trying to remember every function in R. You can
+look them up on Google, or if you can remember the start of the
+function’s name, use the tab completion in RStudio.
+
This is one advantage that RStudio has over R on its own: it has
+auto-completion abilities that allow you to more easily look up
+functions, their arguments, and the values that they take.
+
Typing a ? before the name of a command will open the
+help page for that command. When using RStudio, this will open the
+‘Help’ pane; if using R in the terminal, the help page will open as plain
+text in the terminal or in your browser, depending on your platform and
+settings. The help page will include a detailed description of the
+command and how it works. Scrolling to the bottom of the help page will
+usually show a collection of code examples which illustrate command
+usage. We’ll go through an example later.
+
Comparing things
+
+
We can also do comparisons in R:
+
+
R
+
+
+1 == 1 # equality (note two equals signs, read as "is equal to")
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
R
+
+
+1 != 2 # inequality (read as "is not equal to")
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
R
+
+
+1 < 2 # less than
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
R
+
+
+1 <= 1 # less than or equal to
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
R
+
+
+1 > 0 # greater than
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
R
+
+
+1 >= -9 # greater than or equal to
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
+
+
+
+
Tip: Comparing Numbers
+
+
+
A word of warning about comparing numbers: you should never use
+== to compare two numbers unless they are integers (a data
+type which can specifically represent only whole numbers).
+
Computers may only represent decimal numbers with a certain degree of
+precision, so two numbers which look the same when printed out by R may
+actually have different underlying representations and therefore be
+different by a small margin of error (called machine numeric
+tolerance).
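
A short sketch of the problem, using the classic 0.1 + 0.2 example. One common alternative to == for decimal numbers is the base R function all.equal(), which compares within a small tolerance; wrapping it in isTRUE() gives a plain TRUE or FALSE.

```r
0.1 + 0.2 == 0.3                    # FALSE: the two values differ by a tiny rounding error
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: equal within the default tolerance
```
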
We can store values in variables using the assignment operator
+<-, like this:
+
+
R
+
+
+x<-1/40
+
+
Notice that assignment does not print a value. Instead, we stored it
+for later in something called a variable.
+x now contains the value
+0.025:
+
+
R
+
+
+x
+
+
+
OUTPUT
+
+
[1] 0.025
+
+
More precisely, the stored value is a decimal approximation
+of this fraction called a floating point
+number.
+
Look for the Environment tab in the top right panel of
+RStudio, and you will see that x and its value have
+appeared. Our variable x can be used in place of a number
+in any calculation that expects a number:
+
+
R
+
+
+log(x)
+
+
+
OUTPUT
+
+
[1] -3.688879
+
+
Notice also that variables can be reassigned:
+
+
R
+
+
+x<-100
+
+
x used to contain the value 0.025 and now it has the
+value 100.
+
Assignment values can contain the variable being assigned to:
+
+
R
+
+
+x <- x + 1 # notice how RStudio updates its description of x on the top right tab
+y <- x * 2
+
+
The right hand side of the assignment can be any valid R expression.
+The right hand side is fully evaluated before the assignment
+occurs.
+
Variable names can contain letters, numbers, underscores and periods
+but no spaces. They must start with a letter or a period followed by a
+letter (they cannot start with a number nor an underscore). Variables
+beginning with a period are hidden variables. Different people use
+different conventions for long variable names. These include
+
periods.between.words
+
underscores_between_words
+
camelCaseToSeparateWords
+
What you use is up to you, but be consistent.
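
One way to check the naming rules without memorising them is sketched below. It relies on the base R function make.names(), which rewrites a string into a syntactically valid name: a string that comes back unchanged was already valid. The candidate names are made up for illustration.

```r
candidates <- c("min_height", "max.height", "_age", ".mass", "MaxLength",
                "min-length", "2widths", "celsius2kelvin")

# A name is valid if make.names() has no need to change it
is_valid <- make.names(candidates) == candidates

candidates[is_valid]    # names R accepts as written
candidates[!is_valid]   # names that break one of the rules
```
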
+
It is also possible to use the = operator for
+assignment:
+
+
R
+
+
+x=1/40
+
+
But this is much less common among R users. The most important thing
+is to be consistent with the operator you use. There
+are occasionally places where it is less confusing to use
+<- than =, and it is the most common symbol
+used in the community. So the recommendation is to use
+<-.
+
+
+
+
+
+
Challenge 1
+
+
+
Which of the following are valid R variable names?
The following cannot be used to create a variable:
+
+
R
+
+
_age
+min-length
+2widths
+
+
+
+
+
+
Vectorization
+
+
One final thing to be aware of is that R is vectorized,
+meaning that variables and functions can have vectors as values. In
+contrast to physics and mathematics, a vector in R describes a set of
+values in a certain order of the same data type. For example
+
+
R
+
+
+1:5
+
+
+
OUTPUT
+
+
[1] 1 2 3 4 5
+
+
+
R
+
+
+2^(1:5)
+
+
+
OUTPUT
+
+
[1] 2 4 8 16 32
+
+
+
R
+
+
+x<-1:5
+2^x
+
+
+
OUTPUT
+
+
[1] 2 4 8 16 32
+
+
This is incredibly powerful; we will discuss this further in an
+upcoming lesson.
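
As a further small illustration, arithmetic, comparisons and most mathematical functions all work element by element on a vector. The vector heights_cm is invented for this sketch, and it is built with c(), which combines values into a vector.

```r
heights_cm <- c(150, 165, 180)   # a made-up numeric vector
heights_cm / 100                 # one division applied to every element
heights_cm > 160                 # comparisons are vectorized too
sqrt(c(1, 4, 9))                 # so are most mathematical functions
```
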
+
Managing your environment
+
+
There are a few useful commands you can use to interact with the R
+session.
+
ls will list all of the variables and functions stored
+in the global environment (your working R session):
+
+
R
+
+
+ls()
+
+
+
OUTPUT
+
+
[1] "x" "y"
+
+
+
+
+
+
+
Tip: hidden objects
+
+
+
Like in the shell, ls will hide any variables or
+functions starting with a “.” by default. To list all objects, type
+ls(all.names=TRUE) instead.
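
The sketch below shows the difference in a scratch environment, so that it does not touch your own workspace. The names demo_env, visible and .hidden are hypothetical.

```r
demo_env <- new.env()                    # a scratch environment, separate from your workspace
assign("visible", 1, envir = demo_env)
assign(".hidden", 2, envir = demo_env)

ls(demo_env)                             # lists only "visible"
ls(demo_env, all.names = TRUE)           # also lists ".hidden"
```
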
+
+
+
+
Note here that we didn’t give any arguments to ls, but
+we still needed to give the parentheses to tell R to call the
+function.
+
If we type ls by itself, R prints a bunch of code
+instead of a listing of objects.
+
+
R
+
+
+ls
+
+
+
OUTPUT
+
+
function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE,
+ pattern, sorted = TRUE)
+{
+ if (!missing(name)) {
+ pos <- tryCatch(name, error = function(e) e)
+ if (inherits(pos, "error")) {
+ name <- substitute(name)
+ if (!is.character(name))
+ name <- deparse(name)
+ warning(gettextf("%s converted to character string",
+ sQuote(name)), domain = NA)
+ pos <- name
+ }
+ }
+ all.names <- .Internal(ls(envir, all.names, sorted))
+ if (!missing(pattern)) {
+ if ((ll <- length(grep("[", pattern, fixed = TRUE))) &&
+ ll != length(grep("]", pattern, fixed = TRUE))) {
+ if (pattern == "[") {
+ pattern <- "\\["
+ warning("replaced regular expression pattern '[' by '\\\\['")
+ }
+ else if (length(grep("[^\\\\]\\[<-", pattern))) {
+ pattern <- sub("\\[<-", "\\\\\\[<-", pattern)
+ warning("replaced '[<-' by '\\\\[<-' in regular expression pattern")
+ }
+ }
+ grep(pattern, all.names, value = TRUE)
+ }
+ else all.names
+}
+<bytecode: 0x557b0600c360>
+<environment: namespace:base>
+
+
What’s going on here?
+
Like everything in R, ls is the name of an object, and
+entering the name of an object by itself prints the contents of the
+object. The object x that we created earlier contains 1, 2,
+3, 4, 5:
+
+
R
+
+
+x
+
+
+
OUTPUT
+
+
[1] 1 2 3 4 5
+
+
The object ls contains the R code that makes the
+ls function work! We’ll talk more about how functions work
+and start writing our own later.
+
You can use rm to delete objects you no longer need:
+
+
R
+
+
+rm(x)
+
+
If you have lots of things in your environment and want to delete all
+of them, you can pass the results of ls to the
+rm function:
+
+
R
+
+
+rm(list = ls())
+
+
In this case we’ve combined the two. Like the order of operations,
+anything inside the innermost parentheses is evaluated first, and so
+on.
+
In this case we’ve specified that the results of ls
+should be used for the list argument in rm.
+When assigning values to arguments by name, you must use the
+= operator.
+
If instead we use <-, there will be unintended side
+effects, or you may get an error message:
+
+
R
+
+
+rm(list <- ls())
+
+
+
ERROR
+
+
Error in rm(list <- ls()): ... must contain names or character strings
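
The unintended side effect is easier to see with a function that does not fail. In this sketch, mean() is used only as a convenient example and demo_values is a made-up name.

```r
mean(x = c(2, 4, 6))              # '=' names the argument x of mean(); no variable is created
exists("demo_values")             # FALSE, assuming you have not created demo_values yourself

mean(demo_values <- c(2, 4, 6))   # '<-' first creates demo_values, then passes its value to mean()
exists("demo_values")             # TRUE: the assignment leaked into your workspace
```
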
+
+
+
+
+
+
+
Tip: Warnings vs. Errors
+
+
+
Pay attention when R does something unexpected! Errors, like above,
+are thrown when R cannot proceed with a calculation. Warnings on the
+other hand usually mean that the function has run, but it probably
+hasn’t worked as expected.
+
In both cases, the message that R prints out usually gives you clues
+on how to fix a problem.
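
A minimal sketch of the difference, using tryCatch() to capture the messages so that the example runs from start to finish. The exact wording of the messages depends on your R version and language settings.

```r
# A warning: as.numeric() still returns a value (NA), but tells you something went wrong
warning_text <- tryCatch(as.numeric("ten"), warning = function(w) conditionMessage(w))
warning_text

# An error: adding a number to a character string cannot be computed at all
error_text <- tryCatch(1 + "a", error = function(e) conditionMessage(e))
error_text
```
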
+
+
+
+
R Packages
+
+
It is possible to add functions to R by writing a package, or by
+obtaining a package written by someone else. As of this writing, there
+are over 10,000 packages available on CRAN (the Comprehensive R Archive
+Network). R and RStudio have functionality for managing packages:
+
You can see what packages are installed by typing
+installed.packages()
+
+
You can install packages by typing
+install.packages("packagename"), where
+packagename is the package name, in quotes.
+
You can update installed packages by typing
+update.packages()
+
+
You can remove a package with
+remove.packages("packagename")
+
+
You can make a package available for use with
+library(packagename)
+
+
Packages can also be viewed, loaded, and detached in the Packages tab
+of the lower right panel in RStudio. Clicking on this tab will display
+all of the installed packages with a checkbox next to them. If the box
+next to a package name is checked, the package is loaded and if it is
+empty, the package is not loaded. Click an empty box to load that
+package and click a checked box to detach that package.
+
Packages can be installed and updated from the Packages tab with the
+Install and Update buttons at the top of the tab.
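
The same checks can be made from the console. The sketch below uses the stats package only because it is distributed with R, so it should be present on any installation; the install.packages() line is left as a comment because it needs internet access and a real package name.

```r
installed <- rownames(installed.packages())  # names of all installed packages
"stats" %in% installed                       # TRUE: stats ships with R

library(stats)                               # make the package available (it is attached by default)
"package:stats" %in% search()                # attached packages appear on the search path

# install.packages("packagename")            # replace packagename with the package you need
```
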
+
+
+
+
+
+
Challenge 2
+
+
+
What will be the value of each variable after each statement in the
+following program?
The scientific process is naturally incremental, and many projects
+start life as random notes, some code, then a manuscript, and eventually
+everything is a bit mixed together.
+
+
+Managing your projects in a reproducible fashion doesn’t just make your
+science reproducible, it makes your life easier.
+
Most people tend to organize their projects like this:
+
There are many reasons why we should ALWAYS avoid this:
+
It is really hard to tell which version of your data is the original
+and which is the modified.
+
It gets really messy because it mixes files with various extensions
+together.
+
It probably takes you a lot of time to actually find things, and
+relate the correct figures to the exact code that has been used to
+generate them.
+
A good project layout will ultimately make your life easier:
+
It will help ensure the integrity of your data;
+
It makes it simpler to share your code with someone else (a
+lab-mate, collaborator, or supervisor);
+
It allows you to easily upload your code with your manuscript
+submission;
+
It makes it easier to pick the project back up after a break.
+
A possible solution
+
+
Fortunately, there are tools and packages which can help you manage
+your work effectively.
+
One of the most powerful and useful aspects of RStudio is its project
+management functionality. We’ll be using this today to create a
+self-contained, reproducible project.
+
+
+
+
+
+
Challenge 1: Creating a self-contained
+project
+
+
+
We’re going to create a new project in RStudio:
+
Click the “File” menu button, then “New Project”.
+
Click “New Directory”.
+
Click “New Project”.
+
Type in the name of the directory to store your project,
+e.g. “my_project”.
+
If available, select the checkbox for “Create a git
+repository.”
+
Click the “Create Project” button.
+
+
+
+
The simplest way to open an RStudio project once it has been created
+is to click through your file system to get to the directory where it
+was saved and double click on the .Rproj file. This will
+open RStudio and start your R session in the same directory as the
+.Rproj file. All your data, plots and scripts will now be
+relative to the project directory. RStudio projects have the added
+benefit of allowing you to open multiple projects at the same time each
+open to its own project directory. This allows you to keep multiple
+projects open without them interfering with each other.
+
+
+
+
+
+
Challenge 2: Opening an RStudio project
+through the file system
+
+
+
Exit RStudio.
+
Navigate to the directory where you created a project in Challenge
+1.
+
Double click on the .Rproj file in that directory.
+
+
+
+
Best practices for project organization
+
+
Although there is no “best” way to lay out a project, there are some
+general principles to adhere to that will make project management
+easier:
+
+
Treat data as read only
+
This is probably the most important goal of setting up a project.
+Data is typically time consuming and/or expensive to collect. Working
+with it interactively (e.g., in Excel) where it can be modified
+means you are never sure of where the data came from, or how it has been
+modified since collection. It is therefore a good idea to treat your
+data as “read-only”.
+
+
+
Data Cleaning
+
In many cases your data will be “dirty”: it will need significant
+preprocessing to get into a format R (or any other programming language)
+will find useful. This task is sometimes called “data munging”. Storing
+the scripts that do this cleaning in a separate folder, and creating a second “read-only”
+data folder to hold the “cleaned” data sets can prevent confusion
+between the two sets.
+
+
+
Treat generated output as disposable
+
Anything generated by your scripts should be treated as disposable:
+it should all be able to be regenerated from your scripts.
+
There are lots of different ways to manage this output. Having an
+output folder with different sub-directories for each separate analysis
+makes it easier later, since many analyses are exploratory and don’t end
+up being used in the final project, and some of the analyses get shared
+between projects.
+
+
+
+
+
+
Tip: Good Enough Practices for Scientific
+Computing
+
Put each project in its own directory, which is named after the
+project.
+
Put text documents associated with the project in the
+doc directory.
+
Put raw data and metadata in the data directory, and
+files generated during cleanup and analysis in a results
+directory.
+
Put source for the project’s scripts and programs in the
+src directory, and programs brought in from elsewhere or
+compiled locally in the bin directory.
+
Name all files to reflect their content or function.
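
The layout described in this tip can be sketched from R itself. The project is created under tempdir() here so that the example does not touch your real files, and the project name my_project is only a placeholder.

```r
project_root <- file.path(tempdir(), "my_project")   # hypothetical project location
sub_dirs <- c("doc", "data", "results", "src", "bin")

# Create the project directory and each sub-directory named in the tip
for (d in sub_dirs) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

list.files(project_root)   # the five folders that were just created
```
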
+
+
+
+
+
+
Separate function definition and application
+
One of the more effective ways to work with R is to start by writing
+the code you want to run directly in a .R script, and then running the
+selected lines (either using the keyboard shortcuts in RStudio or
+clicking the “Run” button) in the interactive R console.
+
When your project is in its early stages, the initial .R script file
+usually contains many lines of directly executed code. As it matures,
+reusable chunks get pulled into their own functions. It’s a good idea to
+keep these in two separate folders: one to store useful
+functions that you’ll reuse across analyses and projects, and one to
+store the analysis scripts.
+
+
+
Save the data in the data directory
+
Now that we have a good directory structure, we will save the
+data file in the data/ directory.
Download the file (right mouse click on the link above -> “Save
+link as” / “Save file as”, or click on the link and after the page
+loads, press Ctrl+S or choose File -> “Save
+page as”)
+
Make sure it’s saved under the name
+gapminder_data.csv
+
+
Save the file in the data/ folder within your
+project.
+
We will load and inspect these data later.
+
+
+
+
+
+
+
+
+
Challenge 4
+
+
+
It is useful to get some general idea about the dataset, directly
+from the command line, before loading it into R. Understanding the
+dataset better will come in handy when making decisions on how to load
+it in R. Use the command-line shell to answer the following
+questions:
+
What is the size of the file?
+
How many rows of data does it contain?
+
What kinds of values are stored in this file?
+
+
+
+
+
+
+
+
+
By running these commands in the shell:
+
+
SH
+
+
ls -lh data/gapminder_data.csv
+
+
+
OUTPUT
+
+
-rw-r--r-- 1 runner docker 80K Oct 26 09:54 data/gapminder_data.csv
The Terminal tab in the console pane provides a convenient place
+within RStudio to interact directly with the command line.
+
+
+
+
+
+
Working directory
+
Knowing R’s current working directory is important because when you
+need to access other files (for example, to import a data file), R will
+look for them relative to the current working directory.
+
Each time you create a new RStudio Project, it will create a new
+directory for that project. When you open an existing
+.Rproj file, it will open that project and set R’s working
+directory to the folder that file is in.
+
+
+
+
+
+
Challenge 5
+
+
+
You can check the current working directory with the
+getwd() command, or by using the menus in RStudio.
+
In the console, type getwd() (“wd” is short for
+“working directory”) and hit Enter.
+
In the Files pane, double click on the data folder to
+open it (or navigate to any other folder you wish). To get the Files
+pane back to the current working directory, click “More” and then select
+“Go To Working Directory”.
+
You can change the working directory with setwd(), or by
+using RStudio menus.
+
In the console, type setwd("data") and hit Enter. Type
+getwd() and hit Enter to see the new working
+directory.
+
In the menus at the top of the RStudio window, click the “Session”
+menu button, and then select “Set Working Directory” and then “Choose
+Directory”. Next, in the file browser window that opens, navigate back to
+the project directory, and click “Open”. Note that a setwd
+command will automatically appear in the console.
+
+
+
+
+
+
+
+
+
Tip: File does not exist errors
+
+
+
When you’re attempting to reference a file in your R code and you’re
+getting errors saying the file doesn’t exist, it’s a good idea to check
+your working directory. You need to either provide an absolute path to
+the file, or you need to make sure the file is saved in the working
+directory (or a subfolder of the working directory) and provide a
+relative path.
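
A small sketch of that check, assuming the data file from this lesson sits in a data/ folder inside the working directory. The variable names are made up for illustration.

```r
getwd()                                              # where R is currently looking
relative_path <- file.path("data", "gapminder_data.csv")
absolute_path <- file.path(getwd(), relative_path)   # the same location, spelled out in full

file.exists(relative_path)   # TRUE only if the file really is under the working directory
```

If file.exists() returns FALSE, either the working directory is not the one you expected or the file was saved somewhere else.
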
To be able to read R help files for functions and special
+operators.
+
To be able to use CRAN task views to identify packages to solve a
+problem.
+
To be able to seek help from your peers.
+
+
+
+
+
+
Reading Help Files
+
+
R, and every package, provide help files for functions. The general
+syntax to search for help on any function, “function_name”, that is
+in a package loaded into your namespace (your
+interactive R session) is:
+
+
R
+
+
+?function_name
+help(function_name)
+
+
For example take a look at the help file for
+write.table(), we will be using a similar function in an
+upcoming episode.
+
+
R
+
+
+?write.table()
+
+
This will load up a help page in RStudio (or as plain text in R
+itself).
+
Each help page is broken down into sections:
+
Description: An extended description of what the function does.
+
Usage: The arguments of the function and their default values (which
+can be changed).
+
Arguments: An explanation of the data each argument is
+expecting.
+
Details: Any important details to be aware of.
+
Value: The data the function returns.
+
See Also: Any related functions you might find useful.
+
Examples: Some examples for how to use the function.
+
Different functions might have different sections, but these are the
+main ones you should be aware of.
+
Notice how related functions might call for the same help file:
+
+
R
+
+
+?write.table()
+?write.csv()
+
+
This is because these functions have very similar applicability and
+often share the same arguments as inputs to the function, so package
+authors often choose to document them together in a single help
+file.
+
+
+
+
+
+
Tip: Running Examples
+
+
+
From within the function help page, you can highlight code in the
+Examples and hit Ctrl+Return to run it in the RStudio
+console. This gives you a quick way to get a feel for how a function
+works.
+
+
+
+
+
+
+
+
+
Tip: Reading Help Files
+
+
+
One of the most daunting aspects of R is the large number of
+functions available. It would be prohibitive, if not impossible, to
+remember the correct usage for every function you use. Luckily, using
+the help files means you don’t have to remember that!
+
+
+
+
Special Operators
+
+
To seek help on special operators, use quotes or backticks:
+
+
R
+
+
+?"<-"
+?`<-`
+
+
Getting Help with Packages
+
+
Many packages come with “vignettes”: tutorials and extended example
+documentation. Without any arguments, vignette() will list
+all vignettes for all installed packages;
+vignette(package="package-name") will list all available
+vignettes for package-name, and
+vignette("vignette-name") will open the specified
+vignette.
+
If a package doesn’t have any vignettes, you can usually find help by
+typing help("package-name").
+
RStudio also has a set of excellent cheatsheets for
+many packages.
+
When You Remember Part of the Function Name
+
+
If you’re not sure what package a function is in or how it’s
+specifically spelled, you can do a fuzzy search:
+
+
R
+
+
+??function_name
+
+
A fuzzy search is when you search for an approximate string match.
+For example, you may remember that the function to set your working
+directory includes “set” in its name. You can do a fuzzy search to help
+you identify the function:
+
+
R
+
+
+??set
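
A related base R tool, not covered above, is apropos(), which lists the objects currently available in your session whose names match a pattern. The sketch below looks for names that start with “set”.

```r
matches <- apropos("^set")   # "^" anchors the pattern to the start of the name
"setwd" %in% matches         # TRUE: the function we were trying to remember is in the list
```

Unlike ??, apropos() only looks at object names, not at the text of help pages, so it returns a shorter list.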
+
+
When You Have No Idea Where to Begin
+
+
If you don’t know what function or package you need to use, CRAN Task Views is a
+specially maintained list of packages grouped into fields. This can be a
+good starting point.
+
When Your Code Doesn’t Work: Seeking Help from Your Peers
+
+
If you’re having trouble using a function, 9 times out of 10, the
+question you have has already been answered on Stack Overflow. You can search
+using the [r] tag. Please make sure to see their page on how to ask a good
+question.
+
If you can’t find the answer, there are a few useful functions to
+help you ask your peers:
+
+
R
+
+
+?dput
+
+
dput() will dump the data you’re working with into a format that can be
+copied and pasted by others into their own R session.
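
A minimal sketch of what that looks like, using a made-up named vector called scores. The second half checks that running the printed code really does rebuild the same object.

```r
scores <- c(alice = 90, bob = 72)   # made-up example data
dput(scores)                        # prints R code that recreates the object

# Round trip: evaluating the printed code gives back an identical object
code_text <- paste(capture.output(dput(scores)), collapse = "")
rebuilt <- eval(parse(text = code_text))
identical(rebuilt, scores)
```
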
+
+
R
+
+
+sessionInfo()
+
+
+
OUTPUT
+
+
R version 4.3.1 (2023-06-16)
+Platform: x86_64-pc-linux-gnu (64-bit)
+Running under: Ubuntu 22.04.3 LTS
+
+Matrix products: default
+BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
+LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
+
+locale:
+ [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
+ [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
+ [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
+[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
+
+time zone: UTC
+tzcode source: system (glibc)
+
+attached base packages:
+[1] stats graphics grDevices utils datasets methods base
+
+loaded via a namespace (and not attached):
+[1] compiler_4.3.1 tools_4.3.1 rstudioapi_0.15.0 yaml_2.3.7
+[5] knitr_1.43 xfun_0.40 renv_1.0.3 evaluate_0.21
+
+
sessionInfo() will print out your current version of R, as well as any packages you
+have loaded. This can be useful for others to help reproduce and debug
+your issue.
+
+
+
+
+
+
Challenge 1
+
+
+
Look at the help page for the c function. What kind of
+vector do you expect will be created if you evaluate the following:
+
+
R
+
+
+c(1, 2, 3)
+c('d', 'e', 'f')
+c(1, 2, 'f')
+
+
+
+
+
+
+
+
+
+
The c() function creates a vector, in which all elements
+are of the same type. In the first case, the elements are numeric, in
+the second, they are characters, and in the third they are also
+characters: the numeric values are “coerced” to be characters.
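
One way to confirm this is to ask R for the class of each vector, as in the sketch below. The name mixed is chosen only for this example.

```r
class(c(1, 2, 3))          # "numeric"
class(c('d', 'e', 'f'))    # "character"

mixed <- c(1, 2, 'f')
class(mixed)               # "character"
mixed                      # the numbers are now the strings "1" and "2"
```
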
+
+
+
+
+
+
+
+
+
+
Challenge 2
+
+
+
Look at the help for the paste function. You will need
+to use it later. What’s the difference between the sep and
+collapse arguments?
+
+
+
+
+
+
+
+
+
To look at the help for the paste() function, use:
+
+
R
+
+
+help("paste")
+?paste
+
+
The difference between sep and collapse is
+a little tricky. The paste function accepts any number of
+arguments, each of which can be a vector of any length. The
+sep argument specifies the string used between concatenated
+terms — by default, a space. The result is a vector as long as the
+longest argument supplied to paste. In contrast,
+collapse specifies that after concatenation the elements
+are collapsed together using the given separator, the result
+being a single string.
+
It is important to call the arguments explicitly by typing out the
+argument name, e.g. sep = ",", so the function understands to
+use the “,” as a separator and not a term to concatenate. For example:
+
+
R
+
+
+paste(c("a","b"), "c")
+
+
+
OUTPUT
+
+
[1] "a c" "b c"
+
+
+
R
+
+
+paste(c("a","b"), "c", ",")
+
+
+
OUTPUT
+
+
[1] "a c ," "b c ,"
+
+
+
R
+
+
+paste(c("a","b"), "c", sep = ",")
+
+
+
OUTPUT
+
+
[1] "a,c" "b,c"
+
+
+
R
+
+
+paste(c("a","b"), "c", collapse = "|")
+
+
+
OUTPUT
+
+
[1] "a c|b c"
+
+
+
R
+
+
+paste(c("a","b"), "c", sep = ",", collapse = "|")
+
+
+
OUTPUT
+
+
[1] "a,c|b,c"
+
+
(For more information, scroll to the bottom of the
+?paste help page and look at the examples, or try
+example('paste').)
+
+
+
+
+
+
+
+
+
+
Challenge 3
+
+
+
Use help to find a function (and its associated parameters) that you
+could use to load data from a tabular file in which columns are
+delimited with “\t” (tab) and the decimal point is a “.” (period). This
+check for decimal separator is important, especially if you are working
+with international colleagues, because different countries have
+different conventions for the decimal point (i.e. comma vs period).
+Hint: use ??"read table" to look up functions related to
+reading in tabular data.
+
+
+
+
+
+
+
+
+
The standard R function for reading tab-delimited files with a period
+decimal separator is read.delim(). You can also do this with
+read.table(file, sep="\t") (the period is the
+default decimal separator for read.table()),
+although you may have to change the comment.char argument
+as well if your data file contains hash (#) characters.
To begin exploring data frames, and understand how they are related
+to vectors and lists.
+
To be able to ask questions from R about the type, class, and
+structure of an object.
+
To understand the information of the attributes “names”, “class”,
+and “dim”.
+
+
+
+
+
+
One of R’s most powerful features is its ability to deal with tabular
+data - such as you may already have in a spreadsheet or a CSV file.
+Let’s start by making a toy dataset in your data/
+directory, called feline-data.csv:
We can now save cats as a CSV file. It is good practice
+to call the argument names explicitly so the function knows what default
+values you are changing. Here we are setting
+row.names = FALSE. Recall you can use
+?write.csv to pull up the help file to check out the
+argument names and their default values.
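The two steps described here (building the toy dataset and saving it) can be sketched as follows. The column values are taken from the output shown later in this lesson. Writing to a temporary directory is an assumption made so that the sketch runs anywhere; the lesson itself saves to data/feline-data.csv.

```r
# Build the toy dataset: one vector per column
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

# Assumption: a temporary file stands in for data/feline-data.csv
csv_path <- file.path(tempdir(), "feline-data.csv")

# row.names = FALSE stops R from writing the row numbers as an extra column
write.csv(x = cats, file = csv_path, row.names = FALSE)

# Read the file back to confirm the round trip
cats_from_file <- read.csv(file = csv_path)
cats_from_file
```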
The read.table function is used for reading in tabular
+data stored in a text file where the columns of data are separated by
+punctuation characters, as in CSV files (csv = comma-separated values).
+Tabs and commas are the most common punctuation characters used to
+separate or delimit data points in such files. For convenience R provides
+two other versions of read.table:
+read.csv for files where the data are separated with commas
+and read.delim for files where the data are separated with
+tabs. Of these three functions read.csv is the most
+commonly used. If needed, it is possible to override the default
+delimiter for both read.csv and
+read.delim.
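As a small illustration of how the three functions relate, the sketch below reads the same two-row table three ways. The inline strings and variable names are made up for this example; the text argument lets us skip creating files.

```r
# Hypothetical inline data, comma- and tab-separated versions of the same table
csv_text <- "coat,weight\ncalico,2.1\nblack,5.0"
tsv_text <- "coat\tweight\ncalico\t2.1\nblack\t5.0"

# read.csv is read.table with comma-friendly defaults
via_csv   <- read.csv(text = csv_text)
via_table <- read.table(text = csv_text, header = TRUE, sep = ",")

# read.delim is the tab-separated counterpart
via_delim <- read.delim(text = tsv_text)

# All three should describe the same table
all.equal(via_csv, via_table)
all.equal(via_csv, via_delim)
```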
+
+
+
+
+
+
Check your data for factors
+
+
+
The default way R handles textual data changed with R 4.0.0. Before
+that, R automatically converted text data into a format
+called “factors”. But there is a simpler format called
+“character”. We will hear about factors later, and what to use them for.
+For now, remember that in most cases they are not needed and only
+complicate your life, which is why newer R versions read in text as
+“character”. Check now whether your version of R has automatically created
+factors, and convert them to “character” format:
+
1. Check the data types of your input by typing
+str(cats)
+
+
2. In the output, look at the type codes after the colons: If
+you see only “num” and “chr”, you can continue with the lesson and skip
+this box. If you find “Factor”, continue to step 3.
+
3. Prevent R from automatically creating “factor” data. That can be
+done by the following code:
+options(stringsAsFactors = FALSE). Then, re-read the cats
+table for the change to take effect.
+
4. You must set this option every time you restart R. So that you do not
+forget, include it in your analysis script before you read in any data,
+for example in one of the first lines.
+
5. In R 4.0.0 and later, text data is no longer converted
+to factors by default. You can install this or a newer version to avoid
+this problem. If you are working on an institute or company computer,
+ask your administrator to do it.
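Besides the global option, read.csv also accepts a stringsAsFactors argument for a single call. The sketch below shows both behaviours side by side and one way to convert an existing factor column back to character. The inline data and variable names are assumptions for illustration.

```r
# Hypothetical inline copy of the cats table
csv_text <- "coat,weight\ncalico,2.1\nblack,5.0\ntabby,3.2"

# Text columns kept as character (the default from R 4.0.0 onwards)
cats_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Text columns converted to factors (the default before R 4.0.0)
cats_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Compare the class of every column
sapply(cats_chr, class)
sapply(cats_fct, class)

# Convert a factor column back to character after the fact
cats_fixed <- cats_fct
cats_fixed$coat <- as.character(cats_fixed$coat)
```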
+
+
+
+
We can begin exploring our dataset right away, pulling out columns by
+specifying them using the $ operator:
+
+
R
+
+
+cats$weight
+
+
+
OUTPUT
+
+
[1] 2.1 5.0 3.2
+
+
+
R
+
+
+cats$coat
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
We can do other operations on the columns:
+
+
R
+
+
+## Say we discovered that the scale weighs two kg light:
+cats$weight + 2
+
+
+
OUTPUT
+
+
[1] 4.1 7.0 5.2
+
+
+
R
+
+
+paste("My cat is", cats$coat)
+
+
+
OUTPUT
+
+
[1] "My cat is calico" "My cat is black" "My cat is tabby"
+
+
But what about
+
+
R
+
+
+cats$weight + cats$coat
+
+
+
ERROR
+
+
Error in cats$weight + cats$coat: non-numeric argument to binary operator
+
+
Understanding what happened here is key to successfully analyzing
+data in R.
+
+
Data Types
+
If you guessed that the last command will return an error because
+2.1 plus "black" is nonsense, you’re right -
+and you already have some intuition for an important concept in
+programming called data types. We can ask what type of data
+something is:
+
+
R
+
+
+typeof(cats$weight)
+
+
+
OUTPUT
+
+
[1] "double"
+
+
There are 5 main types: double, integer,
+complex, logical and character.
+For historic reasons, double is also called
+numeric.
+
+
R
+
+
+typeof(3.14)
+
+
+
OUTPUT
+
+
[1] "double"
+
+
+
R
+
+
+typeof(1L) # The L suffix forces the number to be an integer, since by default R stores numbers as doubles
+
+
+
OUTPUT
+
+
[1] "integer"
+
+
+
R
+
+
+typeof(1+1i)
+
+
+
OUTPUT
+
+
[1] "complex"
+
+
+
R
+
+
+typeof(TRUE)
+
+
+
OUTPUT
+
+
[1] "logical"
+
+
+
R
+
+
+typeof('banana')
+
+
+
OUTPUT
+
+
[1] "character"
+
+
No matter how complicated our analyses become, all data in R is
+interpreted as one of these basic data types. This strictness has some
+really important consequences.
+
A user has added details of another cat. This information is in the
+file data/feline-data_v2.csv.
+
+
R
+
+
+file.show("data/feline-data_v2.csv")
+
+
+
R
+
+
coat,weight,likes_string
+calico,2.1,1
+black,5.0,0
+tabby,3.2,1
+tabby,2.3 or 2.4,1
+
+
Load the new cats data like before, and check what type of data we
+find in the weight column:
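A minimal sketch of that check is shown below. To keep it self-contained, the file contents shown above are passed in through the text argument instead of being read from data/feline-data_v2.csv, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
# Contents of feline-data_v2.csv, copied from the listing above
csv_text <- "coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
tabby,2.3 or 2.4,1"

cats <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# The entry "2.3 or 2.4" cannot be parsed as a number,
# so the whole column is read as text
typeof(cats$weight)
```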
Oh no, our weights aren’t the double type anymore! If we try to do
+the same math we did on them before, we run into trouble:
+
+
R
+
+
+cats$weight + 2
+
+
+
ERROR
+
+
Error in cats$weight + 2: non-numeric argument to binary operator
+
+
What happened? The cats data we are working with is
+something called a data frame. Data frames are one of the most
+common and versatile types of data structures we will work with
+in R. A given column in a data frame cannot be composed of different
+data types. In this case, R cannot read everything in the data frame
+column weight as a double (the entry “2.3 or 2.4” is not a number), so the entire
+column's data type changes to one that is suitable for everything in
+the column.
+
When R reads a csv file, it reads it in as a data frame.
+Thus, when we loaded the cats csv file, it was stored as a
+data frame. We can recognize data frames by the first row that is
+written by the str() function:
Data frames are composed of rows and columns, where each
+column has the same number of rows. Different columns in a data frame
+can be made up of different data types (this is what makes them so
+versatile), but everything in a given column needs to be the same type
+(e.g., vector, factor, or list).
+
Let’s explore more about different data structures and how they
+behave. For now, let’s remove that extra line from our cats data and
+reload it, while we investigate this behavior further:
To better understand this behavior, let’s meet another of the data
+structures: the vector.
+
+
R
+
+
+my_vector <- vector(length = 3)
+my_vector
+
+
+
OUTPUT
+
+
[1] FALSE FALSE FALSE
+
+
A vector in R is essentially an ordered list of things, with the
+special condition that everything in the vector must be the same
+basic data type. If you don’t choose the datatype, it’ll default to
+logical; or, you can declare an empty vector of whatever
+type you like.
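For instance, an empty character vector can be declared through the mode argument of vector() and then inspected with str(). The variable name is an assumption for this sketch.

```r
# Declare an empty vector of type character with three elements
another_vector <- vector(mode = "character", length = 3)
another_vector

# Inspect its structure
str(another_vector)
```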
The somewhat cryptic output from this command indicates the basic
+data type found in this vector - in this case chr,
+character; an indication of the number of things in the vector -
+actually, the indexes of the vector, in this case [1:3];
+and a few examples of what’s actually in the vector - in this case empty
+character strings. If we similarly do
+
+
R
+
+
+str(cats$weight)
+
+
+
OUTPUT
+
+
num [1:3] 2.1 5 3.2
+
+
we see that cats$weight is a vector, too - the
+columns of data we load into R data.frames are all vectors, and
+that’s the root of why R forces everything in a column to be the same
+basic data type.
+
+
+
+
+
+
Discussion 1
+
+
+
Why is R so opinionated about what we put in our columns of data? How
+does this help us?
+
+
+
+
+
+
By keeping everything in a column the same, we allow ourselves to
+make simple assumptions about our data; if you can interpret one entry
+in the column as a number, then you can interpret all of them
+as numbers, so we don’t have to check every time. This consistency is
+what people mean when they talk about clean data; in the long
+run, strict consistency goes a long way to making our lives easier in
+R.
+
+
+
+
+
+
+
+
+
Coercion by combining vectors
+
You can also make vectors with explicit contents with the combine
+function:
+
+
R
+
+
+combine_vector <- c(2, 6, 3)
+combine_vector
+
+
+
OUTPUT
+
+
[1] 2 6 3
+
+
Given what we’ve learned so far, what do you think the following will
+produce?
+
+
R
+
+
+quiz_vector <- c(2, 6, '3')
+
+
This is something called type coercion, and it is the source
+of many surprises and the reason why we need to be aware of the basic
+data types and how R will interpret them. When R encounters a mix of
+types (here double and character) to be combined into a single vector,
+it will force them all to be the same type. Consider:
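The following sketch shows two such mixes. The second variable name is an assumption chosen for this example.

```r
# Double and character mixed: everything becomes character
quiz_vector <- c(2, 6, '3')
quiz_vector
typeof(quiz_vector)

# Double and logical mixed: TRUE becomes 1, and the result is double
another_coercion_vector <- c(0, TRUE)
another_coercion_vector
typeof(another_coercion_vector)
```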
The coercion rules go: logical ->
+integer -> double (“numeric”)
+-> complex -> character, where -> can
+be read as are transformed into. For example, combining
+logical and character transforms the result to
+character:
+
+
R
+
+
+c('a', TRUE)
+
+
+
OUTPUT
+
+
[1] "a" "TRUE"
+
+
A quick way to recognize character vectors is by the
+quotes that enclose them when they are printed.
+
You can try to force coercion against this flow using the
+as. functions:
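A short sketch of coercing against the flow is given below. The variable names are assumptions for illustration.

```r
character_vector_example <- c("0", "2", "4")

# character -> double: the digits are parsed as numbers
character_coerced_to_double <- as.double(character_vector_example)
character_coerced_to_double

# double -> logical: 0 becomes FALSE, every other number becomes TRUE
double_coerced_to_logical <- as.logical(character_coerced_to_double)
double_coerced_to_logical

# character -> logical directly: strings such as "0" or "2" are not
# recognised as TRUE/FALSE spellings, so the result is NA
character_coerced_to_logical <- as.logical(character_vector_example)
character_coerced_to_logical
```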
As you can see, some surprising things can happen when R forces one
+basic data type into another! Nitty-gritty of type coercion aside, the
+point is: if your data doesn’t look like what you thought it was going
+to look like, type coercion may well be to blame; make sure everything
+is the same type in your vectors and your columns of data.frames, or you
+will get nasty surprises!
+
But coercion can also be very useful! For example, in our
+cats data likes_string is numeric, but we know
+that the 1s and 0s actually represent TRUE and
+FALSE (a common way of representing them). We should use
+the logical datatype here, which has two states:
+TRUE or FALSE, which is exactly what our data
+represents. We can ‘coerce’ this column to be logical by
+using the as.logical function:
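A minimal sketch of that conversion, using an inline copy of the three-row cats data so it runs on its own:

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

# Before: the column is stored as numbers
typeof(cats$likes_string)

# Coerce 1/0 to TRUE/FALSE and store the result back in the column
cats$likes_string <- as.logical(cats$likes_string)
cats$likes_string
typeof(cats$likes_string)
```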
An important part of every data analysis is cleaning the input data.
+If you know that the input data is all of the same format
+(e.g. numbers), your analysis is much easier! Clean the cat data set
+from the chapter about type coercion.
+
+
Copy the code template
+
Create a new script in RStudio and copy and paste the following code.
+Then move on to the tasks below, which help you to fill in the gaps
+(______).
+
# Read data
+cats <- read.csv("data/feline-data_v2.csv")
+
+# 1. Print the data
+_____
+
+# 2. Show an overview of the table with all data types
+_____(cats)
+
+# 3. The "weight" column has the incorrect data type __________.
+# The correct data type is: ____________.
+
+# 4. Correct the 4th weight data point with the mean of the two given values
+cats$weight[4] <- 2.35
+# print the data again to see the effect
+cats
+
+# 5. Convert the weight to the right data type
+cats$weight <- ______________(cats$weight)
+
+# Calculate the mean to test yourself
+mean(cats$weight)
+
+# If you see the correct mean value (and not NA), you did the exercise
+# correctly!
+
+
+
Instructions for the tasks
+
+
1. Print the data
+
Execute the first statement (read.csv(...)). Then print
+the data to the console.
+
+
+
+
+
+
+
+
+
+
+
Show the content of any variable by typing its name.
+
+
Solution to Challenge 1.1
+
Two correct solutions:
+
cats
+print(cats)
+
+
+
+
+
+
+
+
+
+
+
2. Overview of the data types
+
+
+
The data type of your data is as important as the data itself. Use a
+function we saw earlier to print out the data types of all columns of
+the cats table.
+
+
+
+
+
+
+
+
+
In the chapter “Data types” we saw two functions that can show data
+types. One printed just a single word, the data type name. The other
+printed a short form of the data type, and the first few values. We need
+the second here.
+
+
+
+
+
+
+
+
+
+
Challenge 1 (continued)
+
+
+
+
Solution to Challenge 1.2
+
str(cats)
+
+
+
3. Which data type do we need?
+
The shown data type is not the right one for this data (weight of a
+cat). Which data type do we need?
+
Why did the read.csv() function not choose the correct
+data type?
+
Fill in the gap in the comment with the correct data type for cat
+weight!
+
+
+
+
+
+
+
+
+
+
Scroll up to the section about the type
+hierarchy to review the available data types
+
+
+
+
+
+
+
+
+
+
Weight is expressed on a continuous scale (real numbers). The R data
+type for this is “double” (also known as “numeric”).
+
The fourth row has the value “2.3 or 2.4”. That is not one number but
+two, joined by an English word. Therefore, the “character” data type is
+chosen. The whole column is now text, because all values in the same
+column have to be the same data type.
+
+
+
+
+
+
+
+
+
+
4. Correct the problematic value
+
+
+
The code to assign a new weight value to the problematic fourth row
+is given. Think first and then execute it: What will be the data type
+after assigning a number like in this example? You can check the data
+type after executing to see if you were right.
+
+
+
+
+
+
+
+
+
Revisit the hierarchy of data types when two different data types are
+combined.
+
+
+
+
+
+
+
+
+
+
Challenge 1 (continued)
+
+
+
+
Solution to Challenge 1.4
+
The data type of the column “weight” is “character”. The assigned
+data type is “double”. Combining two data types yields the data type
+that is higher in the following hierarchy:
+
logical < integer < double < complex < character
+
Therefore, the column is still of type character! We need to manually
+convert it to “double”.
+
+
+
5. Convert the column “weight” to the correct data type
+
Cat weights are numbers, but the column does not have this data type
+yet. Coerce the column to floating point numbers.
+
+
+
+
+
+
+
+
+
+
The functions to convert data types start with as.. You
+can look for the function further up in the manuscript or use the
+RStudio auto-complete function: Type “as.” and then press
+the TAB key.
+
+
+
+
+
+
+
+
+
+
Challenge 1 (continued)
+
+
+
+
Solution to Challenge 1.5
+
There are two functions that are synonymous for historic reasons:
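The two functions are as.double() and as.numeric(). The sketch below applies both to a stand-alone character vector holding the corrected weights; the vector name is an assumption, and in the exercise itself the conversion would be applied to cats$weight.

```r
# The weight column after step 4, still stored as text
weight_text <- c("2.1", "5.0", "3.2", "2.35")

# Both functions give the same result
weight_double  <- as.double(weight_text)
weight_numeric <- as.numeric(weight_text)
identical(weight_double, weight_numeric)

# The mean can now be calculated
mean(weight_numeric)
```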
To change a single element, use the bracket on the other side of the
+arrow:
+
+
R
+
+
+sequence_example[1] <- 30
+sequence_example
+
+
+
OUTPUT
+
+
[1] 30 21 22 23 24 25
+
+
+
+
+
+
+
Challenge 2
+
+
+
Start by making a vector with the numbers 1 through 26. Then,
+multiply the vector by 2.
+
+
+
+
+
+
+
+
+
+
R
+
+
+x <- 1:26
+x <- x * 2
+
+
+
+
+
+
+
+
Lists
+
Another data structure you’ll want in your bag of tricks is the
+list. A list is simpler in some ways than the other types,
+because you can put anything you want in it. Remember everything in
+the vector must be of the same basic data type, but a list can have
+different data types:
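A list matching the structure printed below can be built like this:

```r
# Four elements of four different types in one list
list_example <- list(1, "a", TRUE, 1+4i)
list_example
```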
When printing the object structure with str(), we see
+the data types of all elements:
+
+
R
+
+
+str(list_example)
+
+
+
OUTPUT
+
+
List of 4
+ $ : num 1
+ $ : chr "a"
+ $ : logi TRUE
+ $ : cplx 1+4i
+
+
What is the use of lists? They can organize data of different
+types. For example, you can organize different tables that
+belong together, similar to spreadsheets in Excel. But there are many
+other uses, too.
+
We will see another example that will maybe surprise you in the next
+chapter.
+
To retrieve one of the elements of a list, use the double
+bracket:
+
+
R
+
+
+list_example[[2]]
+
+
+
OUTPUT
+
+
[1] "a"
+
+
The elements of lists can also have names. They are
+given by prepending them to the values, separated by an equals
+sign:
+
+
R
+
+
+another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE)
+another_list
This results in a named list. The names give our object a new
+capability: we can access single elements in an additional
+way!
+
+
R
+
+
+another_list$title
+
+
+
OUTPUT
+
+
[1] "Numbers"
+
+
+
Names
+
+
With names, we can give meaning to elements. It is the first time
+that we have not only the data, but also information that explains
+it. It is metadata that can be stuck to the object
+like a label. In R, this is called an attribute. Some
+attributes enable us to do more with our object, for example, like here,
+accessing an element by a self-defined name.
+
+
Accessing vectors and lists by name
+
We have already seen how to generate a named list. The way to
+generate a named vector is very similar. You have seen this function
+before:
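A sketch of a named vector built with c() follows. The names and the price 5.64 match the output shown in this section; the other two prices are assumptions for illustration.

```r
# name = value pairs inside c() produce a named vector
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)
pizza_price
```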
The way to retrieve elements is different, though:
+
+
R
+
+
+pizza_price["pizzasubito"]
+
+
+
OUTPUT
+
+
pizzasubito
+5.64
+
+
The approach used for the list does not work:
+
+
R
+
+
+pizza_price$pizzafresh
+
+
+
ERROR
+
+
Error in pizza_price$pizzafresh: $ operator is invalid for atomic vectors
+
+
It will pay off if you remember this error message, because you will meet it
+in your own analyses. It means that you have just tried accessing an
+element as if it were in a list, but it is actually in a vector.
+
+
+
Accessing and changing names
+
If you are only interested in the names, use the names()
+function:
+
+
R
+
+
+names(pizza_price)
+
+
+
OUTPUT
+
+
[1] "pizzasubito" "pizzafresh" "callapizza"
+
+
We have seen how to access and change single elements of a vector.
+The same is possible for names:
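The sketch below reads one name and replaces another by indexing into names(). The prices other than 5.64 and the replacement name are assumptions for illustration.

```r
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)

# Access a single name by position
names(pizza_price)[3]

# Change that name by assigning to the same position
names(pizza_price)[3] <- "call-a-pizza"
names(pizza_price)
```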
What is the data type of the names of pizza_price? You
+can find out using the str() or typeof()
+functions.
+
+
+
+
+
+
+
+
+
You get the names of an object by wrapping the object name inside
+names(...). Similarly, you get the data type of the names
+by again wrapping the whole code in typeof(...):
+
typeof(names(pizza_price))
+
Alternatively, use a new variable if this is easier for you to
+read:
+
n <- names(pizza_price)
+typeof(n)
+
+
+
+
+
+
+
+
+
+
Challenge 4
+
+
+
Instead of just changing some of the names a vector/list already has,
+you can also set all names of an object by writing code like (replace
+ALL CAPS text):
+
names(OBJECT)<-CHARACTER_VECTOR
+
Create a vector that gives the number for each letter in the
+alphabet!
+
Generate a vector called letter_no with the sequence of
+numbers from 1 to 26!
+
R has a built-in object called LETTERS. It is a
+character vector of 26 letters, from A to Z. Set the names of the number sequence
+to these 26 letters.
+
Test yourself by calling letter_no["B"], which should
+give you the number 2!
+
+
+
+
+
+
+
+
+
letter_no <- 1:26 # or seq(1, 26)
+names(letter_no) <- LETTERS
+letter_no["B"]
+
+
+
+
+
+
Data frames
+
+
We met data frames at the very beginning of this lesson; they
+represent a table of data. We didn’t go into much further detail with
+our example cat data frame:
We can now understand something a bit surprising in our data.frame;
+what happens if we run:
+
+
R
+
+
+typeof(cats)
+
+
+
OUTPUT
+
+
[1] "list"
+
+
We see that data.frames look like lists ‘under the hood’. Think again
+about what we learned lists can be used for:
+
+
Lists organize data of different types
+
+
Columns of a data frame are vectors of different types, that are
+organized by belonging to the same table.
+
A data.frame is really a list of vectors. It is a special list in
+which all the vectors must have the same length.
+
How is this “special”-ness written into the object, so that R does
+not treat it like any other list, but as a table?
+
+
R
+
+
+class(cats)
+
+
+
OUTPUT
+
+
[1] "data.frame"
+
+
A class, just like names, is an attribute attached
+to the object. It tells us what this object means for humans.
+
You might wonder: Why do we need another
+what-type-of-object-is-this function? We already have
+typeof(). That function tells us how the object is
+constructed in the computer. The class is
+the meaning of the object for humans. Consequently,
+what typeof() returns is fixed in R (mainly the
+five data types), whereas the output of class() is
+diverse and extendable by R packages.
+
In our cats example, we have a character, a double and a
+logical variable. As we have seen already, each column of a data.frame is
+a vector.
+
+
R
+
+
+cats$coat
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
+
R
+
+
+cats[,1]
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
+
R
+
+
+typeof(cats[,1])
+
+
+
OUTPUT
+
+
[1] "character"
+
+
+
R
+
+
+str(cats[,1])
+
+
+
OUTPUT
+
+
chr [1:3] "calico" "black" "tabby"
+
+
Each row is an observation of different variables, itself a
+data.frame, and thus can be composed of elements of different types.
There are several subtly different ways to call variables,
+observations and elements from data.frames:
+
cats[1]
+
cats[[1]]
+
cats$coat
+
cats["coat"]
+
cats[1, 1]
+
cats[, 1]
+
cats[1, ]
+
Try out these examples and explain what is returned by each one.
+
Hint: Use the function typeof() to examine what
+is returned in each case.
+
+
+
+
+
+
+
+
+
+
R
+
+
+cats[1]
+
+
+
OUTPUT
+
+
coat
+1 calico
+2 black
+3 tabby
+
+
We can think of a data frame as a list of vectors. The single bracket
+[1] returns the first slice of the list, as another list.
+In this case it is the first column of the data frame.
+
+
R
+
+
+cats[[1]]
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
The double bracket [[1]] returns the contents of the list
+item. In this case it is the contents of the first column, a
+vector of type character.
+
+
R
+
+
+cats$coat
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
This example uses the $ character to address items by
+name. coat is the first column of the data frame, again a
+vector of type character.
+
+
R
+
+
+cats["coat"]
+
+
+
OUTPUT
+
+
coat
+1 calico
+2 black
+3 tabby
+
+
Here we are using a single bracket ["coat"], replacing the
+index number with the column name. Like example 1, the returned object
+is a list.
+
+
R
+
+
+cats[1, 1]
+
+
+
OUTPUT
+
+
[1] "calico"
+
+
This example uses a single bracket, but this time we provide row and
+column coordinates. The returned object is the value in row 1, column 1.
+The object is a vector of type character.
+
+
R
+
+
+cats[, 1]
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
Like the previous example, we use single brackets and provide row and
+column coordinates. The row coordinate is not specified, so R interprets
+this missing value as all the elements in this column and
+returns them as a vector.
+
+
R
+
+
+cats[1, ]
+
+
+
OUTPUT
+
+
coat weight likes_string
+1 calico 2.1 TRUE
+
+
Again we use the single bracket with row and column coordinates. The
+column coordinate is not specified. The return value is a list
+containing all the values in the first row.
+
+
+
+
+
+
+
+
+
+
Tip: Renaming data frame columns
+
+
+
Data frames have column names, which can be accessed with the
+names() function.
+
+
R
+
+
+names(cats)
+
+
+
OUTPUT
+
+
[1] "coat" "weight" "likes_string"
+
+
If you want to rename the second column of cats, you can
+assign a new name to the second element of names(cats).
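A minimal sketch of such a rename is shown below, using an inline copy of the cats data. The new column name weight_kg is an assumption chosen for illustration.

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(TRUE, FALSE, TRUE))

# Replace only the second column name; the data stay untouched
names(cats)[2] <- "weight_kg"
names(cats)
```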
Because a matrix is a vector with added dimension attributes,
+length gives you the total number of elements in the
+matrix.
+
+
+
+
+
+
+
+
+
+
Challenge 7
+
+
+
Make another matrix, this time containing the numbers 1:50, with 5
+columns and 10 rows. Did the matrix function fill your
+matrix by column, or by row, as its default behaviour? See if you can
+figure out how to change this. (hint: read the documentation for
+matrix!)
+
+
+
+
+
+
+
+
+
+
+
R
+
+
+x <- matrix(1:50, ncol = 5, nrow = 10)
+x <- matrix(1:50, ncol = 5, nrow = 10, byrow = TRUE) # to fill by row
+
+
+
+
+
+
+
+
+
+
+
Challenge 8
+
+
+
Create a list of length two containing a character vector for each of
+the sections in this part of the workshop:
+
Data types
+
Data structures
+
Populate each character vector with the names of the data types and
+data structures we’ve seen so far.
Note: it’s nice to make a list in big writing on the board or taped
+to the wall listing all of these types and structures - leave it up for
+the rest of the workshop to remind people of the importance of these
+basics.
+
+
+
+
+
+
+
+
+
+
Challenge 9
+
+
+
Consider the R output of the matrix below:
+
+
OUTPUT
+
+
[,1] [,2]
+[1,] 4 1
+[2,] 9 5
+[3,] 10 7
+
+
What was the correct command used to write this matrix? Examine each
+command and try to figure out the correct one before typing them. Think
+about what matrices the other commands will produce.
Display basic properties of data frames including size and class of
+the columns, names, and first few rows.
+
+
+
+
+
+
At this point, you’ve seen it all: in the last lesson, we toured all
+the basic data types and data structures in R. Everything you do will be
+a manipulation of those tools. But most of the time, the star of the
+show is the data frame—the table that we created by loading information
+from a csv file. In this lesson, we’ll learn a few more things about
+working with data frames.
+
Adding columns and rows in data frames
+
+
We already learned that the columns of a data frame are vectors, so
+that our data are consistent in type throughout the columns. As such, if
+we want to add a new column, we can start by making a new vector:
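One way to sketch this step is with cbind, using an inline copy of the cats data. The age values are taken from the table printed in this section.

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

# One age per cat: the vector needs one entry for every row
age <- c(2, 3, 5)

# Attach the vector as a new column
cats <- cbind(cats, age)
cats
```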
coat weight likes_string age
+1 calico 2.1 1 2
+2 black 5.0 0 3
+3 tabby 3.2 1 5
+
+
Notice the comma with nothing after it to indicate that we want to
+drop the entire fourth row.
+
Note: we could also remove several rows at once by putting the row
+numbers inside of a vector, for example:
+cats[c(-3,-4), ]
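The following sketch shows negative row indexing on a four-row data frame. The fourth cat and its weight are hypothetical values added for illustration.

```r
cats <- data.frame(coat = c("calico", "black", "tabby", "tortoiseshell"),
                   weight = c(2.1, 5.0, 3.2, 3.3))

# Drop only the fourth row; the empty slot after the comma keeps all columns
without_fourth <- cats[-4, ]

# Drop the third and fourth rows in one step
without_third_and_fourth <- cats[c(-3, -4), ]

# The original data frame is unchanged unless we assign the result back to it
nrow(cats)
```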
+
Removing columns
+
+
We can also remove columns in our data frame. What if we want to
+remove the column “age”? We can remove it in two ways: by column
+number or by column name.
Notice the comma with nothing before it, indicating we want to keep
+all of the rows.
+
Alternatively, we can drop the column by using its name and the
+%in% operator. The %in% operator goes through
+each element of its left argument, in this case the names of
+cats, and asks, “Does this element occur in the second
+argument?”
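A minimal sketch of both approaches, assuming an inline copy of the cats data with the extra age column:

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1),
                   age = c(2, 3, 5))

# By column number: a negative index drops the fourth column
by_number <- cats[, -4]

# By name: %in% gives one TRUE/FALSE per column name
drop <- names(cats) %in% c("age")
drop

# Keep every column whose entry in drop is FALSE
by_name <- cats[, !drop]
names(by_name)
```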
The key to remember when adding data to a data frame is that
+columns are vectors and rows are lists. We can also glue two
+data frames together with rbind:
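As a sketch, the inline copy of the cats data below is stacked on top of itself. rbind expects the two data frames to have the same columns.

```r
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

# Append the rows of the second data frame below those of the first
cats <- rbind(cats, cats)
cats
```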
You can create a new data frame right from within R with the
+following syntax:
+
+
R
+
+
+df <- data.frame(id = c("a", "b", "c"),
+                 x = 1:3,
+                 y = c(TRUE, TRUE, FALSE))
+
+
Make a data frame that holds the following information for
+yourself:
+
first name
+
last name
+
lucky number
+
Then use rbind to add an entry for the people sitting
+beside you. Finally, use cbind to add a column with each
+person’s answer to the question, “Is it time for coffee break?”
So far, you have seen the basics of manipulating data frames with our
+cat data; now let’s use those skills to digest a more realistic dataset.
+Let’s read in the gapminder dataset that we downloaded
+previously:
+
+
R
+
+
+gapminder <- read.csv("data/gapminder_data.csv")
+
+
+
+
+
+
+
Miscellaneous Tips
+
+
+
Another type of file you might encounter is the tab-separated value
+file (.tsv). To specify a tab as a separator, use sep = "\t" or
+read.delim().
+
Files can also be downloaded directly from the Internet into a
+local folder of your choice onto your computer using the
+download.file function. The read.csv function
+can then be executed to read the downloaded file from the download
+location, for example,
Alternatively, you can also read in files directly into R from the
+Internet by replacing the file paths with a web address in
+read.csv. One should note that in doing this no local copy
+of the csv file is first saved onto your computer. For example,
You can read directly from Excel spreadsheets without converting
+them to plain text first by using the readxl
+package.
+
The argument “stringsAsFactors” tells R whether to
+read strings as factors or as character strings. In R 4.0
+and later, all strings are read in as characters by default, but in
+earlier versions of R, strings are read in as factors by default. For
+more information, see the call-out in the
+previous episode.
+
+
+
+
Let’s investigate gapminder a bit; the first thing we should always
+do is check out what the data looks like with str:
+
+
R
+
+
+str(gapminder)
+
+
+
OUTPUT
+
+
'data.frame': 1704 obs. of 6 variables:
+ $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp : num 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num 779 821 853 836 740 ...
+
+
An additional method for examining the structure of gapminder is to
+use the summary function. This function can be used on
+various objects in R. For data frames, summary yields a
+numeric, tabular, or descriptive summary of each column. Numeric or
+integer columns are described by the descriptive statistics (quartiles
+and mean), and character columns by their length, class, and mode.
+
+
R
+
+
+summary(gapminder)
+
+
+
OUTPUT
+
+
country year pop continent
+ Length:1704 Min. :1952 Min. :6.001e+04 Length:1704
+ Class :character 1st Qu.:1966 1st Qu.:2.794e+06 Class :character
+ Mode :character Median :1980 Median :7.024e+06 Mode :character
+ Mean :1980 Mean :2.960e+07
+ 3rd Qu.:1993 3rd Qu.:1.959e+07
+ Max. :2007 Max. :1.319e+09
+ lifeExp gdpPercap
+ Min. :23.60 Min. : 241.2
+ 1st Qu.:48.20 1st Qu.: 1202.1
+ Median :60.71 Median : 3531.8
+ Mean :59.47 Mean : 7215.3
+ 3rd Qu.:70.85 3rd Qu.: 9325.5
+ Max. :82.60 Max. :113523.1
+
+
Along with the str and summary functions,
+we can examine individual columns of the data frame with our
+typeof function:
We can also interrogate the data frame for information about its
+dimensions; remembering that str(gapminder) said there were
+1704 observations of 6 variables in gapminder, what do you think the
+following will produce, and why?
+
+
R
+
+
+length(gapminder)
+
+
+
OUTPUT
+
+
[1] 6
+
+
A fair guess would have been to say that the length of a data frame
+would be the number of rows it has (1704), but this is not the case;
+remember, a data frame is a list of vectors and factors:
+
+
R
+
+
+typeof(gapminder)
+
+
+
OUTPUT
+
+
[1] "list"
+
+
When length gave us 6, it’s because gapminder is built
+out of a list of 6 columns. To get the number of rows and columns in our
+dataset, try:
+
+
R
+
+
+nrow(gapminder)
+
+
+
OUTPUT
+
+
[1] 1704
+
+
+
R
+
+
+ncol(gapminder)
+
+
+
OUTPUT
+
+
[1] 6
+
+
Or, both at once:
+
+
R
+
+
+dim(gapminder)
+
+
+
OUTPUT
+
+
[1] 1704 6
+
+
We’ll also likely want to know what the titles of all the columns
+are, so we can ask for them later:
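The function for this is colnames(). Because the sketch below should run without the gapminder file, it uses a one-row stand-in data frame (a hypothetical object named gapminder_mini) whose columns and values are copied from the first row of the output of head(gapminder) in this lesson.

```r
# One-row stand-in with the same six columns as gapminder
gapminder_mini <- data.frame(country = "Afghanistan",
                             year = 1952L,
                             pop = 8425333,
                             continent = "Asia",
                             lifeExp = 28.801,
                             gdpPercap = 779.4453)

# Ask for the column titles
colnames(gapminder_mini)
```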
At this stage, it’s important to ask ourselves if the structure R is
+reporting matches our intuition or expectations; do the basic data types
+reported for each column make sense? If not, we need to sort any
+problems out now before they turn into bad surprises down the road,
+using what we’ve learned about how R interprets data, and the importance
+of strict consistency in how we record our data.
+
Once we’re happy that the data types and structures seem reasonable,
+it’s time to start digging into our data proper. Check out the first few
+lines:
+
+
R
+
+
+head(gapminder)
+
+
+
OUTPUT
+
+
country year pop continent lifeExp gdpPercap
+1 Afghanistan 1952 8425333 Asia 28.801 779.4453
+2 Afghanistan 1957 9240934 Asia 30.332 820.8530
+3 Afghanistan 1962 10267083 Asia 31.997 853.1007
+4 Afghanistan 1967 11537966 Asia 34.020 836.1971
+5 Afghanistan 1972 13079460 Asia 36.088 739.9811
+6 Afghanistan 1977 14880372 Asia 38.438 786.1134
+
+
+
+
+
+
+
Challenge 2
+
+
+
It’s good practice to also check the last few lines of your data and
+some in the middle. How would you do this?
+
Searching for ones specifically in the middle isn’t too hard, but we
+could ask for a few lines at random. How would you code this?
+
+
+
+
+
+
+
+
+
Checking the last few lines is relatively simple, as R already has a
+function for this:
+
+
R
+
+
+tail(gapminder)
+tail(gapminder, n = 15)
+
+
What about a few arbitrary rows just in case something is odd in the
+middle?
+
+
Tip: There are several ways to achieve this.
+
The solution here presents one form of using nested functions, i.e. a
+function passed as an argument to another function. This might sound
+like a new concept, but you are already using it! Remember
+my_dataframe[rows, cols] will print to screen your data frame with the
+number of rows and columns you asked for (although you might have asked
+for a range or named columns for example). How would you get the last
+row if you don’t know how many rows your data frame has? R has a
+function for this. What about getting a (pseudorandom) sample? R also
+has a function for this.
+
+
R
+
+
+gapminder[sample(nrow(gapminder), 5), ]
+
+
+
+
+
+
+
To make sure our analysis is reproducible, we should put the code
+into a script file so we can come back to it later.
+
+
+
+
+
+
Challenge 3
+
+
+
Go to File -> New File -> R Script, and write an R script to
+load in the gapminder dataset. Put it in the scripts/
+directory and add it to version control.
+
Run the script using the source function, using the file
+path as its argument (or by pressing the “source” button in
+RStudio).
+
+
+
+
+
+
+
+
+
The source function can also be used to run one script from within
+another. Suppose you need to load the same type of file over and
+over again, each time specifying the same arguments to fit the
+needs of your file. Instead of writing the necessary arguments again and
+again, you can write them once and save them as a script. Then, you
+can use source("Your_Script_containing_the_load_function")
+in a new script to use the function of that script without writing
+everything again. Check out ?source to find out more.
To run the script and load the data into the gapminder
+variable:
+
+
R
+
+
+source(file = "scripts/load-gapminder.R")
+
+
+
+
+
+
+
+
+
+
+
Challenge 4
+
+
+
Read the output of str(gapminder) again; this time, use
+what you’ve learned about lists and vectors, as well as the output of
+functions like colnames and dim to explain
+what everything that str prints out for gapminder means. If
+there are any parts you can’t interpret, discuss with your
+neighbors!
+
+
+
+
+
+
+
+
+
The object gapminder is a data frame with the following columns:
+
+country and continent are character
+strings.
+
+year is an integer vector.
+
+pop, lifeExp, and gdpPercap
+are numeric vectors.
+
+
+
+
+
+
+
+
+
+
Keypoints
+
+
+
Use cbind() to add a new column to a data frame.
+
Use rbind() to add a new row to a data frame.
+
Remove rows from a data frame.
+
Use str(), summary(), nrow(),
+ncol(), dim(), colnames(),
+rownames(), head(), and typeof()
+to understand the structure of a data frame.
+
Read in a csv file using read.csv().
+
Understand what length() of a data frame
+represents.
In R, simple vectors containing character strings, numbers, or
+logical values are called atomic vectors because they can’t be
+further simplified.
+
+
+
+
So now that we’ve created a dummy vector to play with, how do we get
+at its contents?
+
Accessing elements using their indices
+
+
To extract elements of a vector we can give their corresponding
+index, starting from one:
+
+
R
+
+
+x[1]
+
+
+
OUTPUT
+
+
a
+5.4
+
+
+
R
+
+
+x[4]
+
+
+
OUTPUT
+
+
d
+4.8
+
+
It may look different, but the square brackets operator is a
+function. For vectors (and matrices), it means “get me the nth
+element”.
+
We can ask for multiple elements at once:
+
+
R
+
+
+x[c(1, 3)]
+
+
+
OUTPUT
+
+
a c
+5.4 7.1
+
+
Or slices of the vector:
+
+
R
+
+
+x[1:4]
+
+
+
OUTPUT
+
+
a b c d
+5.4 6.2 7.1 4.8
+
+
The : operator creates a sequence of numbers from the
+left element to the right.
+
+
R
+
+
+1:4
+
+
+
OUTPUT
+
+
[1] 1 2 3 4
+
+
+
R
+
+
+c(1, 2, 3, 4)
+
+
+
OUTPUT
+
+
[1] 1 2 3 4
+
+
We can ask for the same element multiple times:
+
+
R
+
+
+x[c(1,1,3)]
+
+
+
OUTPUT
+
+
a a c
+5.4 5.4 7.1
+
+
If we ask for an index beyond the length of the vector, R will return
+a missing value:
+
+
R
+
+
+x[6]
+
+
+
OUTPUT
+
+
<NA>
+ NA
+
+
This is a vector of length one containing an NA, whose
+name is also NA.
+
If we ask for the 0th element, we get an empty vector:
+
+
R
+
+
+x[0]
+
+
+
OUTPUT
+
+
named numeric(0)
+
+
+
+
+
+
+
Vector numbering in R starts at 1
+
+
+
In many programming languages (C and Python, for example), the first
+element of a vector has an index of 0. In R, the first element is 1.
+
+
+
+
Skipping and removing elements
+
+
If we use a negative number as the index of a vector, R will return
+every element except for the one specified:
+
+
R
+
+
+x[-2]
+
+
+
OUTPUT
+
+
a c d e
+5.4 7.1 4.8 7.5
+
+
We can skip multiple elements:
+
+
R
+
+
+x[c(-1, -5)]  # or x[-c(1,5)]
+
+
+
OUTPUT
+
+
b c d
+6.2 7.1 4.8
+
+
+
+
+
+
+
Tip: Order of operations
+
+
+
A common trip up for novices occurs when trying to skip slices of a
+vector. It’s natural to try to negate a sequence like so:
+
+
R
+
+
+x[-1:3]
+
+
This gives a somewhat cryptic error:
+
+
ERROR
+
+
Error in x[-1:3]: only 0's may be mixed with negative subscripts
+
+
But remember the order of operations. : is really a
+function. It takes its first argument as -1, and its second as 3, so
+it generates the sequence of numbers: c(-1, 0, 1, 2, 3).
+
The correct solution is to wrap that function call in brackets, so
+that the - operator applies to the result:
+
+
R
+
+
+x[-(1:3)]
+
+
+
OUTPUT
+
+
d e
+4.8 7.5
+
+
+
+
+
To remove elements from a vector, we need to assign the result back
+into the variable:
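A minimal sketch of removing an element by assigning the result back. It assumes the same named vector x used throughout this episode, and works on a copy (the hypothetical name x_copy) so that x itself stays unchanged for the examples that follow.

```r
# The example vector used in this episode
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# Work on a copy so that x itself is left untouched
x_copy <- x

# Subsetting alone does not modify x_copy; assigning the result back does
x_copy <- x_copy[-4]   # drop the fourth element ("d")
print(x_copy)
```

In an interactive session you would more often write x <- x[-4] directly, which overwrites x with the shortened vector.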
Come up with at least 2 different commands that will produce the
+following output:
+
+
OUTPUT
+
+
b c d
+6.2 7.1 4.8
+
+
After you find 2 different commands, compare notes with your
+neighbour. Did you have different strategies?
+
+
+
+
+
+
+
+
+
+
R
+
+
+x[2:4]
+
+
+
OUTPUT
+
+
b c d
+6.2 7.1 4.8
+
+
+
R
+
+
+x[-c(1,5)]
+
+
+
OUTPUT
+
+
b c d
+6.2 7.1 4.8
+
+
+
R
+
+
+x[c(2,3,4)]
+
+
+
OUTPUT
+
+
b c d
+6.2 7.1 4.8
+
+
+
+
+
+
Subsetting by name
+
+
We can extract elements by using their name, instead of extracting by
+index:
+
+
R
+
+
+x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5)  # we can name a vector 'on the fly'
+x[c("a", "c")]
+
+
+
OUTPUT
+
+
a c
+5.4 7.1
+
+
This is usually a much more reliable way to subset objects: the
+position of various elements can often change when chaining together
+subsetting operations, but the names will always remain the same!
+
Subsetting through other logical operations
+
+
We can also use any logical vector to subset:
+
+
R
+
+
+x[c(FALSE, FALSE, TRUE, FALSE, TRUE)]
+
+
+
OUTPUT
+
+
c e
+7.1 7.5
+
+
Since comparison operators (e.g. >,
+<, ==) evaluate to logical vectors, we can
+also use them to succinctly subset vectors: the following statement
+gives the same result as the previous one.
+
+
R
+
+
+x[x>7]
+
+
+
OUTPUT
+
+
c e
+7.1 7.5
+
+
Breaking it down, this statement first evaluates x>7,
+generating a logical vector
+c(FALSE, FALSE, TRUE, FALSE, TRUE), and then selects the
+elements of x corresponding to the TRUE
+values.
+
We can use == to mimic the previous method of indexing
+by name (remember you have to use == rather than
+= for comparisons):
+
+
R
+
+
+x[names(x)=="a"]
+
+
+
OUTPUT
+
+
a
+5.4
+
+
+
+
+
+
+
Tip: Combining logical conditions
+
+
+
We often want to combine multiple logical criteria. For example, we
+might want to find all the countries that are located in Asia
+or Europe and have life expectancies
+within a certain range. Several operations for combining logical vectors
+exist in R:
+
+&, the “logical AND” operator: returns
+TRUE if both the left and right are TRUE.
+
+|, the “logical OR” operator: returns
+TRUE, if either the left or right (or both) are
+TRUE.
+
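You may sometimes see && and ||
+instead of & and |. These two-character
+operators work on single logical values rather than whole vectors: older
+versions of R looked only at the first element of each vector and ignored
+the remaining elements, while R 4.3.0 and later raise an error if either
+side has more than one element. In general you should not use the two-character
+operators in data analysis; save them for programming, i.e. deciding
+whether to execute a statement.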
+
+!, the “logical NOT” operator: converts
+TRUE to FALSE and FALSE to
+TRUE. It can negate a single logical condition (e.g.
+!TRUE becomes FALSE), or a whole vector of
+conditions (e.g. !c(TRUE, FALSE) becomes
+c(FALSE, TRUE)).
+
Additionally, you can compare the elements within a single vector
+using the all function (which returns TRUE if
+every element of the vector is TRUE) and the
+any function (which returns TRUE if one or
+more elements of the vector are TRUE).
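The operators described in this tip can be sketched on the example vector from this episode. The thresholds 5 and 7 are arbitrary values chosen for illustration.

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# Element-wise AND: TRUE only where both conditions hold
in_range <- x > 5 & x < 7

# Element-wise OR: TRUE where at least one condition holds
at_edges <- x < 5 | x > 7

# NOT flips every element of a logical vector
not_large <- !(x > 7)

# any() and all() collapse a logical vector to a single value
any_large <- any(x > 7)
all_large <- all(x > 7)

print(x[in_range])
print(x[at_edges])
```

Because & and | work element by element, their results can be used directly inside the square brackets to subset x.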
Write a subsetting command to return the values in x that are greater
+than 4 and less than 7.
+
+
+
+
+
+
+
+
+
+
R
+
+
+x_subset <- x[x < 7 & x > 4]
+print(x_subset)
+
+
+
OUTPUT
+
+
a b d
+5.4 6.2 4.8
+
+
+
+
+
+
+
+
+
+
+
Tip: Non-unique names
+
+
+
You should be aware that it is possible for multiple elements in a
+vector to have the same name. (For a data frame, columns can have the
+same name — although R tries to avoid this — but row names must be
+unique.) Consider these examples:
+
+
R
+
+
+x<-1:3
+x
+
+
+
OUTPUT
+
+
[1] 1 2 3
+
+
+
R
+
+
+names(x)<-c('a', 'a', 'a')
+x
+
+
+
OUTPUT
+
+
a a a
+1 2 3
+
+
+
R
+
+
+x['a']  # only returns first value
+
+
+
OUTPUT
+
+
a
+1
+
+
+
R
+
+
+x[names(x) == 'a']  # returns all three values
+
+
+
OUTPUT
+
+
a a a
+1 2 3
+
+
+
+
+
+
+
+
+
+
Tip: Getting help for operators
+
+
+
Remember you can search for help on operators by wrapping them in
+quotes: help("%in%") or ?"%in%".
+
+
+
+
Skipping named elements
+
+
Skipping or removing named elements is a little harder. If we try to
+skip one named element by negating the string, R complains (slightly
+obscurely) that it doesn’t know how to take the negative of a
+string:
+
+
R
+
+
+x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5)  # we start again by naming a vector 'on the fly'
+x[-"a"]
+
+
+
ERROR
+
+
Error in -"a": invalid argument to unary operator
+
+
However, we can use the != (not-equals) operator to
+construct a logical vector that will do what we want:
+
+
R
+
+
+x[names(x)!="a"]
+
+
+
OUTPUT
+
+
b c d e
+6.2 7.1 4.8 7.5
+
+
Skipping multiple named indices is a little bit harder still. Suppose
+we want to drop the "a" and "c" elements, so
+we try this:
+
+
R
+
+
+x[names(x)!=c("a","c")]
+
+
+
WARNING
+
+
Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length
+
+
+
OUTPUT
+
+
b c d e
+6.2 7.1 4.8 7.5
+
+
R did something, but it gave us a warning that we ought to
+pay attention to - and it apparently gave us the wrong answer
+(the "c" element is still included in the vector)!
+
So what does != actually do in this case? That’s an
+excellent question.
+
+
Recycling
+
Let’s take a look at the comparison component of this code:
+
+
R
+
+
+names(x)!=c("a", "c")
+
+
+
WARNING
+
+
Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length
+
+
+
OUTPUT
+
+
[1] FALSE TRUE TRUE TRUE TRUE
+
+
Why does R give TRUE as the third element of this
+vector, when names(x)[3] != "c" is obviously false? When
+you use !=, R tries to compare each element of the left
+argument with the corresponding element of its right argument. What
+happens when you compare vectors of different lengths?
+
When one vector is shorter than the other, it gets
+recycled:
+
In this case R repeats c("a", "c") as
+many times as necessary to match names(x), i.e. we get
+c("a","c","a","c","a"). Since the recycled "a"
+doesn’t match the third element of names(x), the value of
+!= is TRUE. Because in this case the longer
+vector length (5) isn’t a multiple of the shorter vector length (2), R
+printed a warning message. If we had been unlucky and
+names(x) had contained six elements, R would
+silently have done the wrong thing (i.e., not what we intended
+it to do). This recycling rule can introduce hard-to-find and subtle
+bugs!
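To make the recycling visible, the sketch below builds the recycled vector by hand with rep_len(). R performs this step implicitly; calling rep_len() here is only an illustration of what the comparison effectively sees, not something you need to do yourself.

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# Repeat c("a", "c") until it is as long as names(x)
recycled <- rep_len(c("a", "c"), length(names(x)))
print(recycled)

# Comparing against the explicitly recycled vector gives the same
# element-by-element result as the recycled comparison, without the warning
manual_comparison <- names(x) != recycled
print(manual_comparison)
```

Only the first position lines up ("a" against "a"), which is why "c" survives the attempted removal.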
+
The way to get R to do what we really want (match each
+element of the left argument with all of the elements of the
+right argument) is to use the %in% operator. The
+%in% operator goes through each element of its left
+argument, in this case the names of x, and asks, “Does this
+element occur in the second argument?”. Here, since we want to
+exclude values, we also need a ! operator to
+change “in” to “not in”:
+
+
R
+
+
+x[!names(x)%in%c("a","c")]
+
+
+
OUTPUT
+
+
b d e
+6.2 4.8 7.5
+
+
+
+
+
+
+
Challenge 3
+
+
+
Selecting elements of a vector that match any of a list of components
+is a very common data analysis task. For example, the gapminder data set
+contains country and continent variables, but
+no information between these two scales. Suppose we want to pull out
+information from southeast Asia: how do we set up an operation to
+produce a logical vector that is TRUE for all of the
+countries in southeast Asia and FALSE otherwise?
+
Suppose you have these data:
+
+
R
+
+
+seAsia <- c("Myanmar", "Thailand", "Cambodia", "Vietnam", "Laos")
+## read in the gapminder data that we downloaded in episode 2
+gapminder <- read.csv("data/gapminder_data.csv", header = TRUE)
+## extract the `country` column from a data frame (we'll see this later);
+## convert from a factor to a character;
+## and get just the non-repeated elements
+countries <- unique(as.character(gapminder$country))
+
+
There’s a wrong way (using only ==), which will give you
+a warning; a clunky way (using the logical operators == and
+|); and an elegant way (using %in%). See
+whether you can come up with all three and explain how they (don’t)
+work.
+
+
+
+
+
+
+
+
+
The wrong way to do this problem is
+countries==seAsia. This gives a warning
+("In countries == seAsia : longer object length is not a multiple of shorter object length")
+and the wrong answer (a vector of all FALSE values),
+because none of the recycled values of seAsia happen to
+line up correctly with matching values in countries.
+
The clunky (but technically correct) way to do this
+problem is to compare countries with each name in turn and combine
+the results with | (that is, countries==seAsia[1] | countries==seAsia[2] | ...).
+This gives the correct values, but hopefully you can see how awkward it
+is (what if we wanted to select countries from a much longer list?).
+
The best way to do this problem is
+countries %in% seAsia, which is both correct and easy to
+type (and read).
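The clunky and the elegant approaches can be compared side by side in a small sketch. The countries vector below is a short hypothetical stand-in for the full list extracted from gapminder, so the snippet runs without the data file.

```r
seAsia <- c("Myanmar", "Thailand", "Cambodia", "Vietnam", "Laos")

# Hypothetical, shortened list of countries (not the full gapminder list)
countries <- c("Brazil", "Cambodia", "France", "Laos",
               "Myanmar", "Peru", "Thailand", "Vietnam")

# Clunky: one comparison per name, combined with |
clunky <- countries == "Myanmar" | countries == "Thailand" |
  countries == "Cambodia" | countries == "Vietnam" | countries == "Laos"

# Elegant: ask whether each country occurs anywhere in seAsia
elegant <- countries %in% seAsia

print(identical(clunky, elegant))
print(countries[elegant])
```

Both logical vectors select the same countries, but only the %in% version stays the same length when seAsia grows.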
+
+
+
+
+
+
Handling special values
+
+
At some point you will encounter functions in R that cannot handle
+missing, infinite, or undefined data.
+
There are a number of special functions you can use to filter out
+this data:
+
+is.na will flag all positions in a vector, matrix, or
+data.frame containing NA (or NaN), returning TRUE there and FALSE elsewhere
+
Likewise, is.nan and is.infinite will do
+the same for NaN and Inf.
+
+is.finite will return all positions in a vector,
+matrix, or data.frame that do not contain NA,
+NaN or Inf.
+
+na.omit will filter out all missing values from a
+vector
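The functions in the list above can be tried on a small vector. The values in v are made up for illustration.

```r
# A vector mixing ordinary numbers with special values
v <- c(1.5, NA, NaN, Inf, -2)

na_flags       <- is.na(v)        # TRUE for NA and NaN
nan_flags      <- is.nan(v)       # TRUE for NaN only
infinite_flags <- is.infinite(v)  # TRUE for Inf and -Inf
finite_flags   <- is.finite(v)    # TRUE for ordinary numbers only

# Logical subsetting keeps just the finite values
finite_only <- v[is.finite(v)]

# na.omit() drops NA and NaN; as.numeric() strips the bookkeeping
# attribute that na.omit() attaches to its result
no_missing <- as.numeric(na.omit(v))

print(finite_only)
print(no_missing)
```

Note that na.omit keeps Inf, so use is.finite when infinite values should be dropped as well.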
+
Factor subsetting
+
+
Now that we’ve explored the different ways to subset vectors, how do
+we subset the other data structures?
+
Factor subsetting works the same way as vector subsetting.
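A brief sketch of factor subsetting, using a made-up factor f. It also shows one behaviour worth knowing: dropping elements does not drop their level, unless you ask for that with droplevels().

```r
# Hypothetical example factor
f <- factor(c("a", "a", "b", "c", "c", "d"))

only_a      <- f[f == "a"]   # logical subsetting
first_three <- f[1:3]        # subsetting by index
without_b   <- f[-3]         # skip the third element (the only "b")

# The level "b" is still recorded even though no "b" values remain
print(levels(without_b))

# droplevels() removes levels that are no longer used
print(levels(droplevels(without_b)))
```

The indexing rules are the same as for vectors; only the handling of levels is specific to factors.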
Unlike vectors, if we try to access a row or column outside of the
+matrix, R will throw an error:
+
+
R
+
+
+m[, c(3,6)]
+
+
+
ERROR
+
+
Error in m[, c(3, 6)]: subscript out of bounds
+
+
+
+
+
+
+
Tip: Higher dimensional arrays
+
+
+
When dealing with multi-dimensional arrays, each argument to
+[ corresponds to a dimension. For example, for a 3D array, the
+first three arguments correspond to the rows, columns, and depth
+dimension.
+
+
+
+
Because matrices are vectors, we can also subset using only one
+argument:
+
+
R
+
+
+m[5]
+
+
+
OUTPUT
+
+
[1] 0.3295078
+
+
This usually isn’t useful, and is often confusing to read. However, it is
+worth noting that matrices are laid out in column-major
+format by default. That is, the elements of the vector are arranged
+column-wise:
+
+
R
+
+
+matrix(1:6, nrow=2, ncol=3)
+
+
+
OUTPUT
+
+
[,1] [,2] [,3]
+[1,] 1 3 5
+[2,] 2 4 6
+
+
If you wish to populate the matrix by row, use
+byrow=TRUE:
+
+
R
+
+
+matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
+
+
+
OUTPUT
+
+
[,1] [,2] [,3]
+[1,] 1 2 3
+[2,] 4 5 6
+
+
Matrices can also be subsetted using their row names and column names
+instead of their row and column indices.
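A small sketch of name-based matrix subsetting. The matrix m_named and its row and column names are hypothetical, set through the dimnames argument of matrix().

```r
# A 2 x 3 matrix with hypothetical row and column names
m_named <- matrix(1:6, nrow = 2, ncol = 3,
                  dimnames = list(c("r1", "r2"), c("a", "b", "c")))

first_row <- m_named["r1", ]             # every column of row "r1"
column_b  <- m_named[, "b"]              # every row of column "b"
picked    <- m_named["r2", c("a", "c")]  # row "r2", columns "a" and "c"

print(m_named)
print(picked)
```

Names and numeric indices can be mixed, for example m_named[2, "c"].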
Which of the following commands will extract the values 11 and
+14?
+
A. m[2,4,2,5]
+
B. m[2:5]
+
C. m[4:5,2]
+
D. m[2,c(4,5)]
+
+
+
+
+
+
+
+
+
D
+
+
+
+
+
List subsetting
+
+
Now we’ll introduce some new subsetting operators. There are three
+functions used to subset lists. We’ve already seen these when learning
+about atomic vectors and matrices: [, [[, and
+$.
+
Using [ will always return a list. If you want to
+subset a list, but not extract an element, then you
+will likely use [.
+
+
R
+
+
+xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))
+xlist[1]
+
+
+
OUTPUT
+
+
$a
+[1] "Software Carpentry"
+
+
This returns a list with one element.
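The difference between subsetting a list and extracting an element can be sketched with a made-up list, demo_list. The single bracket keeps the list wrapper, while [[ and $ return the element itself.

```r
# Hypothetical two-element list
demo_list <- list(name = "Software Carpentry", numbers = 1:10)

sub_list  <- demo_list[2]        # still a list, of length one
element   <- demo_list[[2]]      # the integer vector itself
same_elem <- demo_list$numbers   # shorthand for extracting by name

print(class(sub_list))
print(class(element))
```

Functions that expect a vector, such as mean(), work on element but not on sub_list.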
+
We can subset elements of a list exactly the same way as atomic
+vectors using [. Comparison operations, however, won’t work
+as they’re not recursive: they will try to condition on the data
+structures in each element of the list, not the individual elements
+within those data structures.
+xlist<-list(a ="Software Carpentry", b =1:10, data =head(mtcars))
+
+
Using your knowledge of both list and vector subsetting, extract the
+number 2 from xlist. Hint: the number 2 is contained within the “b” item
+in the list.
+
+
+
+
+
+
+
+
+
+
R
+
+
+xlist$b[2]
+
+
+
OUTPUT
+
+
[1] 2
+
+
+
R
+
+
+xlist[[2]][2]
+
+
+
OUTPUT
+
+
[1] 2
+
+
+
R
+
+
+xlist[["b"]][2]
+
+
+
OUTPUT
+
+
[1] 2
+
+
+
+
+
+
+
+
+
+
+
Challenge 6
+
+
+
Given a linear model:
+
+
R
+
+
+mod<-aov(pop~lifeExp, data=gapminder)
+
+
Extract the residual degrees of freedom (hint:
+attributes() will help you)
+
+
+
+
+
+
+
+
+
+
R
+
+
+attributes(mod)  ## `df.residual` is one of the names of `mod`
+
+
+
R
+
+
+mod$df.residual
+
+
+
+
+
+
Data frames
+
+
Remember that data frames are lists under the hood, so similar
+rules apply. However, they are also two-dimensional objects:
+
[ with one argument will act the same way as for lists,
+where each list element corresponds to a column. The resulting object
+will be a data frame:
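A minimal sketch of single-argument [ on a data frame. The data frame gap_small is a hypothetical miniature of gapminder so that the snippet is self-contained.

```r
# Hypothetical miniature of gapminder
gap_small <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Afghanistan"),
  year    = c(1952L, 1957L, 1962L),
  lifeExp = c(28.801, 30.332, 31.997),
  stringsAsFactors = FALSE
)

# One argument: select the third list element, i.e. the third column
third_column <- gap_small[3]

print(class(third_column))
print(third_column)
```

The result keeps its column name and row structure because it is still a data frame.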
Similarly, [[ will act to extract a single
+column:
+
+
R
+
+
+head(gapminder[["lifeExp"]])
+
+
+
OUTPUT
+
+
[1] 28.801 30.332 31.997 34.020 36.088 38.438
+
+
And $ provides a convenient shorthand to extract columns
+by name:
+
+
R
+
+
+head(gapminder$year)
+
+
+
OUTPUT
+
+
[1] 1952 1957 1962 1967 1972 1977
+
+
With two arguments, [ behaves the same way as for
+matrices:
+
+
R
+
+
+gapminder[1:3,]
+
+
+
OUTPUT
+
+
country year pop continent lifeExp gdpPercap
+1 Afghanistan 1952 8425333 Asia 28.801 779.4453
+2 Afghanistan 1957 9240934 Asia 30.332 820.8530
+3 Afghanistan 1962 10267083 Asia 31.997 853.1007
+
+
If we subset a single row, the result will be a data frame (because
+the elements are mixed types):
+
+
R
+
+
+gapminder[3,]
+
+
+
OUTPUT
+
+
country year pop continent lifeExp gdpPercap
+3 Afghanistan 1962 10267083 Asia 31.997 853.1007
+
+
But for a single column the result will be a vector (this can be
+changed with the third argument, drop = FALSE).
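The effect of drop = FALSE can be sketched as follows, again on a hypothetical miniature data frame rather than the full gapminder data.

```r
# Hypothetical miniature of gapminder
gap_small <- data.frame(
  year    = c(1952L, 1957L, 1962L),
  lifeExp = c(28.801, 30.332, 31.997)
)

# Default behaviour: a single column collapses to a plain vector
as_vector <- gap_small[, "lifeExp"]

# drop = FALSE keeps the result as a one-column data frame
as_frame <- gap_small[, "lifeExp", drop = FALSE]

print(class(as_vector))
print(class(as_frame))
```

Keeping the data frame structure is useful when later code expects column names to be present.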
+
+
+
+
+
+
Challenge 7
+
+
+
Fix each of the following common data frame subsetting errors:
+
Extract observations collected for the year 1957
+
+
R
+
+
gapminder[gapminder$year =1957,]
+
+
Extract all columns except 1 through to 4
+
+
R
+
+
+gapminder[,-1:4]
+
+
Extract the rows where the life expectancy is longer than 80
+years
+
+
R
+
+
+gapminder[gapminder$lifeExp>80]
+
+
Extract the first row, and the fourth and fifth columns
+(continent and lifeExp).
+
+
R
+
+
+gapminder[1, 4, 5]
+
+
Advanced: extract rows that contain information for the years 2002
+and 2007
+
+
R
+
+
+gapminder[gapminder$year==2002|2007,]
+
+
+
+
+
+
+
+
+
+
Fix each of the following common data frame subsetting errors:
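One possible set of fixes is sketched below. To keep it runnable without the data file it uses a small hypothetical data frame, gap_toy, with the same column names as gapminder; replace gap_toy with gapminder to apply the same commands to the real data.

```r
# Hypothetical stand-in with the same columns as gapminder
gap_toy <- data.frame(
  country   = c("A", "A", "B", "B"),
  year      = c(1957L, 2002L, 2002L, 2007L),
  pop       = c(1e6, 2e6, 3e6, 4e6),
  continent = c("Asia", "Asia", "Europe", "Europe"),
  lifeExp   = c(45.2, 61.0, 79.5, 81.3),
  gdpPercap = c(800, 1500, 30000, 35000),
  stringsAsFactors = FALSE
)

# 1. Use == (comparison), not = (assignment)
fix1 <- gap_toy[gap_toy$year == 1957, ]

# 2. Negate the whole sequence, not just its first number
fix2 <- gap_toy[, -(1:4)]

# 3. Add the comma so the condition filters rows
fix3 <- gap_toy[gap_toy$lifeExp > 80, ]

# 4. Combine the column indices with c()
fix4 <- gap_toy[1, c(4, 5)]

# 5. Test year against both values (year == 2002 | year == 2007 also works)
fix5 <- gap_toy[gap_toy$year %in% c(2002, 2007), ]
```

In each case the fix is about telling R which dimension an index refers to and which operator is a comparison.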
Write conditional statements with if...else statements
+and ifelse().
+
Write and understand for() loops.
+
+
+
+
+
+
Often when we’re coding we want to control the flow of our actions.
+This can be done by setting actions to occur only if a condition or a
+set of conditions are met. Alternatively, we can also set an action to
+occur a particular number of times.
+
There are several ways you can control flow in R. For conditional
+statements, the most commonly used approaches are the constructs:
+
+
R
+
+
# if
+if (condition is true) {
+ perform action
+}
+
+# if ... else
+if (condition is true) {
+ perform action
+} else { # that is, if the condition is false,
+ perform alternative action
+}
+
+
Say, for example, that we want R to print a message if a variable
+x has a particular value:
+
+
R
+
+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+}
+
+x
+
+
+
OUTPUT
+
+
[1] 8
+
+
The print statement does not appear in the console because x is not
+greater than or equal to 10. To print a different message for numbers less than 10,
+we can add an else statement.
+
+
R
+
+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else {
+  print("x is less than 10")
+}
+
+
+
OUTPUT
+
+
[1] "x is less than 10"
+
+
You can also test multiple conditions by using
+else if.
+
+
R
+
+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else if (x > 5) {
+  print("x is greater than 5, but less than 10")
+} else {
+  print("x is less than 5")
+}
+
+
+
OUTPUT
+
+
[1] "x is greater than 5, but less than 10"
+
+
Important: when R evaluates the condition inside
+if() statements, it is looking for a logical element, i.e.,
+TRUE or FALSE. This can cause some headaches
+for beginners. For example:
+
+
R
+
+
+x <- 4 == 3
+if (x) {
+  "4 equals 3"
+} else {
+  "4 does not equal 3"
+}
+
+
+
OUTPUT
+
+
[1] "4 does not equal 3"
+
+
As we can see, the not equal message was printed because the vector x
+is FALSE.
+
+
R
+
+
+x<-4==3
+x
+
+
+
OUTPUT
+
+
[1] FALSE
+
+
+
+
+
+
+
Challenge 1
+
+
+
Use an if() statement to print a suitable message
+reporting whether there are any records from 2002 in the
+gapminder dataset. Now do the same for 2012.
+
+
+
+
+
+
+
+
+
We will first see a solution to Challenge 1 which does not use the
+any() function. We first obtain a logical vector describing
+which elements of gapminder$year are equal to
+2002, and use it to select the matching rows:
+
+
R
+
+
+gapminder[(gapminder$year==2002),]
+
+
Then, we count the number of rows of the data.frame
+gapminder that correspond to the year 2002:
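The counting step can be sketched as follows. The variable name rows2002_number comes from the text; the data frame gap_toy is a hypothetical stand-in for gapminder so that the snippet runs on its own.

```r
# Hypothetical stand-in for gapminder with a year column
gap_toy <- data.frame(
  year    = c(1997L, 2002L, 2002L, 2007L),
  lifeExp = c(60.1, 61.3, 70.2, 71.8)
)

# Select the rows for 2002, then count them with nrow()
rows2002_number <- nrow(gap_toy[(gap_toy$year == 2002), ])
print(rows2002_number)
```

With the real data the same command, applied to gapminder, counts the 2002 records.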
The presence of any record for the year 2002 is equivalent to the
+request that rows2002_number is one or more:
+
+
R
+
+
+rows2002_number>=1
+
+
Putting it all together, we obtain:
+
+
R
+
+
+if (nrow(gapminder[(gapminder$year == 2002), ]) >= 1) {
+  print("Record(s) for the year 2002 found.")
+}
+
+
All this can be done more quickly with any(). The
+logical condition can be expressed as:
+
+
R
+
+
+if (any(gapminder$year == 2002)) {
+  print("Record(s) for the year 2002 found.")
+}
+
+
+
+
+
+
Did anyone get an error message like this?
+
+
ERROR
+
+
Error in if (gapminder$year == 2012) {: the condition has length > 1
+
+
The if() function only accepts singular (of length 1)
+inputs. In R 4.2.0 and later it therefore returns an error when you use
+it with a longer vector; older versions of R would still run, with a
+warning, and evaluate only the first element of the vector. Either way,
+to use the if() function, you need to make sure your input is singular
+(of length 1).
+
+
+
+
+
+
Tip: Built in ifelse()
+function
+
+
+
R accepts both if() and
+else if() statements structured as outlined above, but also
+statements using R’s built-in ifelse()
+function. This function accepts both singular and vector inputs and is
+structured as follows:
+
+
R
+
+
# ifelse function
+ifelse(condition is true, perform action, perform alternative action)
+
+
where the first argument is the condition or a set of conditions to
+be met, the second argument is the statement that is evaluated when the
+condition is TRUE, and the third argument is the statement
+that is evaluated when the condition is FALSE.
+
+
R
+
+
+y <- -3
+ifelse(y<0, "y is a negative number", "y is either positive or zero")
+
+
+
OUTPUT
+
+
[1] "y is a negative number"
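Since ifelse() also accepts vector inputs, a short sketch with a made-up vector y_values shows it testing every element at once.

```r
# Hypothetical vector of values to classify
y_values <- c(-3, 0, 2.5)

# The condition is evaluated for each element, and the matching
# label is returned in the same position
labels <- ifelse(y_values < 0, "negative", "positive or zero")
print(labels)
```

The result has the same length as the condition, which makes ifelse() convenient for creating a new column from an existing one.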
+
+
+
+
+
+
+
+
+
+
Tip: any() and
+all()
+
+
+
The any() function will return TRUE if at
+least one TRUE value is found within a vector, otherwise it
+will return FALSE. This can be used in a similar way to the
+%in% operator. The function all(), as the name
+suggests, will only return TRUE if all values in the vector
+are TRUE.
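A brief sketch of any() and all(), including the comparison with %in%. The years vector is made up for illustration.

```r
# Hypothetical vector of years
years <- c(1952, 1957, 2002, 2007)

has_2002_any <- any(years == 2002)   # at least one element equals 2002
has_2002_in  <- 2002 %in% years      # the same question asked with %in%
has_2012     <- any(years == 2012)   # no element equals 2012
all_recent   <- all(years > 1950)    # every element is after 1950

print(c(has_2002_any, has_2002_in, has_2012, all_recent))
```

Both any() and %in% return a single logical value here, so either can safely be used as the condition of an if() statement.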
+
+
+
+
Repeating operations
+
+
If you want to iterate over a set of values, when the order of
+iteration is important, and perform the same operation on each, a
+for() loop will do the job. We saw for() loops
+in the shell
+lessons earlier. This is the most flexible of looping operations,
+but therefore also the hardest to use correctly. In general, the advice
+of many R users would be to learn about for()
+loops, but to avoid using for() loops unless the order of
+iteration is important: i.e. the calculation at each iteration depends
+on the results of previous iterations. If the order of iteration is not
+important, then you should learn about vectorized alternatives, such as
+the purrr package, as they pay off in computational
+efficiency.
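A minimal sketch of a for() loop, followed by one loop nested inside another. The index names i and j and the letters used are arbitrary choices; the counter n_pairs is a hypothetical addition that simply records how many times the inner body runs.

```r
# A simple loop: print each value in turn
for (i in 1:3) {
  print(i)
}

# A nested loop: for every i, j runs through all of its values
n_pairs <- 0
for (i in 1:2) {
  for (j in c("a", "b", "c")) {
    print(paste(i, j))
    n_pairs <- n_pairs + 1
  }
}
print(n_pairs)
```

The inner loop finishes all of its iterations before the outer index moves on to its next value.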
We notice in the output that when the first index (i) is
+set to 1, the second index (j) iterates through its full
+set of indices. Once the indices of j have been iterated
+through, then i is incremented. This process continues
+until the last index has been used for each for() loop.
+
Rather than printing the results, we could write the loop output to a
+new object.
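One way to do this is sketched below: the result of each iteration is appended to a vector. The name output_vector matches the object referred to later in this episode; the loop ranges are assumptions chosen for illustration.

```r
# Start with an empty vector and append one result per iteration
output_vector <- c()
for (i in 1:5) {
  for (j in c("a", "b", "c", "d", "e")) {
    temp_output <- paste(i, j)
    output_vector <- c(output_vector, temp_output)
  }
}
print(output_vector)
```

Each call to c() copies the whole vector, which is where the cost of this pattern comes from.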
This approach can be useful, but ‘growing your results’ (building the
+result object incrementally) is computationally inefficient, so avoid it
+when you are iterating through a lot of values.
+
+
+
+
+
+
Tip: don’t grow your results
+
+
+
One of the biggest things that trips up novices and experienced R
+users alike, is building a results object (vector, list, matrix, data
+frame) as your for loop progresses. Computers are very bad at handling
+this, so your calculations can very quickly slow to a crawl. It’s much
+better to define an empty results object beforehand, with appropriate
+dimensions, rather than initializing an empty object without dimensions.
+So if you know the end result will be stored in a matrix like above,
+create an empty matrix with 5 rows and 5 columns, then at each iteration
+store the results in the appropriate location.
+
+
+
+
A better way is to define your (empty) output object before filling
+in the values. For this example, it looks more involved, but is still
+more efficient.
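A sketch of the pre-allocation approach, assuming a 5 by 5 result as described in the tip. The names output_matrix, temp_output and output_vector2 are the ones used later in this episode; j_vector and temp_j_value are hypothetical helper names.

```r
# Create the empty 5 x 5 results matrix up front
output_matrix <- matrix(nrow = 5, ncol = 5)
j_vector <- c("a", "b", "c", "d", "e")

for (i in 1:5) {
  for (j in 1:5) {
    temp_j_value <- j_vector[j]
    temp_output <- paste(i, temp_j_value)
    # Store each result in its own cell instead of growing an object
    output_matrix[i, j] <- temp_output
  }
}

# Flatten the matrix into a vector (column by column)
output_vector2 <- as.vector(output_matrix)
print(output_vector2)
```

Because the matrix already has its final size, no copying is needed as the loop fills it in.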
Sometimes you will find yourself needing to repeat an operation as
+long as a certain condition is met. You can do this with a
+while() loop.
+
+
R
+
+
while(this condition is true){
+ do a thing
+}
+
+
R will interpret a condition being met as “TRUE”.
+
As an example, here’s a while loop that generates random numbers from
+a uniform distribution (the runif() function) between 0 and
+1 until it gets one that’s less than 0.1.
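A minimal sketch of such a loop is shown below. The variable name z and the starting value 1 are assumptions; any starting value above 0.1 ensures the loop body runs at least once.

```r
# Start above the threshold so the condition is TRUE on the first check
z <- 1
while (z > 0.1) {
  z <- runif(1)   # draw one number from the uniform distribution on (0, 1)
  cat(z, "\n")    # print each draw on its own line
}
```

The number of lines printed changes from run to run, since it depends on how many draws are needed before one falls at or below 0.1.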
while() loops will not always be appropriate. You have
+to be particularly careful that you don’t end up stuck in an infinite
+loop because your condition is always met and hence the while statement
+never terminates.
+
+
+
+
+
+
+
+
+
Challenge 2
+
+
+
Compare the objects output_vector and
+output_vector2. Are they the same? If not, why not? How
+would you change the last block of code to make
+output_vector2 the same as output_vector?
+
+
+
+
+
+
+
+
+
We can check whether the two vectors are identical using the
+all() function:
+
+
R
+
+
+all(output_vector==output_vector2)
+
+
The result is FALSE, so the vectors are not identical. However, all the elements of output_vector can be found
+in output_vector2:
+
+
R
+
+
+all(output_vector%in%output_vector2)
+
+
and vice versa:
+
+
R
+
+
+all(output_vector2%in%output_vector)
+
+
Therefore, the elements in output_vector and
+output_vector2 are just sorted in a different order. This
+is because as.vector() outputs the elements of an input
+matrix column by column. Taking a look at
+output_matrix, we can see that we want its elements by
+rows. The solution is to transpose the output_matrix. We
+can do it either by calling the transpose function t() or
+by inputting the elements in the right order. The first solution
+requires changing the original
+
+
R
+
+
+output_vector2<-as.vector(output_matrix)
+
+
into
+
+
R
+
+
+output_vector2<-as.vector(t(output_matrix))
+
+
The second solution requires changing
+
+
R
+
+
+output_matrix[i, j]<-temp_output
+
+
into
+
+
R
+
+
+output_matrix[j, i]<-temp_output
+
+
+
+
+
+
+
+
+
+
+
Challenge 3
+
+
+
Write a script that loops through the gapminder data by
+continent and prints out whether the mean life expectancy is smaller or
+larger than 50 years.
+
+
+
+
+
+
+
+
+
Step 1: We want to make sure we can extract all the
+unique values of the continent vector
Step 2: We also need to loop over each of these
+continents and calculate the average life expectancy for each
+subset of data. We can do that as follows:
+
1. Loop over each of the unique values of ‘continent’
+
2. For each value of continent, create a temporary variable storing
+that subset
+
3. Return the calculated life expectancy to the user by printing the
+output:
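The three steps above can be sketched as follows. The data frame gap_toy is a hypothetical stand-in with made-up values; with the real data you would loop over unique(gapminder$continent) in the same way.

```r
# Hypothetical stand-in for gapminder
gap_toy <- data.frame(
  continent = c("Asia", "Asia", "Europe", "Europe"),
  lifeExp   = c(40, 50, 70, 80),
  stringsAsFactors = FALSE
)

# 1. Loop over each unique continent
for (iContinent in unique(gap_toy$continent)) {
  # 2. Store the rows for this continent in a temporary variable
  tmp <- gap_toy[gap_toy$continent == iContinent, ]
  # 3. Print the mean life expectancy for the subset
  cat(iContinent, mean(tmp$lifeExp, na.rm = TRUE), "\n")
}
```

Here na.rm = TRUE is a precaution so that a missing value would not turn the whole mean into NA.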
Step 3: The exercise asks us to report whether the
+average life expectancy is less than or greater than 50. So we
+need to add an if() condition before printing, which
+evaluates whether the calculated average life expectancy is above or
+below a threshold, and prints an output conditional on the result. We
+need to amend (3) from above:
+
3a. If the calculated life expectancy is less than some threshold (50
+years), return the continent and a statement that life expectancy is
+less than threshold, otherwise return the continent and a statement that
+life expectancy is greater than threshold:
+
+
R
+
+
+thresholdValue <- 50
+
+for (iContinent in unique(gapminder$continent)) {
+  tmp <- mean(gapminder[gapminder$continent == iContinent, "lifeExp"])
+
+  if (tmp < thresholdValue) {
+    cat("Average Life Expectancy in", iContinent, "is less than", thresholdValue, "\n")
+  } else {
+    cat("Average Life Expectancy in", iContinent, "is greater than", thresholdValue, "\n")
+  } # end if else condition
+  rm(tmp)
+} # end for loop
+
+
+
+
+
+
+
+
+
+
+
Challenge 4
+
+
+
Modify the script from Challenge 3 to loop over each country. This
+time print out whether the life expectancy is smaller than 50, between
+50 and 70, or greater than 70.
+
+
+
+
+
+
+
+
+
We modify our solution to Challenge 3 by now adding two thresholds,
+lowerThreshold and upperThreshold and
+extending our if-else statements:
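A sketch of what the extended script might look like. The names lowerThreshold and upperThreshold come from the text; the data frame gap_toy and its values are hypothetical, and the exact wording of the messages is an assumption.

```r
lowerThreshold <- 50
upperThreshold <- 70

# Hypothetical stand-in for gapminder
gap_toy <- data.frame(
  country = c("X", "X", "Y", "Y", "Z", "Z"),
  lifeExp = c(40, 44, 60, 64, 75, 79),
  stringsAsFactors = FALSE
)

for (iCountry in unique(gap_toy$country)) {
  tmp <- mean(gap_toy[gap_toy$country == iCountry, "lifeExp"])

  if (tmp < lowerThreshold) {
    cat("Average Life Expectancy in", iCountry, "is less than", lowerThreshold, "\n")
  } else if (tmp < upperThreshold) {
    cat("Average Life Expectancy in", iCountry, "is between",
        lowerThreshold, "and", upperThreshold, "\n")
  } else {
    cat("Average Life Expectancy in", iCountry, "is greater than", upperThreshold, "\n")
  } # end if else condition
  rm(tmp)
} # end for loop
```

The else if branch only needs to test the upper threshold, because it is reached only when the first condition has already failed.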
Write a script that loops over each country in the
+gapminder dataset, tests whether the country starts with a
+‘B’, and graphs life expectancy against time as a line graph if the mean
+life expectancy is under 50 years.
+
+
+
+
+
+
+
+
+
We will use the grep() command that was introduced in
+the Unix
+Shell lesson to find countries that start with “B”. Let’s understand
+how to do this first. Following from the Unix shell section we may be
+tempted to try the following:
+
+
R
+
+
+grep("^B", unique(gapminder$country))
+
+
But when we evaluate this command it returns the indices of the
+factor variable country that start with “B.” To get the
+values, we must add the value=TRUE option to the
+grep() command:
+
+
R
+
+
+grep("^B", unique(gapminder$country), value = TRUE)
+
+
We will now store these countries in a variable called
+candidateCountries, and then loop over each entry in the variable.
+Inside the loop, we evaluate the average life expectancy for each
+country, and if the average life expectancy is less than 50 we use
+base-plot to plot the evolution of average life expectancy using
+with() and subset():
+
+
R
+
+
+thresholdValue <- 50
+candidateCountries <- grep("^B", unique(gapminder$country), value = TRUE)
+
+for (iCountry in candidateCountries) {
+  tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"])
+
+  if (tmp < thresholdValue) {
+    cat("Average Life Expectancy in", iCountry, "is less than", thresholdValue, "plotting life expectancy graph... \n")
+
+    with(subset(gapminder, country == iCountry),
+      plot(year, lifeExp,
+        type = "o",
+        main = paste("Life Expectancy in", iCountry, "over time"),
+        ylab = "Life Expectancy",
+        xlab = "Year"
+      ) # end plot
+    ) # end with
+  } # end if
+  rm(tmp)
+} # end for loop
Today we’ll be learning about the ggplot2 package, because it is the
+most effective for creating publication-quality graphics.
+
ggplot2 is built on the grammar of graphics, the idea that any plot
+can be built from the same set of components: a data
+set, mapping aesthetics, and graphical
+layers:
+
Data sets are the data that you, the user,
+provide.
+
Mapping aesthetics are what connect the data to
+the graphics. They tell ggplot2 how to use your data to affect how the
+graph looks, such as changing what is plotted on the X or Y axis, or the
+size or color of different data points.
+
Layers are the actual graphical output from
+ggplot2. Layers determine what kinds of plot are shown (scatterplot,
+histogram, etc.), the coordinate system used (rectangular, polar,
+others), and other important aspects of the plot. The idea of layers of
+graphics may be familiar to you if you have used image editing programs
+like Photoshop, Illustrator, or Inkscape.
+
Let’s start off building an example using the gapminder data from
+earlier. The most basic function is ggplot, which lets R
+know that we’re creating a new plot. Any of the arguments we give the
+ggplot function are the global options for the
+plot: they apply to all layers on the plot.
+
+
R
+
+
+library("ggplot2")
+ggplot(data = gapminder)
+
+
Here we called ggplot and told it what data we want to
+show on our figure. This is not enough information for
+ggplot to actually draw anything. It only creates a blank
+slate for other elements to be added to.
+
Now we’re going to add in the mapping aesthetics
+using the aes function. aes tells
+ggplot how variables in the data map to
+aesthetic properties of the figure, such as which columns of
+the data should be used for the x and
+y locations.
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
+
+
Here we told ggplot we want to plot the “gdpPercap”
+column of the gapminder data frame on the x-axis, and the “lifeExp”
+column on the y-axis. Notice that we didn’t need to explicitly pass
+aes these columns
+(e.g. x = gapminder[, "gdpPercap"]). This is because
+ggplot knows to look in the
+data for that column.
+
The final part of making our plot is to tell ggplot how
+we want to visually represent the data. We do this by adding a new
+layer to the plot using one of the
+geom functions.
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point()
+
+
Here we used geom_point, which tells ggplot
+we want to visually represent the relationship between
+x and y as a scatterplot of
+points.
+
+
+
+
+
+
Challenge 1
+
+
+
Modify the example so that the figure shows how life expectancy has
+changed over time:
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point()
+
+
Hint: the gapminder dataset has a column called “year”, which should
+appear on the x-axis.
+
+
+
+
+
+
+
+
+
Here is one possible solution:
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) + geom_point()
+
+
+
+
+
+
+
+
+
+
+
Challenge 2
+
+
+
In the previous examples and challenge we’ve used the
+aes function to tell the scatterplot geom
+about the x and y locations of each
+point. Another aesthetic property we can modify is the point
+color. Modify the code from the previous challenge to
+color the points by the “continent” column. What trends
+do you see in the data? Are they what you expected?
+
+
+
+
+
+
+
+
+
The solution presented below adds color=continent to the
+call of the aes function. The general trend seems to
+indicate an increased life expectancy over the years. On continents with
+stronger economies we find a longer life expectancy.
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, color = continent)) +
+  geom_point()
+
+
+
+
+
+
Layers
+
+
Using a scatterplot probably isn’t the best for visualizing change
+over time. Instead, let’s tell ggplot to visualize the data
+as a line plot:
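A sketch of what this might look like, keeping the continent coloring from Challenge 2. The data frame demo is a made-up stand-in so the snippet runs without the lesson data; with the real data you would pass gapminder instead:

```r
library(ggplot2)

# Hypothetical stand-in for gapminder (made-up values)
demo <- data.frame(
  country   = rep(c("A", "B", "C"), each = 3),
  continent = rep(c("X", "X", "Y"), each = 3),
  year      = rep(c(1952, 1957, 1962), times = 3),
  lifeExp   = c(40, 42, 45, 60, 61, 63, 50, 53, 55)
)

# Same aesthetics as before, but drawn with a line layer instead of points
p_line <- ggplot(data = demo, mapping = aes(x = year, y = lifeExp, color = continent)) +
  geom_line()
p_line
```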
Instead of adding a geom_point layer, we’ve added a
+geom_line layer.
+
However, the result doesn’t look quite as we might have expected: it
+seems to be jumping around a lot in each continent. Let’s try to
+separate the data by country, plotting one line for each country:
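One way to do this, sketched here on a made-up stand-in data frame (demo is hypothetical; use gapminder in the lesson), is to add a group aesthetic so that each country gets its own line. The second plot in the sketch adds a point layer on top of the line layer:

```r
library(ggplot2)

# Hypothetical stand-in for gapminder (made-up values)
demo <- data.frame(
  country   = rep(c("A", "B", "C"), each = 3),
  continent = rep(c("X", "X", "Y"), each = 3),
  year      = rep(c(1952, 1957, 1962), times = 3),
  lifeExp   = c(40, 42, 45, 60, 61, 63, 50, 53, 55)
)

# group = country tells ggplot to draw one line per country
p_by_country <- ggplot(
  data = demo,
  mapping = aes(x = year, y = lifeExp, group = country, color = continent)
) +
  geom_line()

# Adding a second layer draws the points after (and so on top of) the lines
p_both <- p_by_country + geom_point()
p_both
```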
It’s important to note that each layer is drawn on top of the
+previous layer. In this example, the points have been drawn on top
+of the lines. Here’s a demonstration:
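A sketch of such a demonstration, again on a made-up stand-in data frame (demo is hypothetical): the color mapping is given to the line layer only, so the points keep the default color and are easy to tell apart from the lines.

```r
library(ggplot2)

# Hypothetical stand-in for gapminder (made-up values)
demo <- data.frame(
  country   = rep(c("A", "B", "C"), each = 3),
  continent = rep(c("X", "X", "Y"), each = 3),
  year      = rep(c(1952, 1957, 1962), times = 3),
  lifeExp   = c(40, 42, 45, 60, 61, 63, 50, 53, 55)
)

p_demo <- ggplot(data = demo, mapping = aes(x = year, y = lifeExp, group = country)) +
  geom_line(mapping = aes(color = continent)) +  # color applies to this layer only
  geom_point()                                   # drawn second, so on top of the lines
p_demo
```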
In this example, the aesthetic mapping of
+color has been moved from the global plot options in
+ggplot to the geom_line layer so it no longer
+applies to the points. Now we can clearly see that the points are drawn
+on top of the lines.
+
+
+
+
+
+
Tip: Setting an aesthetic to a value instead
+of a mapping
+
+
+
So far, we’ve seen how to use an aesthetic (such as
+color) as a mapping to a variable in the data.
+For example, when we use
+geom_line(mapping = aes(color=continent)), ggplot will give
+a different color to each continent. But what if we want to change the
+color of all lines to blue? You may think that
+geom_line(mapping = aes(color="blue")) should work, but it
+doesn’t. Since we don’t want to create a mapping to a specific variable,
+we can move the color specification outside of the aes()
+function, like this: geom_line(color="blue").
+
+
+
+
+
+
+
+
+
Challenge 3
+
+
+
Switch the order of the point and line layers from the previous
+example. What happened?
ggplot2 also makes it easy to overlay statistical models over the
+data. To demonstrate we’ll go back to our first example:
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point()
+
+
Currently it’s hard to see the relationship between the points due to
+some strong outliers in GDP per capita. We can change the scale of units
+on the x axis using the scale functions. These control the
+mapping between the data values and visual values of an aesthetic. We
+can also modify the transparency of the points, using the alpha
+argument, which is especially helpful when you have a large amount of
+data which is very clustered.
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10()
+
+
The scale_x_log10 function applied a transformation to
+the coordinate system of the plot, so that each multiple of 10 is evenly
+spaced from left to right. For example, a GDP per capita of 1,000 is the
+same horizontal distance away from a value of 10,000 as the 10,000 value
+is from 100,000. This helps to visualize the spread of the data along
+the x-axis.
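The equal spacing can be checked with a little arithmetic in base R. This only illustrates the logarithm itself, not how scale_x_log10 is implemented:

```r
# The three GDP per capita values from the example above
gdp_values <- c(1000, 10000, 100000)

# Their positions on a log10 axis
positions <- log10(gdp_values)   # 3 4 5

# Distance between neighbouring values on that axis
spacing <- diff(positions)       # 1 1
spacing
```

Each tenfold increase moves the same distance (one unit) along the transformed axis.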
+
+
+
+
+
+
Tip Reminder: Setting an aesthetic to a value
+instead of a mapping
+
+
+
Notice that we used geom_point(alpha = 0.5). As the
+previous tip mentioned, using a setting outside of the
+aes() function will cause this value to be used for all
+points, which is what we want in this case. But just like any other
+aesthetic setting, alpha can also be mapped to a variable in
+the data. For example, we can give a different transparency to each
+continent with
+geom_point(mapping = aes(alpha = continent)).
+
+
+
+
We can fit a simple relationship to the data by adding another layer,
+geom_smooth:
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method = "lm")
+
+
+
OUTPUT
+
+
`geom_smooth()` using formula = 'y ~ x'
+
+
We can make the line thicker by setting the
+size aesthetic in the geom_smooth
+layer:
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method = "lm", size = 1.5)
+
+
+
WARNING
+
+
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
+ℹ Please use `linewidth` instead.
+This warning is displayed once every 8 hours.
+Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
+generated.
+
+
+
OUTPUT
+
+
`geom_smooth()` using formula = 'y ~ x'
+
+
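There are two ways an aesthetic can be specified. Here we
+set the size aesthetic by passing it as an
+argument to geom_smooth. Previously in the lesson we’ve
+used the aes function to define a mapping between
+data variables and their visual representation. As the warning above
+notes, ggplot2 3.4.0 and later prefer the linewidth argument over size
+for setting the thickness of lines.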
+
+
+
+
+
+
Challenge 4a
+
+
+
Modify the color and size of the points on the point layer in the
+previous example.
+
Hint: do not use the aes function.
+
+
+
+
+
+
+
+
+
Here is a possible solution. Notice that the color argument
+is supplied outside of the aes() function. This means that
+it applies to all data points on the graph and is not related to a
+specific variable.
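The solution code is missing at this point. A sketch consistent with the description above might look like this; the data frame demo and the particular color and size values are illustrative assumptions, and linewidth is used for the trend line as the deprecation warning shown earlier recommends (this assumes ggplot2 3.4.0 or later):

```r
library(ggplot2)

# Hypothetical stand-in for gapminder (made-up values)
demo <- data.frame(
  gdpPercap = c(500, 900, 1500, 3000, 6000, 12000, 20000, 35000),
  lifeExp   = c(40, 45, 50, 58, 65, 70, 76, 80)
)

p_points <- ggplot(data = demo, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(size = 3, color = "orange") +     # set outside aes(): applies to every point
  scale_x_log10() +
  geom_smooth(method = "lm", linewidth = 1.5)  # linewidth assumes ggplot2 >= 3.4.0
p_points
```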
Challenge 4b
+
+
+
Modify your solution to Challenge 4a so that the points are now a
+different shape and are colored by continent with new trendlines. Hint:
+The color argument can be used inside the aesthetic.
+
+
+
+
+
+
+
+
+
Here is a possible solution. Notice that supplying the
+color argument inside the aes() function
+enables you to connect it to a certain variable. The shape
+argument, as you can see, modifies all data points the same way (it is
+outside the aes() call), while the color
+argument, which is placed inside the aes() call, modifies a
+point’s color based on its continent value.
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
+  geom_point(size = 3, shape = 17) + scale_x_log10() +
+  geom_smooth(method = "lm", size = 1.5)
+
+
+
OUTPUT
+
+
`geom_smooth()` using formula = 'y ~ x'
+
+
+
+
+
+
Multi-panel figures
+
+
Earlier we visualized the change in life expectancy over time across
+all countries in one plot. Alternatively, we can split this out over
+multiple panels by adding a layer of facet panels.
+
+
+
+
+
+
Tip
+
+
+
We start by making a subset of data including only countries located
+in the Americas. This includes 25 countries, which will begin to clutter
+the figure. Note that we apply a “theme” definition to rotate the x-axis
+labels to maintain readability. Nearly everything in ggplot2 is
+customizable.
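The code for this step is not shown here. A sketch of what it might look like follows, using a made-up stand-in data frame (demo, with invented countries) in place of gapminder; the 90-degree rotation matches the theme call used later in this section:

```r
library(ggplot2)

# Hypothetical stand-in for gapminder (made-up values)
demo <- data.frame(
  country   = rep(c("A", "B", "C"), each = 3),
  continent = rep(c("Americas", "Americas", "Europe"), each = 3),
  year      = rep(c(1952, 1957, 1962), times = 3),
  lifeExp   = c(40, 42, 45, 60, 61, 63, 50, 53, 55)
)

# Keep only the rows for countries in the Americas
americas <- demo[demo$continent == "Americas", ]

p_facets <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
  geom_line() +
  facet_wrap(~ country) +                                   # one panel per country
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # rotate x-axis labels
p_facets
```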
The facet_wrap layer took a “formula” as its argument,
+denoted by the tilde (~). This tells R to draw a panel for each unique
+value in the country column of the gapminder dataset.
+
Modifying text
+
+
To clean this figure up for a publication we need to change some of
+the text elements. The x-axis is too cluttered, and the y axis should
+read “Life expectancy”, rather than the column name in the data
+frame.
+
We can do this by adding a couple of different layers. The
+theme layer controls the axis text, and overall text
+size. Labels for the axes, plot title and any legend can be set using
+the labs function. Legend titles are set using the same
+names we used in the aes specification. Thus below the
+color legend title is set using color = "Continent", while
+the title of a fill legend would be set using
+fill = "MyTitle".
+
+
R
+
+
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color = continent)) +
+  geom_line() + facet_wrap(~ country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+
+
Exporting the plot
+
+
The ggsave() function allows you to export a plot
+created with ggplot. You can specify the dimension and resolution of
+your plot by adjusting the appropriate arguments (width,
+height and dpi) to create high quality
+graphics for publication. In order to save the plot from above, we first
+assign it to a variable lifeExp_plot, then tell
+ggsave to save that plot in png format to a
+directory called results. (Make sure you have a
+results/ folder in your working directory.)
+
+
R
+
+
+lifeExp_plot <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color = continent)) +
+  geom_line() + facet_wrap(~ country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+
+ggsave(filename = "results/lifeExp.png", plot = lifeExp_plot, width = 12, height = 10, dpi = 300, units = "cm")
+
+
There are two nice things about ggsave. First, it
+defaults to the last plot, so if you omit the plot argument
+it will automatically save the last plot you created with
+ggplot. Secondly, it tries to determine the format you want
+to save your plot in from the file extension you provide for the
+filename (for example .png or .pdf). If you
+need to, you can specify the format explicitly in the
+device argument.
+
This is a taste of what you can do with ggplot2. RStudio provides a
+really useful cheat
+sheet of the different layers available, and more extensive
+documentation is available on the ggplot2 website. All
+RStudio cheat sheets can be found here. Finally,
+if you have no idea how to change something, a quick Google search will
+usually send you to a relevant question and answer on Stack Overflow
+with reusable code to modify!
+
+
+
+
+
+
Challenge 5
+
+
+
Generate boxplots to compare life expectancy between the different
+continents during the available years.
+
Advanced:
+
Rename y axis as Life Expectancy.
+
Remove x axis labels.
+
+
+
+
+
+
+
+
+
Here is a possible solution. xlab() and ylab()
+set labels for the x and y axes, respectively. The axis title, text and
+ticks are attributes of the theme and must be modified within a
+theme() call.
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = continent, y = lifeExp, fill = continent)) +
+  geom_boxplot() + facet_wrap(~ year) +
+  ylab("Life Expectancy") +
+  theme(axis.title.x = element_blank(),
+        axis.text.x = element_blank(),
+        axis.ticks.x = element_blank())
+
+
+
+
+
+
+
+
+
+
+
Keypoints
+
+
+
Use ggplot2 to create plots.
+
Think about graphics in layers: aesthetics, geometry, statistics,
+scale transformation, and grouping.
How can I operate on all the elements of a vector at once?
+
+
+
+
+
+
+
Objectives
+
To understand vectorized operations in R.
+
+
+
+
+
+
Most of R’s functions are vectorized, meaning that the function will
+operate on all elements of a vector without needing to loop through and
+act on each element one at a time. This makes code more concise,
+easier to read, and less error-prone.
+
+
R
+
+
+x <- 1:4
+x * 2
+
+
+
OUTPUT
+
+
[1] 2 4 6 8
+
+
The multiplication happened to each element of the vector.
+
We can also add two vectors together:
+
+
R
+
+
+y <- 6:9
+x + y
+
+
+
OUTPUT
+
+
[1] 7 9 11 13
+
+
Each element of x was added to its corresponding element
+of y:
+
+
R
+
+
x:  1  2  3  4
+    +  +  +  +
+y:  6  7  8  9
+---------------
+    7  9 11 13
+
+
Here is how we would add two vectors together using a for loop:
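The loop itself is not shown here. A minimal sketch, assuming x and y are the vectors defined above (output_vector is a name chosen for this illustration):

```r
x <- 1:4
y <- 6:9

# Pre-allocate the result, then fill it in one element at a time
output_vector <- numeric(length(x))
for (i in seq_along(x)) {
  output_vector[i] <- x[i] + y[i]
}
output_vector
```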
Compare this to the output using vectorized operations.
+
+
R
+
+
+sum_xy <- x + y
+sum_xy
+
+
+
OUTPUT
+
+
[1] 7 9 11 13
+
+
+
+
+
+
+
Challenge 1
+
+
+
Let’s try this on the pop column of the
+gapminder dataset.
+
Make a new column in the gapminder data frame that
+contains population in units of millions of people. Check the head or
+tail of the data frame to make sure it worked.
+
+
+
+
+
+
+
+
+
Let’s try this on the pop column of the
+gapminder dataset.
+
Make a new column in the gapminder data frame that
+contains population in units of millions of people. Check the head or
+tail of the data frame to make sure it worked.
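A sketch of one possible answer. The data frame demo is a made-up stand-in for gapminder, and pop_millions is a column name chosen for this illustration:

```r
# Hypothetical stand-in for gapminder (made-up population values)
demo <- data.frame(
  country = c("A", "B", "C"),
  pop     = c(8425333, 1282697, 31889923)
)

# Division is vectorized: every element of pop is divided by one million
demo$pop_millions <- demo$pop / 1e6
head(demo)
```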
Operations can also be performed on vectors of unequal length,
+through a process known as recycling. This process
+automatically repeats the shorter vector until it matches the length of
+the longer vector. R will provide a warning if the length of the longer
+vector is not a multiple of the length of the shorter vector.
+
+
R
+
+
+x <- c(1, 2, 3)
+y <- c(1, 2, 3, 4, 5, 6, 7)
+x + y
+
+
+
WARNING
+
+
Warning in x + y: longer object length is not a multiple of shorter object
+length
+
+
+
OUTPUT
+
+
[1] 2 4 6 5 7 9 8
+
+
Vector x was recycled to match the length of vector
+y.
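To make the recycling visible, we can spell it out by hand with rep_len(), which repeats a vector up to a given length. This is only an illustration of the idea; R does the equivalent automatically:

```r
x <- c(1, 2, 3)
y <- c(1, 2, 3, 4, 5, 6, 7)

# x repeated until it is as long as y
x_recycled <- rep_len(x, length(y))   # 1 2 3 1 2 3 1

# Element-by-element sum, the same values as x + y above
manual_sum <- x_recycled + y          # 2 4 6 5 7 9 8
manual_sum
```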
Check argument conditions with stopifnot() in
+functions.
+
Test a function.
+
Set default values for function arguments.
+
Explain why we should divide programs into small, single-purpose
+functions.
+
+
+
+
+
+
If we only had one data set to analyze, it would probably be faster
+to load the file into a spreadsheet and use that to plot simple
+statistics. However, the gapminder data is updated periodically, and we
+may want to pull in that new information later and re-run our analysis
+again. We may also obtain similar data from a different source in the
+future.
+
In this lesson, we’ll learn how to write a function so that we can
+repeat several operations with a single command.
+
+
+
+
+
+
What is a function?
+
+
+
Functions gather a sequence of operations into a whole, preserving it
+for ongoing use. Functions provide:
+
a name we can remember and invoke it by
+
relief from the need to remember the individual operations
+
a defined set of inputs and expected outputs
+
rich connections to the larger programming environment
+
As the basic building block of most programming languages,
+user-defined functions constitute “programming” as much as any single
+abstraction can. If you have written a function, you are a computer
+programmer.
+
+
+
+
Defining a function
+
+
Let’s open a new R script file in the functions/
+directory and call it functions-lesson.R.
+
The general structure of a function is:
+
+
R
+
+
+my_function <- function(parameters) {
+  # perform action
+  # return value
+}
+
+
Let’s define a function fahr_to_kelvin() that converts
+temperatures from Fahrenheit to Kelvin:
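The definition is not reproduced at this point. A version consistent with the conversion formula used later in this lesson would be:

```r
fahr_to_kelvin <- function(temp) {
  # Shift to the freezing point, rescale degrees, then offset to Kelvin
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}
```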
We define fahr_to_kelvin() by assigning it to the output
+of function. The list of argument names is contained
+within parentheses. Next, the body of
+the function (the statements that are executed when it runs) is contained
+within curly braces ({}). The statements in the body are
+indented by two spaces. This makes the code easier to read but does not
+affect how the code operates.
+
It is useful to think of creating functions like writing a cookbook.
+First you define the “ingredients” that your function needs. In this
+case, we only need one ingredient to use our function: “temp”. After we
+list our ingredients, we then say what we will do with them. In this
+case, we are taking our ingredient and applying a set of mathematical
+operators to it.
+
When we call the function, the values we pass to it as arguments are
+assigned to those variables so that we can use them inside the function.
+Inside the function, we use a return statement to send a
+result back to whoever asked for it.
+
+
+
+
+
+
Tip
+
+
+
In R, an explicit return statement is not required:
+R automatically returns the value of the last expression evaluated in the
+body of the function. But for clarity, we will explicitly define the
+return statement.
+
+
+
+
Let’s try running our function. Calling our own function is no
+different from calling any other function:
+
+
R
+
+
+# freezing point of water
+fahr_to_kelvin(32)
+
+
+
OUTPUT
+
+
[1] 273.15
+
+
+
R
+
+
+# boiling point of water
+fahr_to_kelvin(212)
+
+
+
OUTPUT
+
+
[1] 373.15
+
+
+
+
+
+
+
Challenge 1
+
+
+
Write a function called kelvin_to_celsius() that takes a
+temperature in Kelvin and returns that temperature in Celsius.
+
Hint: To convert from Kelvin to Celsius you subtract 273.15
+
+
+
+
+
+
+
+
+
Write a function called kelvin_to_celsius that takes a
+temperature in Kelvin and returns that temperature in Celsius
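A sketch of one possible answer, following the hint in the challenge (subtract 273.15):

```r
kelvin_to_celsius <- function(temp) {
  celsius <- temp - 273.15
  return(celsius)
}

kelvin_to_celsius(273.15)   # freezing point of water, 0 degrees Celsius
```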
Now that we’ve begun to appreciate how writing functions provides an
+efficient way to make R code re-usable and modular, we should note that
+it is important to ensure that functions only work in their intended
+use-cases. Checking function parameters is related to the concept of
+defensive programming. Defensive programming encourages us to
+frequently check conditions and throw an error if something is wrong.
+These checks are referred to as assertion statements because we want to
+assert some condition is TRUE before proceeding. They make
+it easier to debug because they give us a better idea of where the
+errors originate.
+
+
Checking conditions with stopifnot()
+
+
Let’s start by re-examining fahr_to_kelvin(), our
+function for converting temperatures from Fahrenheit to Kelvin. It was
+defined like so:
For this function to work as intended, the argument temp
+must be a numeric value; otherwise, the mathematical
+procedure for converting between the two temperature scales will not
+work. To create an error, we can use the function stop().
+For example, since the argument temp must be a
+numeric vector, we could check for this condition with an
+if statement and throw an error if the condition was
+violated. We could augment our function above like so:
+
+
R
+
+
+fahr_to_kelvin <- function(temp) {
+  if (!is.numeric(temp)) {
+    stop("temp must be a numeric vector.")
+  }
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+
If we had multiple conditions or arguments to check, it would take
+many lines of code to check all of them. Luckily R provides the
+convenience function stopifnot(). We can list as many
+requirements as we like, each of which should evaluate to TRUE;
+stopifnot() throws an error if it finds one that is
+FALSE. Listing these conditions also serves a secondary
+purpose as extra documentation for the function.
+
Let’s try out defensive programming with stopifnot() by
+adding assertions to check the input to our function
+fahr_to_kelvin().
+
We want to assert the following: temp is a numeric
+vector. We may do that like so:
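The code for this step is missing here. A sketch consistent with the error message shown further down (is.numeric(temp) is not TRUE) would be:

```r
fahr_to_kelvin <- function(temp) {
  # Stop with an error unless temp is numeric
  stopifnot(is.numeric(temp))
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}
```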
It still works when given proper input.
+
+
R
+
+
+# freezing point of water
+fahr_to_kelvin(temp = 32)
+
+
+
OUTPUT
+
+
[1] 273.15
+
+
But it fails instantly if given improper input.
+
+
R
+
+
+# temp is a factor instead of numeric
+fahr_to_kelvin(temp = as.factor(32))
+
+
+
ERROR
+
+
Error in fahr_to_kelvin(temp = as.factor(32)): is.numeric(temp) is not TRUE
+
+
+
+
+
+
+
Challenge 3
+
+
+
Use defensive programming to ensure that our
+fahr_to_celsius() function throws an error immediately if
+the argument temp is specified inappropriately.
+
+
+
+
+
+
+
+
+
Extend our previous definition of the function by adding in an
+explicit call to stopifnot(). Since
+fahr_to_celsius() is a composition of two other functions,
+checking inside here makes adding checks to the two component functions
+redundant.
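A sketch of what this could look like. The two component functions are restated in compact form (an assumption, consistent with the formulas used in this lesson) so that the snippet runs on its own:

```r
# Assumed component functions, restated so the example is self-contained
fahr_to_kelvin <- function(temp) {
  ((temp - 32) * (5 / 9)) + 273.15
}
kelvin_to_celsius <- function(temp) {
  temp - 273.15
}

fahr_to_celsius <- function(temp) {
  # Check the input once, here, before calling the component functions
  stopifnot(is.numeric(temp))
  temp_k <- fahr_to_kelvin(temp)
  result <- kelvin_to_celsius(temp_k)
  return(result)
}
```

Calling fahr_to_celsius(32) should give a value of 0 (up to floating-point rounding), while fahr_to_celsius(as.factor(32)) stops with an error.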
Now, we’re going to define a function that calculates the Gross
+Domestic Product of a nation from the data available in our dataset:
+
+
R
+
+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat) {
+  gdp <- dat$pop * dat$gdpPercap
+  return(gdp)
+}
+
+
We define calcGDP() by assigning it to the output of
+function. The list of argument names is contained within
+parentheses. Next, the body of the function (the statements executed
+when you call the function) is contained within curly braces
+({}).
+
We’ve indented the statements in the body by two spaces. This makes
+the code easier to read but does not affect how it operates.
+
When we call the function, the values we pass to it are assigned to
+the arguments, which become variables inside the body of the
+function.
+
Inside the function, we use the return() function to
+send back the result. This return() function is optional: R
+will automatically return the results of whatever command is executed on
+the last line of the function.
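Calling the function on a few rows shows what it returns. The data frame below is a made-up stand-in for gapminder with only the two columns the function needs:

```r
calcGDP <- function(dat) {
  gdp <- dat$pop * dat$gdpPercap
  return(gdp)
}

# Hypothetical stand-in for gapminder (made-up values)
demo <- data.frame(pop = c(1000, 2000), gdpPercap = c(500, 1500))

# The result is a bare numeric vector with one value per row
gdp_values <- calcGDP(demo)
gdp_values
```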
That’s not very informative. Let’s add some more arguments so we can
+extract that per year and country.
+
+
R
+
+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year = NULL, country = NULL) {
+  if (!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country, ]
+  }
+  gdp <- dat$pop * dat$gdpPercap
+
+  new <- cbind(dat, gdp = gdp)
+  return(new)
+}
+
+
If you’ve been writing these functions down into a separate R script
+(a good idea!), you can load the functions into your R session by
+using the source() function:
+
+
R
+
+
+source("functions/functions-lesson.R")
+
+
Ok, so there’s a lot going on in this function now. In plain English,
+the function now subsets the provided data by year if the year argument
+isn’t empty, then subsets the result by country if the country argument
+isn’t empty. Then it calculates the GDP for whatever subset emerges from
+the previous two steps. The function then adds the GDP as a new column
+to the subsetted data and returns this as the final result. You can see
+that the output is much more informative than a vector of numbers.
+
Let’s take a look at what happens when we specify the year:
+
+
R
+
+
+head(calcGDP(gapminder, year=2007))
+
+
+
OUTPUT
+
+
country year pop continent lifeExp gdpPercap gdp
+12 Afghanistan 2007 31889923 Asia 43.828 974.5803 31079291949
+24 Albania 2007 3600523 Europe 76.423 5937.0295 21376411360
+36 Algeria 2007 33333216 Africa 72.301 6223.3675 207444851958
+48 Angola 2007 12420476 Africa 42.731 4797.2313 59583895818
+60 Argentina 2007 40301927 Americas 75.320 12779.3796 515033625357
+72 Australia 2007 20434176 Oceania 81.235 34435.3674 703658358894
+
+
Or for a specific country:
+
+
R
+
+
+calcGDP(gapminder, country="Australia")
+
+
+
OUTPUT
+
+
country year pop continent lifeExp gdpPercap gdp
+61 Australia 1952 8691212 Oceania 69.120 10039.60 87256254102
+62 Australia 1957 9712569 Oceania 70.330 10949.65 106349227169
+63 Australia 1962 10794968 Oceania 70.930 12217.23 131884573002
+64 Australia 1967 11872264 Oceania 71.100 14526.12 172457986742
+65 Australia 1972 13177000 Oceania 71.930 16788.63 221223770658
+66 Australia 1977 14074100 Oceania 73.490 18334.20 258037329175
+67 Australia 1982 15184200 Oceania 74.740 19477.01 295742804309
+68 Australia 1987 16257249 Oceania 76.320 21888.89 355853119294
+69 Australia 1992 17481977 Oceania 77.560 23424.77 409511234952
+70 Australia 1997 18565243 Oceania 78.830 26997.94 501223252921
+71 Australia 2002 19546792 Oceania 80.370 30687.75 599847158654
+72 Australia 2007 20434176 Oceania 81.235 34435.37 703658358894
Here we’ve added two arguments, year, and
+country. We’ve set default arguments for both as
+NULL using the = operator in the function
+definition. This means that those arguments will take on those values
+unless the user specifies otherwise.
Here, we check whether each additional argument is set to
+NULL, and whenever they’re not NULL overwrite
+the dataset stored in dat with a subset given by the
+non-NULL argument.
+
Building these conditionals into the function makes it more flexible
+for later. Now, we can use it to calculate the GDP for:
+
The whole dataset;
+
A single year;
+
A single country;
+
A single combination of year and country.
+
By using %in% instead of ==, we can also give multiple years
+or countries to those arguments.
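A small base R comparison shows why %in% is the safer choice when several values are supplied. The vectors here are made up for the illustration:

```r
years  <- c(1952, 1957, 1962, 1967)
wanted <- c(1967, 1952)

# == recycles `wanted` and compares position by position
eq_result <- years == wanted     # FALSE FALSE FALSE FALSE

# %in% asks, for each year, whether it appears anywhere in `wanted`
in_result <- years %in% wanted   # TRUE FALSE FALSE TRUE
in_result
```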
+
+
+
+
+
+
Tip: Pass by value
+
+
+
Functions in R almost always make copies of the data to operate on
+inside of a function body. When we modify dat inside the
+function we are modifying the copy of the gapminder dataset stored in
+dat, not the original variable we gave as the first
+argument.
+
This is called “pass-by-value” and it makes writing code much safer:
+you can always be sure that whatever changes you make within the body of
+the function stay inside the body of the function.
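A short sketch of this behaviour, using a hypothetical helper add_column() and a made-up data frame:

```r
# Hypothetical helper: adds a column to its argument
add_column <- function(dat) {
  dat$extra <- 1   # changes the function's own copy only
  return(dat)
}

original <- data.frame(a = 1:3)
modified <- add_column(original)

names(original)   # "a": the original is untouched
names(modified)   # "a" "extra"
```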
+
+
+
+
+
+
+
+
+
Tip: Function scope
+
+
+
Another important concept is scoping: any variables (or functions!)
+you create or modify inside the body of a function only exist for the
+lifetime of the function’s execution. When we call
+calcGDP(), the variables dat, gdp
+and new only exist inside the body of the function. Even if
+we have variables of the same name in our interactive R session, they
+are not modified in any way when executing a function.
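The same idea in a few lines, with a hypothetical function calc_demo() and a session variable that happens to share a name with a variable inside the function:

```r
gdp <- "value in the session"

calc_demo <- function() {
  gdp <- 42   # local variable; exists only while the function runs
  return(gdp)
}

inside <- calc_demo()
inside   # 42
gdp      # still "value in the session"
```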
+
+
+
+
+
R
+
+
gdp <- dat$pop * dat$gdpPercap
+ new <-cbind(dat, gdp=gdp)
+return(new)
+}
+
+
Finally, we calculated the GDP on our new subset, and created a new
+data frame with that column added. This means when we call the function
+later we can see the context for the returned GDP values, which is much
+better than in our first attempt where we got a vector of numbers.
+
+
+
+
+
+
Challenge 4
+
+
+
Test out your GDP function by calculating the GDP for New Zealand in
+1987. How does this differ from New Zealand’s GDP in 1952?
+
+
+
+
+
+
+
+
+
+
R
+
+
+calcGDP(gapminder, year =c(1952, 1987), country ="New Zealand")
+
+
GDP for New Zealand in 1987: 65050008703
+
GDP for New Zealand in 1952: 21058193787
+
+
+
+
+
+
+
+
+
+
Challenge 5
+
+
+
The paste() function can be used to combine text
+together, e.g.:
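The example itself is missing here. A sketch, assuming best_practice holds the words that appear in the expected output further down:

```r
best_practice <- c("Write", "programs", "for", "people", "not", "computers")

# collapse joins the elements of one vector into a single string
sentence <- paste(best_practice, collapse = " ")
sentence
```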
Write a function called fence() that takes two vectors
+as arguments, called text and wrapper, and
+prints out the text wrapped with the wrapper:
+
+
R
+
+
+fence(text=best_practice, wrapper="***")
+
+
Note: the paste() function has an argument
+called sep, which specifies the separator between text. The
+default is a space: " ". The default for paste0() is no
+space: "".
+
+
+
+
+
+
+
+
+
Write a function called fence() that takes two vectors
+as arguments, called text and wrapper, and
+prints out the text wrapped with the wrapper:
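The solution code is not shown here. One way to write it, as a sketch (best_practice is assumed to hold the words from the expected output):

```r
fence <- function(text, wrapper) {
  # Put the wrapper before and after the text, then join with spaces
  text <- c(wrapper, text, wrapper)
  result <- paste(text, collapse = " ")
  return(result)
}

best_practice <- c("Write", "programs", "for", "people", "not", "computers")
fenced <- fence(text = best_practice, wrapper = "***")
fenced
```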
[1] "*** Write programs for people not computers ***"
+
+
+
+
+
+
+
+
+
+
+
Tip
+
+
+
R has some unique aspects that can be exploited when performing more
+complicated operations. We will not be writing anything that requires
+knowledge of these more advanced concepts. In the future when you are
+comfortable writing functions in R, you can learn more by reading the R
+Language Manual or this chapter from Advanced R Programming by Hadley
+Wickham.
+
+
+
+
+
+
+
+
+
Tip: Testing and documenting
+
+
+
It’s important to both test functions and document them:
+documentation helps you, and others, understand what the purpose of your
+function is and how to use it, and it’s important to make sure that your
+function actually does what you think.
+
When you first start out, your workflow will probably look a lot like
+this:
+
Write a function
+
Comment parts of the function to document its behaviour
+
Load in the source file
+
Experiment with it in the console to make sure it behaves as you
+expect
+
Make any necessary bug fixes
+
Rinse and repeat.
+
Formal documentation for functions, written in separate
+.Rd files, gets turned into the documentation you see in
+help files. The roxygen2
+package allows R coders to write documentation alongside the function
+code and then process it into the appropriate .Rd files.
+You will want to switch to this more formal method of writing
+documentation when you start writing more complicated R projects. In
+fact, packages are, in essence, bundles of functions with this formal
+documentation. Loading your own functions through
+source("functions.R") is similar to loading someone
+else’s functions (or your own one day!) through
+library("package").
+
Formal automated tests can be written using the testthat package.
+
+
+
+
+
+
+
+
+
Keypoints
+
+
+
Use function to define a new function in R.
+
Use parameters to pass values into functions.
+
Use stopifnot() to flexibly check function arguments in
+R.
You have already seen how to save the most recent plot you create in
+ggplot2, using the command ggsave. As a
+refresher:
+
+
R
+
+
+ggsave("My_most_recent_plot.pdf")
+
+
You can save a plot from within RStudio using the ‘Export’ button in
+the ‘Plot’ window. This will give you the option of saving as a .pdf or
+as .png, .jpg or other image formats.
+
Sometimes you will want to save plots without creating them in the
+‘Plot’ window first. Perhaps you want to make a pdf document with
+multiple pages: each one a different plot, for example. Or perhaps
+you’re looping through multiple subsets of a file, plotting data from
+each subset, and you want to save each plot, but obviously can’t stop
+the loop to click ‘Export’ for each one.
+
In this case you can use a more flexible approach. The function
+pdf creates a new pdf device. You can control the size and
+resolution using the arguments to this function.
+
+
R
+
+
+pdf("Life_Exp_vs_time.pdf", width = 12, height = 4)
+ggplot(data = gapminder, aes(x = year, y = lifeExp, colour = country)) +
+  geom_line() +
+  theme(legend.position = "none")
+
+# You then have to make sure to turn off the pdf device!
+
+dev.off()
+
+
Open up this document and have a look.
+
+
+
+
+
+
Challenge 1
+
+
+
Rewrite your ‘pdf’ command to print a second page in the pdf, showing
+a facet plot (hint: use facet_grid) of the same data with
+one panel per continent.
+
+
diff --git a/12-plyr.html b/12-plyr.html
new file mode 100644
index 000000000..7b00811df
--- /dev/null
+++ b/12-plyr.html
@@ -0,0 +1,1011 @@
+
+R for Reproducible Scientific Analysis: Splitting and Combining Data Frames with plyr
+
How can I do different calculations on different sets of data?
+
+
+
+
+
+
+
Objectives
+
To be able to use the split-apply-combine strategy for data
+analysis.
+
+
+
+
+
+
Previously we looked at how you can use functions to simplify your
+code. We defined the calcGDP function, which takes the
+gapminder dataset, and multiplies the population and GDP per capita
+columns. We also defined additional arguments so we could filter by
+year and country:
+
+
R
+
+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year = NULL, country = NULL) {
+  if (!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country, ]
+  }
+  gdp <- dat$pop * dat$gdpPercap
+
+  new <- cbind(dat, gdp = gdp)
+  return(new)
+}
+
+
A common task you’ll encounter when working with data is that you’ll
+want to run calculations on different groups within the data. In the
+above, we calculated the GDP by multiplying two columns together.
+But what if we wanted to calculate the mean GDP per continent?
+
We could run calcGDP and then take the mean of each
+continent:
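
A minimal sketch of that repetitive approach, using a made-up toy data frame (gap_toy) and a simplified stand-in for calcGDP (calcGDP_toy, without the year and country filters) so that the block runs on its own. Both names and all values are assumptions for illustration.

```r
# Toy stand-in for the gapminder data (values are made up)
gap_toy <- data.frame(
  country   = c("A", "B", "C", "D"),
  continent = c("Africa", "Africa", "Asia", "Asia"),
  year      = c(2007, 2007, 2007, 2007),
  pop       = c(10, 20, 30, 40),
  gdpPercap = c(1, 2, 3, 4)
)

# Simplified version of calcGDP: only adds the gdp column
calcGDP_toy <- function(dat) {
  cbind(dat, gdp = dat$pop * dat$gdpPercap)
}

withGDP <- calcGDP_toy(gap_toy)

# One near-identical line per continent
mean_africa <- mean(withGDP[withGDP$continent == "Africa", "gdp"])
mean_asia   <- mean(withGDP[withGDP$continent == "Asia", "gdp"])
```

Every extra continent needs another copy of the same line with one word changed.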
But this isn’t very nice. Yes, by using a function, you have
+reduced a substantial amount of repetition. That is
+nice. But there is still repetition. Repeating yourself will cost you
+time, both now and later, and potentially introduce some nasty bugs.
+
We could write a new function that is flexible like
+calcGDP, but this also takes a substantial amount of effort
+and testing to get right.
+
The abstract problem we’re encountering here is known as
+“split-apply-combine”:
+
We want to split our data into groups, in this case
+continents, apply some calculations on that group, then
+optionally combine the results together afterwards.
+
The plyr package
+
+
For those of you who have used R before, you might be familiar with
+the apply family of functions. While R’s built in functions
+do work, we’re going to introduce you to another method for solving the
+“split-apply-combine” problem. The plyr package provides a set of
+functions that we find more user friendly for solving this problem.
+
We installed this package in an earlier challenge. Let us load it
+now:
+
+
R
+
+
+library("plyr")
+
+
Plyr has functions for operating on lists,
+data.frames and arrays (matrices, or
+n-dimensional vectors). Each function performs:
+
A splitting operation
+
+Apply a function on each split in turn.
+
Recombine output data as a single data object.
+
The functions are named based on the data structure they expect as
+input, and the data structure you want returned as output: [a]rray,
+[l]ist, or [d]ata.frame. The first letter corresponds to the input data
+structure, the second letter to the output data structure, and then the
+rest of the function is named “ply”.
+
This gives us 9 core functions **ply. There are an additional three
+functions which will only perform the split and apply steps, and not any
+combine step. They’re named by their input data type and represent null
+output by a _ (see table)
+
Note here that plyr’s use of “array” is different to R’s: an array in
+plyr can include a vector or matrix.
+
Each of the xxply functions (daply, ddply,
+llply, laply, …) has the same structure and
+4 key features:
+
+
R
+
+
+xxply(.data, .variables, .fun)
+
+
The first letter of the function name gives the input type and the
+second gives the output type.
+
.data - gives the data object to be processed
+
.variables - identifies the splitting variables
+
.fun - gives the function to be called on each piece
+
Now we can quickly calculate the mean GDP per continent:
continent V1
+1 Africa 20904782844
+2 Americas 379262350210
+3 Asia 227233738153
+4 Europe 269442085301
+5 Oceania 188187105354
+
+
Let us walk through the previous code:
+
The ddply function feeds in a data.frame
+(function starts with d) and returns another
+data.frame (2nd letter is a d)
+
The first argument we gave was the data.frame we wanted to operate
+on: in this case the gapminder data. We called calcGDP on
+it first so that it would have the additional gdp column
+added to it.
+
The second argument indicated our split criteria: in this case the
+“continent” column. Note that we gave the name of the column, not the
+values of the column like we had done previously with subsetting. Plyr
+takes care of these implementation details for you.
+
The third argument is the function we want to apply to each grouping
+of the data. We had to define our own short function here: each subset
+of the data gets stored in x, the first argument of our
+function. This is an anonymous function: we haven’t defined it
+elsewhere, and it has no name. It only exists in the scope of our call
+to ddply.
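
The three arguments described above can be tried out on a small scale. This is only a sketch: gap_toy and its values are made up, and the gdp column is typed in directly rather than produced by calcGDP.

```r
library("plyr")

# Toy data with a precomputed gdp column (values are made up)
gap_toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  gdp       = c(10, 40, 90, 160)
)

mean_gdp <- ddply(
  .data      = gap_toy,                    # data.frame to split
  .variables = "continent",                # name of the column to split on
  .fun       = function(x) mean(x$gdp)     # anonymous function run on each piece
)
print(mean_gdp)
```

Because the anonymous function returns an unnamed value, ddply stores the result in a column called V1, as in the output shown earlier.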
+
+
+
+
+
+
Challenge 1
+
+
+
Calculate the average life expectancy per continent. Which has the
+longest? Which has the shortest?
year
+continent 1952 1957 1962 1967 1972
+ Africa 5992294608 7359188796 8784876958 11443994101 15072241974
+ Americas 117738997171 140817061264 169153069442 217867530844 268159178814
+ Asia 34095762661 47267432088 60136869012 84648519224 124385747313
+ Europe 84971341466 109989505140 138984693095 173366641137 218691462733
+ Oceania 54157223944 66826828013 82336453245 105958863585 134112109227
+ year
+continent 1977 1982 1987 1992 1997
+ Africa 18694898732 22040401045 24107264108 26256977719 30023173824
+ Americas 324085389022 363314008350 439447790357 489899820623 582693307146
+ Asia 159802590186 194429049919 241784763369 307100497486 387597655323
+ Europe 255367522034 279484077072 316507473546 342703247405 383606933833
+ Oceania 154707711162 176177151380 209451563998 236319179826 289304255183
+ year
+continent 2002 2007
+ Africa 35303511424 45778570846
+ Americas 661248623419 776723426068
+ Asia 458042336179 627513635079
+ Europe 436448815097 493183311052
+ Oceania 345236880176 403657044512
+
+
You can use these functions in place of for loops (and
+it is usually faster to do so). To replace a for loop, put the code that
+was in the body of the for loop inside an anonymous
+function.
+
+
R
+
+
+d_ply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = function(x) {
+    meanGDPperCap <- mean(x$gdpPercap)
+    print(paste(
+      "The mean GDP per capita for", unique(x$continent),
+      "is", format(meanGDPperCap, big.mark = ",")
+    ))
+  }
+)
+
+
+
OUTPUT
+
+
[1] "The mean GDP per capita for Africa is 2,193.755"
+[1] "The mean GDP per capita for Americas is 7,136.11"
+[1] "The mean GDP per capita for Asia is 7,902.15"
+[1] "The mean GDP per capita for Europe is 14,469.48"
+[1] "The mean GDP per capita for Oceania is 18,621.61"
+
+
+
+
+
+
+
Tip: printing numbers
+
+
+
The format function can be used to make numeric values
+“pretty” for printing out in messages.
+
+
+
+
+
+
+
+
+
Challenge 2
+
+
+
Calculate the average life expectancy per continent and year. Which
+had the longest and shortest in 2007? Which had the greatest change in
+between 1952 and 2007?
How can I manipulate data frames without repeating myself?
+
+
+
+
+
+
+
Objectives
+
To be able to use the six main data frame manipulation ‘verbs’ with
+pipes in dplyr.
+
To understand how group_by() and
+summarize() can be combined to summarize datasets.
+
Be able to analyze a subset of data using logical filtering.
+
+
+
+
+
+
Manipulation of data frames means many things to many researchers: we
+often select certain observations (rows) or variables (columns), we
+often group the data by a certain variable(s), or we even calculate
+summary statistics. We can do these operations using the normal base R
+operations:
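
A minimal sketch of what such base R code tends to look like, using a made-up toy data frame (gap_toy) in place of gapminder. The names and values are assumptions for illustration.

```r
# Toy stand-in for the gapminder data (values are made up)
gap_toy <- data.frame(
  continent = c("Africa", "Africa", "Americas", "Asia"),
  gdpPercap = c(1000, 3000, 8000, 6000)
)

# Select rows by continent, then summarise: one line per group
mean_africa   <- mean(gap_toy$gdpPercap[gap_toy$continent == "Africa"])
mean_americas <- mean(gap_toy$gdpPercap[gap_toy$continent == "Americas"])
mean_asia     <- mean(gap_toy$gdpPercap[gap_toy$continent == "Asia"])
```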
But this isn’t very nice because there is a fair bit of
+repetition. Repeating yourself will cost you time, both now and later,
+and potentially introduce some nasty bugs.
+
The dplyr package
+
+
Luckily, the dplyr
+package provides a number of very useful functions for manipulating data
+frames in a way that will reduce the above repetition, reduce the
+probability of making errors, and probably even save you some typing. As
+an added bonus, you might even find the dplyr grammar
+easier to read.
+
+
+
+
+
+
Tip: Tidyverse
+
+
+
The dplyr package belongs to a broader family of opinionated
+R packages designed for data science called the “Tidyverse”. These
+packages are specifically designed to work harmoniously together. Some
+of these packages will be covered during this course, but you can find
+more complete information here: https://www.tidyverse.org/.
+
+
+
+
Here we’re going to cover 5 of the most commonly used functions as
+well as using pipes (%>%) to combine them.
+
select()
+
filter()
+
group_by()
+
summarize()
+
mutate()
+
If you have not installed this package earlier, please do
+so:
+
+
R
+
+
+install.packages('dplyr')
+
+
Now let’s load the package:
+
+
R
+
+
+library("dplyr")
+
+
Using select()
+
+
If, for example, we wanted to move forward with only a few of the
+variables in our data frame we could use the select()
+function. This will keep only the variables you select.
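
A small sketch of select() without pipes, assuming a made-up toy data frame (gap_toy) in place of gapminder. The values are invented for illustration.

```r
library("dplyr")

# Toy stand-in for the gapminder data (values are made up)
gap_toy <- data.frame(
  country   = c("Senegal", "Iceland"),
  continent = c("Africa", "Europe"),
  year      = c(2007L, 2007L),
  lifeExp   = c(63, 82),
  gdpPercap = c(1700, 36000)
)

# The data frame comes first, followed by the columns to keep
year_country_gdp <- select(gap_toy, year, country, gdpPercap)
names(year_country_gdp)
```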
If we open up year_country_gdp we’ll see that it only
+contains the year, country and gdpPercap. Above we used ‘normal’
+grammar, but the strengths of dplyr lie in combining
+several functions using pipes. Since the pipes grammar is unlike
+anything we’ve seen in R before, let’s repeat what we’ve done above
+using pipes.
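
The same selection written with a pipe might look like the sketch below. It again uses a made-up toy data frame (gap_toy), so the names and values are assumptions rather than the lesson's own data.

```r
library("dplyr")

# Toy stand-in for the gapminder data (values are made up)
gap_toy <- data.frame(
  country   = c("Senegal", "Iceland"),
  continent = c("Africa", "Europe"),
  year      = c(2007L, 2007L),
  gdpPercap = c(1700, 36000)
)

# The pipe passes gap_toy in as the first argument of select()
year_country_gdp <- gap_toy %>% select(year, country, gdpPercap)
```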
To help you understand why we wrote that in that way, let’s walk
+through it step by step. First we summon the gapminder data frame and
+pass it on, using the pipe symbol %>%, to the next step,
+which is the select() function. In this case we don’t
+specify which data object we use in the select() function
+since it gets that from the previous pipe. Fun Fact:
+There is a good chance you have encountered pipes before in the shell.
+In R, a pipe symbol is %>% while in the shell it is
+| but the concept is the same!
+
+
+
+
+
+
Tip: Renaming data frame columns in dplyr
+
+
+
In Chapter 4 we covered how you can rename columns with base R by
+assigning a value to the output of the names() function.
+Just like select, this is a bit cumbersome, but thankfully dplyr has a
+rename() function.
+
Within a pipeline, the syntax is
+rename(new_name = old_name). For example, we may want to
+rename the gdpPercap column name from our select()
+statement above.
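
A short sketch of rename() inside a pipeline. The toy data frame (gap_toy) and the new column name gdp_per_capita are assumptions chosen for illustration.

```r
library("dplyr")

# Toy stand-in for the selected gapminder columns (values are made up)
gap_toy <- data.frame(
  year      = c(2002L, 2007L),
  country   = c("Senegal", "Senegal"),
  gdpPercap = c(1500, 1700)
)

# rename() takes new_name = old_name
tidy_gdp <- gap_toy %>% rename(gdp_per_capita = gdpPercap)
names(tidy_gdp)
```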
Write a single command (which can span multiple lines and includes
+pipes) that will produce a data frame that has the African values for
+lifeExp, country and year, but
+not for other Continents. How many rows does your data frame have and
+why?
As with last time, first we pass the gapminder data frame to the
+filter() function, then we pass the filtered version of the
+gapminder data frame to the select() function.
+Note: The order of operations is very important in this
+case. If we used ‘select’ first, filter would not be able to find the
+variable continent since we would have removed it in the previous
+step.
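
The filter-then-select order described above can be sketched as follows, assuming a made-up toy data frame (gap_toy) and a hypothetical result name.

```r
library("dplyr")

# Toy stand-in for the gapminder data (values are made up)
gap_toy <- data.frame(
  country   = c("Iceland", "Senegal", "France"),
  continent = c("Europe", "Africa", "Europe"),
  year      = c(2007L, 2007L, 2007L),
  gdpPercap = c(36000, 1700, 30000)
)

year_country_gdp_euro <- gap_toy %>%
  filter(continent == "Europe") %>%     # continent is still available here
  select(year, country, gdpPercap)      # continent is dropped only afterwards
```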
+
Using group_by()
+
+
Now, we were supposed to be reducing the error-prone repetitiveness
+of what can be done with base R, but up to now we haven’t done that
+since we would have to repeat the above for each continent. Instead of
+filter(), which will only pass observations that meet your
+criteria (in the above: continent=="Europe"), we can use
+group_by(), which will essentially use every unique
+criterion that you could have used in filter.
+
+
R
+
+
+str(gapminder)
+
+
+
OUTPUT
+
+
'data.frame': 1704 obs. of 6 variables:
+ $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp : num 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num 779 821 853 836 740 ...
You will notice that the structure of the data frame where we used
+group_by() (grouped_df) is not the same as the
+original gapminder (data.frame). A
+grouped_df can be thought of as a list where
+each item in the list is a data.frame which
+contains only the rows that correspond to a particular value of
+continent (at least in the example above).
+
Using summarize()
+
+
The above was a bit on the uneventful side but
+group_by() is much more exciting in conjunction with
+summarize(). This will allow us to create new variable(s)
+by using functions that repeat for each of the continent-specific data
+frames. That is to say, using the group_by() function, we
+split our original data frame into multiple pieces, then we can run
+functions (e.g. mean() or sd()) within
+summarize().
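
A minimal sketch of group_by() combined with summarize(), using a made-up toy data frame (gap_toy). The result name and the values are assumptions for illustration.

```r
library("dplyr")

# Toy stand-in for the gapminder data (values are made up)
gap_toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  gdpPercap = c(1000, 3000, 5000, 7000)
)

gdp_bycontinents <- gap_toy %>%
  group_by(continent) %>%                        # split into one group per continent
  summarize(mean_gdpPercap = mean(gdpPercap))    # one summary row per group
```

The result has one row per continent and one column per summary created inside summarize().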
# A tibble: 2 × 2
+ country mean_lifeExp
+ <chr> <dbl>
+1 Iceland 76.5
+2 Sierra Leone 36.8
+
+
Another way to do this is to use the dplyr function
+arrange(), which arranges the rows in a data frame
+according to the order of one or more variables from the data frame. It
+has similar syntax to other functions from the dplyr
+package. You can use desc() inside arrange()
+to sort in descending order.
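
A short sketch of arrange() and desc(), assuming a small made-up summary table (lifeExp_bycountry) with invented country labels and values.

```r
library("dplyr")

# Toy summary table (values are made up)
lifeExp_bycountry <- data.frame(
  country      = c("X", "Y", "Z"),
  mean_lifeExp = c(60, 75, 40)
)

# Ascending order is the default
ascending <- lifeExp_bycountry %>% arrange(mean_lifeExp)

# Wrap the column in desc() for descending order
descending <- lifeExp_bycountry %>% arrange(desc(mean_lifeExp))
```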
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
count() and n()
+
+
A very common operation is to count the number of observations for
+each group. The dplyr package comes with two related
+functions that help with this.
+
For instance, if we wanted to check the number of countries included
+in the dataset for the year 2002, we can use the count()
+function. It takes the name of one or more columns that contain the
+groups we are interested in, and we can optionally sort the results in
+descending order by adding sort=TRUE:
continent n
+1 Africa 52
+2 Asia 33
+3 Europe 30
+4 Americas 25
+5 Oceania 2
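
A sketch of the count() call described above, run on a made-up toy data frame (gap_toy) rather than the real data, so the counts differ from the output shown.

```r
library("dplyr")

# Toy stand-in for the gapminder data (values are made up)
gap_toy <- data.frame(
  country   = c("A", "B", "C", "D", "A"),
  continent = c("Africa", "Africa", "Asia", "Africa", "Africa"),
  year      = c(2002, 2002, 2002, 2002, 1997)
)

counts_2002 <- gap_toy %>%
  filter(year == 2002) %>%          # keep a single year
  count(continent, sort = TRUE)     # rows per continent, largest first
```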
+
+
If we need to use the number of observations in calculations, the
+n() function is useful. It will return the total number of
+observations in the current group rather than counting the number of
+observations in each group within a specific column. For instance, if we
+wanted to get the standard error of the life expectancy per
+continent:
# A tibble: 5 × 2
+ continent se_le
+ <chr> <dbl>
+1 Africa 0.366
+2 Americas 0.540
+3 Asia 0.596
+4 Europe 0.286
+5 Oceania 0.775
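
The standard error is the standard deviation divided by the square root of the number of observations, which is where n() comes in. A minimal sketch on a made-up toy data frame (gap_toy), so the numbers differ from the output above:

```r
library("dplyr")

# Toy stand-in for the gapminder data (values are made up)
gap_toy <- data.frame(
  continent = c("Africa", "Africa", "Africa", "Africa",
                "Asia", "Asia", "Asia", "Asia"),
  lifeExp   = c(40, 50, 60, 50, 60, 70, 80, 70)
)

se_by_continent <- gap_toy %>%
  group_by(continent) %>%
  summarize(se_le = sd(lifeExp) / sqrt(n()))   # n() is the size of the current group
```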
+
+
You can also chain together several summary operations; in this case
+calculating the minimum, maximum,
+mean and se of each continent’s per-country
+life-expectancy:
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
Connect mutate with logical filtering: ifelse
+
+
When creating new variables, we can hook this with a logical
+condition. A simple combination of mutate() and
+ifelse() facilitates filtering right where it is needed: in
+the moment of creating something new. This easy-to-read statement is a
+fast and powerful way of discarding certain data (even though the
+overall dimension of the data frame will not change) or for updating
+values depending on this given condition.
+
+
R
+
+
+## keeping all data but "filtering" after a certain condition
+# calculate GDP only for observations with a life expectancy above 25
+gdp_pop_bycontinents_byyear_above25 <- gapminder %>%
+  mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>%
+  group_by(continent, year) %>%
+  summarize(mean_gdpPercap = mean(gdpPercap),
+            sd_gdpPercap = sd(gdpPercap),
+            mean_pop = mean(pop),
+            sd_pop = sd(pop),
+            mean_gdp_billion = mean(gdp_billion),
+            sd_gdp_billion = sd(gdp_billion))
+
+
+
OUTPUT
+
+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
+
R
+
+
+## updating only if a certain condition is fulfilled
+# for life expectancies above 40 years, the GDP to be expected in the future is scaled
+gdp_future_bycontinents_byyear_high_lifeExp <- gapminder %>%
+  mutate(gdp_futureExpectation = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>%
+  group_by(continent, year) %>%
+  summarize(mean_gdpPercap = mean(gdpPercap),
+            mean_gdpPercap_expected = mean(gdp_futureExpectation))
+
+
+
OUTPUT
+
+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
Combining dplyr and ggplot2
+
+
First install and load ggplot2:
+
+
R
+
+
+install.packages('ggplot2')
+
+
+
R
+
+
+library("ggplot2")
+
+
In the plotting lesson we looked at how to make a multi-panel figure
+by adding a layer of facet panels using ggplot2. Here is
+the code we used (with some extra comments):
+
+
R
+
+
+# Filter countries located in the Americas
+americas <- gapminder[gapminder$continent == "Americas", ]
+# Make the plot
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap(~ country) +
+  theme(axis.text.x = element_text(angle = 45))
+
+
This code makes the right plot but it also creates an intermediate
+variable (americas) that we might not have any other uses
+for. Just as we used %>% to pipe data along a chain of
+dplyr functions we can use it to pass data to
+ggplot(). Because %>% replaces the first
+argument in a function we don’t need to specify the data =
+argument in the ggplot() function. By combining
+dplyr and ggplot2 functions we can make the
+same figure without creating any new variables or modifying the
+data.
+
+
R
+
+
+gapminder %>%
+  # Filter countries located in the Americas
+  filter(continent == "Americas") %>%
+  # Make the plot
+  ggplot(mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap(~ country) +
+  theme(axis.text.x = element_text(angle = 45))
+
+
More examples of using the function mutate() and the
+ggplot2 package.
+
+
R
+
+
+gapminder %>%
+  # extract first letter of country name into new column
+  mutate(startsWith = substr(country, 1, 1)) %>%
+  # only keep countries starting with A or Z
+  filter(startsWith %in% c("A", "Z")) %>%
+  # plot lifeExp into facets
+  ggplot(aes(x = year, y = lifeExp, colour = continent)) +
+  geom_line() +
+  facet_wrap(vars(country)) +
+  theme_minimal()
+
+
+
+
+
+
+
Advanced Challenge
+
+
+
Calculate the average life expectancy in 2002 of 2 randomly selected
+countries for each continent. Then arrange the continent names in
+reverse order. Hint: Use the dplyr
+functions arrange() and sample_n(), they have
+similar syntax to other dplyr functions.
To understand the concepts of ‘longer’ and ‘wider’ data frame
+formats and be able to convert between them with
+tidyr.
+
+
+
+
+
+
Researchers often want to reshape their data frames from ‘wide’ to
+‘longer’ layouts, or vice-versa. The ‘long’ layout or format is
+where:
+
each column is a variable
+
each row is an observation
+
In the purely ‘long’ (or ‘longest’) format, you usually have 1 column
+for the observed variable and the other columns are ID variables.
+
For the ‘wide’ format each row is often a site/subject/patient and
+you have multiple observation variables containing the same type of
+data. These can be either repeated observations over time, or
+observation of multiple variables (or a mix of both). You may find data
+input may be simpler or some other applications may prefer the ‘wide’
+format. However, many of R’s functions have been designed
+assuming you have ‘longer’ formatted data. This tutorial will help you
+efficiently transform your data shape regardless of original format.
+
Long and wide data frame layouts mainly affect readability. For
+humans, the wide format is often more intuitive since we can often see
+more of the data on the screen due to its shape. However, the long
+format is more machine readable and is closer to the formatting of
+databases. The ID variables in our data frames are similar to the fields
+in a database and observed variables are like the database values.
+
Getting started
+
+
First install the packages if you haven’t already done so (you
+probably installed dplyr in the previous lesson):
First, let’s look at the structure of our original gapminder data
+frame:
+
+
R
+
+
+str(gapminder)
+
+
+
OUTPUT
+
+
'data.frame': 1704 obs. of 6 variables:
+ $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp : num 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num 779 821 853 836 740 ...
+
+
+
+
+
+
+
Challenge 1
+
+
+
Is gapminder a purely long, purely wide, or some intermediate
+format?
+
+
+
+
+
+
+
+
+
The original gapminder data.frame is in an intermediate format. It is
+not purely long since it has multiple observation variables
+(pop, lifeExp, gdpPercap).
+
+
+
+
+
Sometimes, as with the gapminder dataset, we have multiple types of
+observed data. It is somewhere in between the purely ‘long’ and ‘wide’
+data formats. We have 3 “ID variables” (continent,
+country, year) and 3 “Observation variables”
+(pop, lifeExp, gdpPercap). This
+intermediate format can be preferred despite not having ALL observations
+in 1 column given that all 3 observation variables have different units.
+There are few operations that would need us to make this data frame any
+longer (i.e. 4 ID variables and 1 Observation variable).
+
While using many of the functions in R, which are often vector based,
+you usually do not want to do mathematical operations on values with
+different units. For example, using the purely long format, a single
+mean for all of the values of population, life expectancy, and GDP would
+not be meaningful since it would return the mean of values with 3
+incompatible units. The solution is that we first manipulate the data
+either by grouping (see the lesson on dplyr), or we change
+the structure of the data frame. Note: Some plotting
+functions in R actually work better in the wide format data.
+
From wide to long format with pivot_longer()
+
+
Until now, we’ve been using the nicely formatted original gapminder
+dataset, but ‘real’ data (i.e. our own research data) will never be so
+well organized. Here let’s start with the wide formatted version of the
+gapminder dataset.
+
+
Download the wide version of the gapminder data from here and save it in your data
+folder.
+
+
We’ll load the data file and look at it. Note: we don’t want our
+continent and country columns to be factors, so we use the
+stringsAsFactors argument for read.csv() to disable
+that.
To change this very wide data frame layout back to our nice,
+intermediate (or longer) layout, we will use one of the two available
+pivot functions from the tidyr package. To
+convert from wide to a longer format, we will use the
+pivot_longer() function. pivot_longer() makes
+datasets longer by increasing the number of rows and decreasing the
+number of columns, or ‘lengthening’ your observation variables into a
+single variable.
Here we have used piping syntax which is similar to what we were
+doing in the previous lesson with dplyr. In fact, these are compatible
+and you can use a mix of tidyr and dplyr functions by piping them
+together.
+
We first provide to pivot_longer() a vector of column
+names that will be pivoted into longer format. We could type out all the
+observation variables, but as in the select() function (see
+dplyr lesson), we can use the starts_with()
+argument to select all variables that start with the desired character
+string. pivot_longer() also allows the alternative syntax
+of using the - symbol to identify which variables are not
+to be pivoted (i.e. ID variables).
+
The next arguments to pivot_longer() are
+names_to for naming the column that will contain the new ID
+variable (obstype_year) and values_to for
+naming the new amalgamated observation variable
+(obs_value). We supply these new column names as
+strings.
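
The arguments described above can be sketched on a tiny wide table. The data frame gap_wide_toy, its column names, and its values are made up for illustration; only the pivot_longer() arguments follow the description.

```r
library("tidyr")
library("dplyr")

# Toy wide table: one row per country, one column per metric and year
gap_wide_toy <- data.frame(
  continent    = c("Africa", "Asia"),
  country      = c("Senegal", "Japan"),
  pop_1952     = c(1, 2),
  pop_2007     = c(3, 4),
  lifeExp_1952 = c(40, 60),
  lifeExp_2007 = c(60, 80)
)

# Name the columns to pivot with starts_with()
gap_long_toy <- gap_wide_toy %>%
  pivot_longer(
    cols      = c(starts_with("pop"), starts_with("lifeExp")),
    names_to  = "obstype_year",
    values_to = "obs_value"
  )

# Alternative: exclude the ID variables with -
gap_long_alt <- gap_wide_toy %>%
  pivot_longer(
    cols      = c(-continent, -country),
    names_to  = "obstype_year",
    values_to = "obs_value"
  )
```

Both calls pivot the same four columns, so each yields 8 rows: 2 countries times 4 observation columns.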
That may seem trivial with this particular data frame, but sometimes
+you have 1 ID variable and 40 observation variables with irregular
+variable names. The flexibility is a huge time saver!
+
Now obstype_year actually contains 2 pieces of
+information, the observation type
+(pop, lifeExp, or gdpPercap) and
+the year. We can use the separate() function
+to split the character strings into multiple variables:
+
+
R
+
+
+gap_long <- gap_long %>% separate(obstype_year, into = c('obs_type', 'year'), sep = "_")
+gap_long$year <- as.integer(gap_long$year)
+
+
+
+
+
+
+
Challenge 2
+
+
+
Using gap_long, calculate the mean life expectancy,
+population, and gdpPercap for each continent. Hint: use
+the group_by() and summarize() functions we
+learned in the dplyr lesson.
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
+
OUTPUT
+
+
# A tibble: 15 × 3
+# Groups: continent [5]
+ continent obs_type means
+ <chr> <chr> <dbl>
+ 1 Africa gdpPercap 2194.
+ 2 Africa lifeExp 48.9
+ 3 Africa pop 9916003.
+ 4 Americas gdpPercap 7136.
+ 5 Americas lifeExp 64.7
+ 6 Americas pop 24504795.
+ 7 Asia gdpPercap 7902.
+ 8 Asia lifeExp 60.1
+ 9 Asia pop 77038722.
+10 Europe gdpPercap 14469.
+11 Europe lifeExp 71.9
+12 Europe pop 17169765.
+13 Oceania gdpPercap 18622.
+14 Oceania lifeExp 74.3
+15 Oceania pop 8874672.
+
+
+
+
+
+
From long to intermediate format with pivot_wider()
+
+
It is always good to check work. So, let’s use the second
+pivot function, pivot_wider(), to ‘widen’ our
+observation variables back out. pivot_wider() is the
+opposite of pivot_longer(), making a dataset wider by
+increasing the number of columns and decreasing the number of rows. We
+can use pivot_wider() to pivot or reshape our
+gap_long to the original intermediate format or the widest
+format. Let’s start with the intermediate format.
+
The pivot_wider() function takes names_from
+and values_from arguments.
+
To names_from we supply the column name whose contents
+will be pivoted into new output columns in the widened data frame. The
+corresponding values will be added from the column named in the
+values_from argument.
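
A minimal sketch of names_from and values_from on a tiny long table. The data frame gap_long_toy and its values are made up for illustration, and the result name gap_normal_toy is hypothetical.

```r
library("tidyr")
library("dplyr")

# Toy long table: one row per country, year and observation type
gap_long_toy <- data.frame(
  country   = c("Senegal", "Senegal", "Japan", "Japan"),
  year      = c(2007L, 2007L, 2007L, 2007L),
  obs_type  = c("pop", "lifeExp", "pop", "lifeExp"),
  obs_value = c(12, 63, 127, 82)
)

gap_normal_toy <- gap_long_toy %>%
  pivot_wider(
    names_from  = obs_type,    # contents become the new column names
    values_from = obs_value    # contents fill the new columns
  )
```

Each distinct value of obs_type becomes its own column, so the 4 long rows collapse into 2 wider rows.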
Now we’ve got an intermediate data frame gap_normal with
+the same dimensions as the original gapminder, but the
+order of the variables is different. Let’s fix that before checking if
+they are all.equal().
That’s great! We’ve gone from the longest format back to the
+intermediate and we didn’t introduce any errors in our code.
+
Now let’s convert the long all the way back to the wide. In the wide
+format, we will keep country and continent as ID variables and pivot the
+observations across the 3 metrics
+(pop, lifeExp, gdpPercap) and time
+(year). First we need to create appropriate labels for all
+our new variables (time*metric combinations) and we also need to unify
+our ID variables to simplify the process of defining
+gap_wide.
Using unite() we now have a single ID variable which is
+a combination of continent and country, and we have
+defined variable names. We’re now ready to pipe in
+pivot_wider().
Take this 1 step further and create a
+gap_ludicrously_wide format data frame by pivoting over
+countries, year and the 3 metrics. Hint: this new data
+frame should only have 5 rows.
Understand the value of writing reproducible reports
+
Learn how to recognise and compile the basic components of an R
+Markdown file
+
Become familiar with R code chunks, and understand their purpose,
+structure and options
+
Demonstrate the use of inline chunks for weaving R outputs into text
+blocks, for example when discussing the results of some
+calculations
+
Be aware of alternative output formats to which an R Markdown file
+can be exported
+
+
+
+
+
+
Data analysis reports
+
+
Data analysts tend to write a lot of reports, describing their
+analyses and results, for their collaborators or to document their work
+for future reference.
+
Many new users begin by first writing a single R script containing
+all of their work, and then share the analysis by emailing the script
+and various graphs as attachments. But this can be cumbersome, requiring
+a lengthy discussion to explain which attachment was which result.
+
Writing formal reports with Word or LaTeX can simplify this
+process by incorporating both the analysis report and output graphs into
+a single document. But tweaking formatting to make figures look correct
+and fixing obnoxious page breaks can be tedious and lead to a lengthy
+“whack-a-mole” game of fixing new mistakes resulting from a single
+formatting change.
+
Creating a report as a web page (which is an html file) using R
+Markdown makes things easier. The report can be one long stream, so tall
+figures that wouldn’t ordinarily fit on one page can be kept at full
+size and easier to read, since the reader can simply keep scrolling.
+Additionally, the formatting of an R Markdown document is simple and
+easy to modify, allowing you to spend more time on your analyses instead
+of writing reports.
+
Literate programming
+
+
Ideally, such analysis reports are reproducible documents:
+If an error is discovered, or if some additional subjects are added to
+the data, you can just re-compile the report and get the new or
+corrected results rather than having to reconstruct figures, paste them
+into a Word document, and hand-edit various detailed results.
+
The key R package here is knitr. It allows you
+to create a document that is a mixture of text and chunks of code. When
+the document is processed by knitr, chunks of code will be
+executed, and graphs or other results will be inserted into the final
+document.
+
This sort of idea has been called “literate programming”.
+
knitr allows you to mix basically any type of text with
+code from different programming languages, but we recommend that you use
+R Markdown, which mixes Markdown with R. Markdown is a light-weight
+mark-up language for creating web pages.
+
Creating an R Markdown file
+
+
Within RStudio, click File → New File → R Markdown and you’ll get a
+dialog box like this:
+
You can stick with the default (HTML output), but give it a
+title.
+
Basic components of R Markdown
+
+
The initial chunk of text (header) contains instructions for R to
+specify what kind of document will be created, and the options chosen.
+You can use the header to give your document a title, author, date, and
+tell it what type of output you want to produce. In this case, we’re
+creating an html document.
You can delete any of those fields if you don’t want them included.
+The double-quotes aren’t strictly necessary in this case.
+They’re mostly needed if you want to include a colon in the title.
+
RStudio creates the document with some example text to get you
+started. Note below that there are chunks like
+
+```{r}
+summary(cars)
+```
+
+
These are chunks of R code that will be executed by
+knitr and replaced by their results. More on this
+later.
+
Markdown
+
+
Markdown is a system for writing web pages by marking up the text
+much as you would in an email rather than writing html code. The
+marked-up text gets converted to html, replacing the marks with
+the proper html code.
+
For now, let’s delete all of the stuff that’s there and write a bit
+of markdown.
+
You make things bold using two asterisks, like this:
+**bold**, and you make things italics by using
+underscores, like this: _italics_.
+
You can make a bulleted list by starting each item with a hyphen or
+an asterisk, leaving a blank line between the list and other text, like this:
+
A list:
+
+* bold with double-asterisks
+* italics with underscores
+* code-type font with backticks
+
or like this:
+
A second list:
+
+- bold with double-asterisks
+- italics with underscores
+- code-type font with backticks
+
Each will appear as:
+
bold with double-asterisks
+
italics with underscores
+
code-type font with backticks
+
You can use whatever method you prefer, but be consistent.
+This maintains the readability of your code.
+
You can make a numbered list by just using numbers. You can even use
+the same number over and over if you want:
+
1. bold with double-asterisks
+1. italics with underscores
+1. code-type font with backticks
+
This will appear as:
+
bold with double-asterisks
+
italics with underscores
+
code-type font with backticks
+
You can make section headers of different sizes by initiating a line
+with some number of # symbols:
+
# Title
+## Main section
+### Sub-section
+#### Sub-sub section
+
You compile the R Markdown document to an html webpage by
+clicking the “Knit” button in the upper-left.
+
+
+
+
+
+
Challenge 1
+
+
+
Create a new R Markdown document. Delete all of the R code chunks and
+write a bit of Markdown (some sections, some italicized text, and an
+itemized list).
+
Convert the document to a webpage.
+
+
+
+
+
+
+
+
+
In RStudio, select File > New file > R Markdown…
+
Delete the placeholder text and add the following:
+
# Introduction
+
+## Background on Data
+
+This report uses the *gapminder* dataset, which has columns that include:
+
+* country
+* continent
+* year
+* lifeExp
+* pop
+* gdpPercap
+
+## Background on Methods
+
+
Then click the ‘Knit’ button on the toolbar to generate an html
+document (webpage).
+
+
+
+
+
A bit more Markdown
+
+
You can make a hyperlink like this:
+[Carpentries Home Page](https://carpentries.org/).
+
You can include an image file like this:
+![The Carpentries Logo](https://carpentries.org/assets/img/TheCarpentries.svg)
+
You can do subscripts (e.g., F₂) with F~2~
+and superscripts (e.g., F²) with F^2^.
+
If you know how to write equations in LaTeX, you can use
+$ $ and $$ $$ to insert math equations, like
+$E = mc^2$ and
+
$$y = \mu + \sum_{i=1}^p \beta_i x_i + \epsilon$$
+
You can review Markdown syntax by navigating to the “Markdown Quick
+Reference” under the “Help” field in the toolbar at the top of
+RStudio.
+
R code chunks
+
+
The real power of Markdown comes from mixing markdown with chunks of
+code. This is R Markdown. When processed, the R code will be executed;
+if it produces figures, the figures will be inserted in the final
+document.
+
The main code chunks look like this:
+
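+```{r load_data}
+# Assumed path; point this at your own copy of the gapminder data.
+gapminder <- read.csv("data/gapminder_data.csv")
+```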
+
That is, you place a chunk of R code between ```{r
+chunk_name} and ```. You should give each chunk a
+unique name: names help you to locate errors and, if any graphs are
+produced, their file names are based on the name of the code chunk that
+produced them. You can create code chunks quickly in RStudio using the
+shortcuts Ctrl+Alt+I on Windows and
+Linux, or Cmd+Option+I on Mac.
+
+
+
+
+
+
Challenge 2
+
+
+
Add code chunks to:
+
Load the ggplot2 package
+
Read the gapminder data
+
Create a plot
+
+
+
+
+
+
+
+
+
+```{r load-ggplot2}
+library("ggplot2")
+```
+
+
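+```{r read-gapminder-data}
+# Assumed path; point this at your own copy of the gapminder data.
+gapminder <- read.csv("data/gapminder_data.csv")
+```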
+
+```{r make-plot}
+plot(lifeExp ~ year, data = gapminder)
+```
+
+
+
+
+
+
+
How things get compiled
+
+
When you press the “Knit” button, the R Markdown document is
+processed by knitr
+and a plain Markdown document is produced (as well as, potentially, a
+set of figure files): the R code is executed and replaced by both the
+input and the output; if figures are produced, links to those figures
+are included.
+
The Markdown and figure documents are then processed by the tool pandoc, which converts the
+Markdown file into an html file, with the figures embedded.
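
The first of these two stages can also be run by hand from the console. The sketch below assumes the knitr package is installed and uses a throwaway document in a temporary directory. The file names and the chunk label are illustrative only.

```r
# A tiny R Markdown document with one code chunk (illustrative content).
rmd_file <- tempfile(fileext = ".Rmd")
md_file  <- sub("\\.Rmd$", ".md", rmd_file)

# Three backticks, built with strrep() so this example contains no literal fence.
fence <- strrep("`", 3)
writeLines(
  c("# Demo", "", paste0(fence, "{r simple-sum}"), "1 + 1", fence),
  rmd_file
)

if (requireNamespace("knitr", quietly = TRUE)) {
  # Stage 1: knitr runs the chunks and writes plain Markdown.
  knitr::knit(rmd_file, output = md_file, quiet = TRUE)
  cat(readLines(md_file), sep = "\n")
}
# Stage 2 (Markdown to HTML via pandoc) is what the Knit button, or
# rmarkdown::render(rmd_file), adds on top of this.
```

Looking at the intermediate Markdown file is a handy way to check what knitr produced when the final HTML is not what you expected.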
+
Chunk options
+
+
There are a variety of options to affect how the code chunks are
+treated. Here are some examples:
+
Use echo=FALSE to avoid having the code itself
+shown.
+
Use results="hide" to avoid having any results
+printed.
+
Use eval=FALSE to have the code shown but not
+evaluated.
+
Use warning=FALSE and message=FALSE to
+hide any warnings or messages produced.
+
Use fig.height and fig.width to control
+the size of the figures produced (in inches).
The fig.path option defines where the figures will be
+saved. If you set it to a folder, as in fig.path="Figs/", the trailing /
+is really important; without it, the figures would be saved in the
+standard place but with names that begin with Figs.
+
If you have multiple R Markdown files in a common directory, you
+might want to use fig.path to define separate prefixes for
+the figure file names, like fig.path="Figs/cleaning-" and
+fig.path="Figs/analysis-".
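
Options that you want applied to every chunk can be set once, near the top of the document, with knitr::opts_chunk$set(). A minimal sketch, assuming knitr is installed. The particular values are only examples.

```r
# Typically placed in a first chunk of the .Rmd file with include=FALSE.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(
    fig.path   = "Figs/",   # trailing slash: figures go into a Figs/ folder
    fig.width  = 6,         # inches
    fig.height = 4,         # inches
    echo       = FALSE,     # hide the code itself
    warning    = FALSE,     # hide warnings
    message    = FALSE      # hide messages
  )
  knitr::opts_chunk$get("fig.path")
}
```

Options given in an individual chunk header still override these defaults for that chunk.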
+
+
+
+
+
+
Challenge 3
+
+
+
Use chunk options to control the size of a figure and to hide the
+code.
You can review all of the R chunk options by navigating
+to the “R Markdown Cheat Sheet” under the “Cheatsheets” section of the
+“Help” field in the toolbar at the top of RStudio.
+
Inline R code
+
+
You can make every number in your report reproducible. Use
+`r and ` for an in-line code chunk, like so:
+`r round(some_value, 2)`. The code will be executed and
+replaced with the value of the result.
+
Don’t let these in-line chunks get split across lines.
+
Perhaps precede the paragraph with a larger code chunk that does
+calculations and defines variables, with include=FALSE for
+that larger chunk (which is the same as echo=FALSE and
+results="hide").
+
Rounding can produce differences in output in such situations. You
+may want 2.0, but round(2.03, 1) will give
+just 2.
+
The myround
+function in the R/broman
+package handles this.
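
If you would rather stay within base R, formatting the number as text keeps the trailing zero. A short sketch:

```r
x <- 2.03
round(x, 1)                        # numeric 2, printed without the trailing zero
sprintf("%.1f", x)                 # character string "2.0"
format(round(x, 1), nsmall = 1)    # also "2.0"
```

Both alternatives return character strings, so use them only at the point where the value is inserted into text.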
+
+
+
+
+
+
Challenge 4
+
+
+
Try out a bit of in-line R code.
+
+
+
+
+
+
+
+
+
Here’s some inline code to determine that 2 + 2 = `r 2+2`.
+
+
+
+
+
Other output options
+
+
You can also convert R Markdown to a PDF or a Word document. Click
+the little triangle next to the “Knit” button to get a drop-down menu.
+Or you could put pdf_document or word_document
+in the initial header of the file.
+
+
+
+
+
+
Tip: Creating PDF documents
+
+
+
Creating .pdf documents may require installation of some extra
+software. The R package tinytex provides some tools to help
+make this process easier for R users. With tinytex
+installed, run tinytex::install_tinytex() to install the
+required software (you’ll only need to do this once) and then when you
+knit to pdf tinytex will automatically detect and install
+any additional LaTeX packages that are needed to produce the pdf
+document. Visit the tinytex
+website for more information.
+
+
+
+
+
+
+
+
+
Tip: Visual markdown editing in RStudio
+
+
+
RStudio versions 1.4 and later include visual markdown editing mode.
+In visual editing mode, markdown expressions (like
+**bold words**) are transformed to the formatted appearance
+(bold words) as you type. This mode also includes a
+toolbar at the top with basic formatting buttons, similar to what you
+might see in common word processing software programs. You can turn
+visual editing on and off by pressing the button in the top right corner of your
+R Markdown document.
How can I write software that other people can use?
+
+
+
+
+
+
+
Objectives
+
Describe best practices for writing R and explain the justification
+for each.
+
+
+
+
+
+
Structure your project folder
+
+
Keep your project folder structured, organized and tidy, by creating
+subfolders for your code files, manuals, data, binaries, output plots,
+etc. It can be done completely manually, or with the help of RStudio’s
+New Project functionality, or a designated package, such as
+ProjectTemplate.
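
Doing it manually can itself be scripted. The sketch below creates a small folder skeleton under a temporary directory. The folder names are one common convention rather than a requirement, and "my_project" is just an example name.

```r
# Create a small project skeleton under a temporary directory.
# The folder names are one common convention, not a requirement.
project_dir <- file.path(tempdir(), "my_project")
subfolders  <- c("data", "src", "doc", "results")

for (folder in subfolders) {
  dir.create(file.path(project_dir, folder), recursive = TRUE, showWarnings = FALSE)
}

# List the folders that now exist directly under the project directory.
list.dirs(project_dir, full.names = FALSE, recursive = FALSE)
```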
+
+
+
+
+
+
Tip: ProjectTemplate - a possible
+solution
+
+
+
One way to automate the management of projects is to install the
+third-party package, ProjectTemplate. This package will set
+up an ideal directory structure for project management. This is very
+useful as it enables you to have your analysis pipeline/workflow
+organised and structured. Together with the default RStudio project
+functionality and Git you will be able to keep track of your work as
+well as be able to share your work with collaborators.
For more information on ProjectTemplate and its functionality, visit
+the ProjectTemplate home page.
+
+
+
+
Make code readable
+
+
The most important part of writing code is making it readable and
+understandable. You want someone else to be able to pick up your code
+and be able to understand what it does: more often than not this someone
+will be you six months down the line, who will otherwise be cursing
+your past self.
+
Documentation: tell us what and why, not how
+
+
When you first start out, your comments will often describe what a
+command does, since you’re still learning yourself and it can help to
+clarify concepts and remind you later. However, these comments aren’t
+particularly useful later on when you don’t remember what problem your
+code is trying to solve. Try to also include comments that tell you
+why you’re solving a problem, and what problem that
+is. The how can come after that: it’s an implementation detail
+you ideally shouldn’t have to worry about.
+
Keep your code modular
+
+
Our recommendation is that you should separate your functions from
+your analysis scripts, and store them in a separate file that you
+source when you open the R session in your project. This
+approach is nice because it leaves you with an uncluttered analysis
+script, and a repository of useful functions that can be loaded into any
+analysis script in your project. It also lets you group related
+functions together easily.
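
As a rough sketch of this layout, the code below writes one helper function to a separate file and then loads it with source(). A temporary file stands in for something like a functions file inside your project, and the function name fahr_to_celsius is a hypothetical example.

```r
# Stand-in for a functions file kept separately from the analysis script
# (the file and the function are hypothetical examples).
functions_file <- tempfile(fileext = ".R")
writeLines(c(
  "# Convert a temperature from Fahrenheit to Celsius.",
  "fahr_to_celsius <- function(temp_f) {",
  "  (temp_f - 32) * 5 / 9",
  "}"
), functions_file)

# In the analysis script: load the helper functions, then use them.
source(functions_file)
fahr_to_celsius(212)
```

The analysis script then only needs the single source() line at the top, however many helper functions the file grows to contain.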
+
Break down problem into bite size pieces
+
+
When you first start out, problem solving and function writing can be
+daunting tasks, and hard to separate from inexperience with coding. Try to
+break down your problem into digestible chunks and worry about the
+implementation details later: keep breaking down the problem into
+smaller and smaller functions until you reach a point where you can code
+a solution, and build back up from there.
+
Know that your code is doing the right thing
+
+
Make sure to test your functions!
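
One lightweight way to do this in base R is stopifnot(), which stays silent when every check passes and stops with an error otherwise. The function being tested here is a hypothetical example.

```r
# Hypothetical function under test.
fahr_to_celsius <- function(temp_f) {
  (temp_f - 32) * 5 / 9
}

# Re-runnable checks against values we know independently.
# stopifnot() is silent on success and stops with an error on failure.
stopifnot(
  isTRUE(all.equal(fahr_to_celsius(32), 0)),     # freezing point of water
  isTRUE(all.equal(fahr_to_celsius(212), 100)),  # boiling point of water
  isTRUE(all.equal(fahr_to_celsius(-40), -40))   # the two scales cross here
)
```

Keeping such checks in a script means you can re-run them every time you change the function.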
+
Don’t repeat yourself
+
+
Functions enable easy reuse within a project. If you see blocks of
+similar lines of code through your project, those are usually candidates
+for being moved into functions.
+
If your calculations are performed through a series of functions,
+then the project becomes more modular and easier to change. This is
+especially true of functions for which a particular input always gives
+the same output.
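
To make the idea concrete, here is a small sketch in which the same rescaling is first typed out twice and then written once as a function. The data values and the name rescale01 are made up for illustration.

```r
values_a <- c(2, 4, 6)
values_b <- c(10, 20, 60)

# Repeated pattern: the same rescaling typed out twice.
scaled_a <- (values_a - min(values_a)) / (max(values_a) - min(values_a))
scaled_b <- (values_b - min(values_b)) / (max(values_b) - min(values_b))

# The same calculation as a function (hypothetical name), written once.
rescale01 <- function(v) {
  (v - min(v)) / (max(v) - min(v))
}

# The function reproduces both hand-written results.
identical(rescale01(values_a), scaled_a)
identical(rescale01(values_b), scaled_b)
```

If the calculation ever needs to change, it now changes in one place.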
+
Remember to be stylish
+
+
Apply consistent style to your code.
+
+
+
+
+
+
Keypoints
+
+
+
Keep your project folder structured, organized and tidy.
+
Document what and why, not how.
+
Break programs into short single-purpose functions.
+
Write re-runnable tests.
+
Don’t repeat yourself.
+
Be consistent in naming, indentation, and other aspects of
+style.
+
+
diff --git a/404.html b/404.html
new file mode 100644
index 000000000..2c0bde5ad
--- /dev/null
+++ b/404.html
@@ -0,0 +1,451 @@
+
+R for Reproducible Scientific Analysis: Page not found
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ R for Reproducible Scientific Analysis
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Page not found
+
+
Our apologies!
+
+
We cannot seem to find the page you are looking for. Here are some
+tips that may help:
to Share—copy and redistribute the material in any
+medium or format
+
to Adapt—remix, transform, and build upon the
+material
+
for any purpose, even commercially.
+
The licensor cannot revoke these freedoms as long as you follow the
+license terms.
+
Under the following terms:
+
Attribution—You must give appropriate credit
+(mentioning that your work is derived from work that is Copyright (c)
+The Carpentries and, where practical, linking to https://carpentries.org/), provide a link to the
+license, and indicate if changes were made. You may do so in any
+reasonable manner, but not in any way that suggests the licensor
+endorses you or your use.
+
No additional restrictions—You may not apply
+legal terms or technological measures that legally restrict others from
+doing anything the license permits. With the understanding
+that:
+
Notices:
+
You do not have to comply with the license for elements of the
+material in the public domain or where your use is permitted by an
+applicable exception or limitation.
+
No warranties are given. The license may not give you all of the
+permissions necessary for your intended use. For example, other rights
+such as publicity, privacy, or moral rights may limit how you use the
+material.
+
Software
+
+
Except where otherwise noted, the example programs and other software
+provided by The Carpentries are made available under the OSI-approved MIT
+license.
+
Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+“Software”), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
Trademark
+
+
“The Carpentries”, “Software Carpentry”, “Data Carpentry”, and
+“Library Carpentry” and their respective logos are registered trademarks
+of Community Initiatives.
Describe the purpose and use of each pane in the RStudio IDE
+
Locate buttons and options in the RStudio IDE
+
Define a variable
+
Assign data to a variable
+
Manage a workspace in an interactive R session
+
Use mathematical and comparison operators
+
Call functions
+
Manage packages
+
+
+
+
+
+
+
Motivation
+
+
+
Science is a multi-step process: once you’ve designed an experiment
+and collected data, the real fun begins! This lesson will teach you how
+to start this process using R and RStudio. We will begin with raw data,
+perform exploratory analyses, and learn how to plot results graphically.
+This example starts with a dataset from gapminder.org containing population
+information for many countries through time. Can you read the data into
+R? Can you plot the population for Senegal? Can you calculate the
+average income for countries on the continent of Asia? By the end of
+these lessons you will be able to do things like plot the populations
+for all of these countries in under a minute!
+
Before Starting The Workshop
+
+
+
Please ensure you have the latest version of R and RStudio installed
+on your machine. This is important, as some packages used in the
+workshop may not install correctly (or at all) if R is not up to
+date.
Welcome to the R portion of the Software Carpentry workshop.
+
Throughout this lesson, we’re going to teach you some of the
+fundamentals of the R language as well as some best practices for
+organizing code for scientific projects that will make your life
+easier.
+
We’ll be using RStudio: a free, open-source R Integrated Development
+Environment (IDE). It provides a built-in editor, works on all platforms
+(including on servers) and provides many advantages such as integration
+with version control and project management.
+
Basic layout
+
When you first open RStudio, you will be greeted by three panels:
+
+
The interactive R console/Terminal (entire left)
+
Environment/History/Connections (tabbed in upper right)
+
Files/Plots/Packages/Help/Viewer (tabbed in lower right)
+
+
Once you open files, such as R scripts, an editor panel will also
+open in the top left.
+
+
+
+
+
+
R scripts
+
+
+
Any commands that you write in the R console can be saved to a file
+to be re-run later. Files containing R code to be run in this way are
+called R scripts. R scripts have .R at the end of their
+names to let you know what they are.
+
+
+
+
Workflow within RStudio
+
+
+
There are two main ways one can work within RStudio:
+
+
Test and play within the interactive R console then copy code into a
+.R file to run later.
+
+
+
This works well when doing small tests and initially starting
+off.
+
It quickly becomes laborious
+
+
+
Start writing in a .R file and use RStudio’s shortcut keys for the
+Run command to push the current line, selected lines or modified lines
+to the interactive R console.
+
+
+
This is a great way to start; all your code is saved for later
+
You will be able to run the file you create from within RStudio or
+using R’s source() function.
+
+
+
+
+
+
+
Tip: Running segments of your code
+
+
+
RStudio offers you great flexibility in running code from within the
+editor window. There are buttons, menu choices, and keyboard shortcuts.
+To run the current line, you can
+
+
click on the Run button above the editor panel, or
+
select “Run Lines” from the “Code” menu, or
+
hit Ctrl+Return in Windows or Linux or
+⌘+Return on OS X. (This shortcut can also be seen
+by hovering the mouse over the button.)
+
To run a block of code, select
+it and then Run. If you have modified a line of code within
+a block of code you have just run, there is no need to reselect the
+section and Run; you can use the next button along,
+Re-run the previous region. This will run the previous code
+block including the modifications you have made.
+
+
+
+
+
Introduction to R
+
+
+
Much of your time in R will be spent in the R interactive console.
+This is where you will run all of your code, and can be a useful
+environment to try out ideas before adding them to an R script file.
+This console in RStudio is the same as the one you would get if you
+typed in R in your command-line environment.
+
The first thing you will see in the R interactive session is a bunch
+of information, followed by a “>” and a blinking cursor. In many ways
+this is similar to the shell environment you learned about during the
+shell lessons: it operates on the same idea of a “Read, evaluate, print
+loop”: you type in commands, R tries to execute them, and then returns a
+result.
+
Using R as a calculator
+
+
+
The simplest thing you could do with R is to do arithmetic:
+
+
R
+
+
+1+100
+
+
+
OUTPUT
+
+
[1] 101
+
+
And R will print out the answer, with a preceding “[1]”. [1] is the
+index of the first element of the line being printed in the console. For
+more information on indexing vectors, see Episode
+6: Subsetting Data.
+
If you type in an incomplete command, R will wait for you to complete
+it. If you are familiar with Unix Shell’s bash, you may recognize
+this behavior from bash.
+
+
R
+
+
>1+
+
+
+
OUTPUT
+
+
+
+
+
Any time you hit return and the R session shows a “+” instead of a
+“>”, it means it’s waiting for you to complete the command. If you
+want to cancel a command you can hit Esc and RStudio will
+give you back the “>” prompt.
+
+
+
+
+
+
Tip: Canceling commands
+
+
+
If you’re using R from the command line instead of from within
+RStudio, you need to use Ctrl+C instead of
+Esc to cancel the command. This applies to Mac users as
+well!
+
Canceling a command isn’t only useful for killing incomplete
+commands: you can also use it to tell R to stop running code (for
+example if it’s taking much longer than you expect), or to get rid of
+the code you’re currently writing.
+
+
+
+
When using R as a calculator, the order of operations is the same as
+you would have learned back in school.
+
From highest to lowest precedence:
+
+
Parentheses: (, )
+
+
Exponents: ^ or **
+
+
Multiply: *
+
+
Divide: /
+
+
Add: +
+
+
Subtract: -
+
+
+
+
R
+
+
+3+5*2
+
+
+
OUTPUT
+
+
[1] 13
+
+
Use parentheses to group operations in order to force the order of
+evaluation if it differs from the default, or to make clear what you
+intend.
+
+
R
+
+
+(3+5)*2
+
+
+
OUTPUT
+
+
[1] 16
+
+
This can get unwieldy when not needed, but clarifies your intentions.
+Remember that others may later read your code.
+
+
R
+
+
+(3 + (5 * (2 ^ 2))) # hard to read
+3 + 5 * 2 ^ 2       # clear, if you remember the rules
+3 + 5 * (2 ^ 2)     # if you forget some rules, this might help
+
+
The text after each line of code is called a “comment”. Anything that
+follows after the hash (or octothorpe) symbol # is ignored
+by R when it executes code.
+
Really small or large numbers get a scientific notation:
+
+
R
+
+
+2/10000
+
+
+
OUTPUT
+
+
[1] 2e-04
+
+
Which is shorthand for “multiplied by 10^XX”. So
+2e-4 is shorthand for 2 * 10^(-4).
+
You can write numbers in scientific notation too:
+
+
R
+
+
+5e3  # Note the lack of minus here
+
+
+
OUTPUT
+
+
[1] 5000
+
+
Mathematical functions
+
+
+
R has many built in mathematical functions. To call a function, we
+can type its name, followed by open and closing parentheses. Functions
+take arguments as inputs; anything we type inside the parentheses of a
+function is considered an argument. Depending on the function, the
+number of arguments can vary from none to multiple. For example:
+
+
R
+
+
+getwd() # returns an absolute filepath
+
+
doesn’t require an argument, whereas for the next set of mathematical
+functions we will need to supply the function a value in order to
+compute the result.
+
+
R
+
+
+sin(1)  # trigonometry functions
+
+
+
OUTPUT
+
+
[1] 0.841471
+
+
+
R
+
+
+log(1)  # natural logarithm
+
+
+
OUTPUT
+
+
[1] 0
+
+
+
R
+
+
+log10(10) # base-10 logarithm
+
+
+
OUTPUT
+
+
[1] 1
+
+
+
R
+
+
+exp(0.5) # e^(1/2)
+
+
+
OUTPUT
+
+
[1] 1.648721
+
+
Don’t worry about trying to remember every function in R. You can
+look them up on Google, or if you can remember the start of the
+function’s name, use the tab completion in RStudio.
+
This is one advantage that RStudio has over R on its own: it has
+auto-completion abilities that allow you to more easily look up
+functions, their arguments, and the values that they take.
+
Typing a ? before the name of a command will open the
+help page for that command. When using RStudio, this will open the
+‘Help’ pane; if using R in the terminal, the help page will open in the
+terminal or in your browser, depending on your setup. The help page will
+include a detailed description of the
+command and how it works. Scrolling to the bottom of the help page will
+usually show a collection of code examples which illustrate command
+usage. We’ll go through an example later.
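
For instance, the lines below look up the built-in round() function in three ways. This is only an illustration. Any function name can be substituted.

```r
?round            # open the help page for round()
help("round")     # the same request, written as a function call
args(round)       # a quick reminder of the argument names
```

args() is useful when you only need to recall what a function's arguments are called, without leaving the console.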
+
Comparing things
+
+
+
We can also do comparisons in R:
+
+
R
+
+
+1 == 1  # equality (note two equals signs, read as "is equal to")
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
R
+
+
+1 != 2  # inequality (read as "is not equal to")
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
R
+
+
+1 < 2  # less than
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
R
+
+
+1 <= 1  # less than or equal to
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
R
+
+
+1 > 0  # greater than
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
R
+
+
+1 >= -9 # greater than or equal to
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
+
+
+
+
Tip: Comparing Numbers
+
+
+
A word of warning about comparing numbers: you should never use
+== to compare two numbers unless they are integers (a data
+type which can specifically represent only whole numbers).
+
Computers may only represent decimal numbers with a certain degree of
+precision, so two numbers which look the same when printed out by R, may
+actually have different underlying representations and therefore be
+different by a small margin of error (called Machine numeric
+tolerance).
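
For numbers with a decimal part, the base R function all.equal() compares within a small tolerance instead. A brief illustration:

```r
0.1 + 0.2 == 0.3                    # FALSE: the two sides differ in the last bits
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: equal within a small tolerance
```

Wrapping the call in isTRUE() matters because all.equal() returns a description of the difference, rather than FALSE, when the values are not close.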
We can store values in variables using the assignment operator
+<-, like this:
+
+
R
+
+
+x<-1/40
+
+
Notice that assignment does not print a value. Instead, we stored it
+for later in something called a variable.
+x now contains the value
+0.025:
+
+
R
+
+
+x
+
+
+
OUTPUT
+
+
[1] 0.025
+
+
More precisely, the stored value is a decimal approximation
+of this fraction called a floating point
+number.
+
Look for the Environment tab in the top right panel of
+RStudio, and you will see that x and its value have
+appeared. Our variable x can be used in place of a number
+in any calculation that expects a number:
+
+
R
+
+
+log(x)
+
+
+
OUTPUT
+
+
[1] -3.688879
+
+
Notice also that variables can be reassigned:
+
+
R
+
+
+x<-100
+
+
x used to contain the value 0.025 and now it has the
+value 100.
+
Assignment values can contain the variable being assigned to:
+
+
R
+
+
+x <- x + 1 # notice how RStudio updates its description of x on the top right tab
+y <- x * 2
+
+
The right hand side of the assignment can be any valid R expression.
+The right hand side is fully evaluated before the assignment
+occurs.
+
Variable names can contain letters, numbers, underscores and periods
+but no spaces. They must start with a letter or a period followed by a
+letter (they cannot start with a number nor an underscore). Variables
+beginning with a period are hidden variables. Different people use
+different conventions for long variable names; these include
+
+
periods.between.words
+
underscores_between_words
+
camelCaseToSeparateWords
+
+
What you use is up to you, but be consistent.
+
It is also possible to use the = operator for
+assignment:
+
+
R
+
+
+x=1/40
+
+
But this is much less common among R users. The most important thing
+is to be consistent with the operator you use. There
+are occasionally places where it is less confusing to use
+<- than =, and it is the most common symbol
+used in the community. So the recommendation is to use
+<-.
+
+
+
+
+
+
Challenge 1
+
+
+
Which of the following are valid R variable names?
The following cannot be used as variable names:
+
+
R
+
+
_age
+min-length
+2widths
+
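
You can also ask R itself. The base function make.names() rewrites a string into a syntactically valid name, so a string that it leaves unchanged was already valid. In this sketch, ".mass" and "min_height" are extra names added for contrast.

```r
candidates <- c("_age", "min-length", "2widths", ".mass", "min_height")

# make.names() rewrites a string into a syntactically valid name, so a
# string it leaves unchanged was already valid.
is_valid <- candidates == make.names(candidates)
data.frame(name = candidates, valid = is_valid)
```

Note that make.names() also alters reserved words such as if, so this check treats them as invalid too.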
+
+
+
+
+
Vectorization
+
+
+
One final thing to be aware of is that R is vectorized,
+meaning that variables and functions can have vectors as values. In
+contrast to physics and mathematics, a vector in R describes a set of
+values in a certain order of the same data type. For example
+
+
R
+
+
+1:5
+
+
+
OUTPUT
+
+
[1] 1 2 3 4 5
+
+
+
R
+
+
+2^(1:5)
+
+
+
OUTPUT
+
+
[1] 2 4 8 16 32
+
+
+
R
+
+
+x<-1:5
+2^x
+
+
+
OUTPUT
+
+
[1] 2 4 8 16 32
+
+
This is incredibly powerful; we will discuss this further in an
+upcoming lesson.
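
The same element-by-element behavior applies to other operators and to most mathematical functions. A short sketch with made-up values:

```r
heights_cm <- c(150, 165, 180)   # made-up values
heights_cm / 100                 # every element is divided by 100
heights_cm > 160                 # comparisons are element-wise too
sqrt(c(1, 4, 9))                 # as are most mathematical functions
```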
+
Managing your environment
+
+
+
There are a few useful commands you can use to interact with the R
+session.
+
ls will list all of the variables and functions stored
+in the global environment (your working R session):
+
+
R
+
+
+ls()
+
+
+
OUTPUT
+
+
[1] "x" "y"
+
+
+
+
+
+
+
Tip: hidden objects
+
+
+
Like in the shell, ls will hide any variables or
+functions starting with a “.” by default. To list all objects, type
+ls(all.names=TRUE) instead.
+
+
+
+
Note here that we didn’t give any arguments to ls, but
+we still needed to give the parentheses to tell R to call the
+function.
+
If we type ls by itself, R prints a bunch of code
+instead of a listing of objects.
+
+
R
+
+
+ls
+
+
+
OUTPUT
+
+
function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE,
+ pattern, sorted = TRUE)
+{
+ if (!missing(name)) {
+ pos <- tryCatch(name, error = function(e) e)
+ if (inherits(pos, "error")) {
+ name <- substitute(name)
+ if (!is.character(name))
+ name <- deparse(name)
+ warning(gettextf("%s converted to character string",
+ sQuote(name)), domain = NA)
+ pos <- name
+ }
+ }
+ all.names <- .Internal(ls(envir, all.names, sorted))
+ if (!missing(pattern)) {
+ if ((ll <- length(grep("[", pattern, fixed = TRUE))) &&
+ ll != length(grep("]", pattern, fixed = TRUE))) {
+ if (pattern == "[") {
+ pattern <- "\\["
+ warning("replaced regular expression pattern '[' by '\\\\['")
+ }
+ else if (length(grep("[^\\\\]\\[<-", pattern))) {
+ pattern <- sub("\\[<-", "\\\\\\[<-", pattern)
+ warning("replaced '[<-' by '\\\\[<-' in regular expression pattern")
+ }
+ }
+ grep(pattern, all.names, value = TRUE)
+ }
+ else all.names
+}
+<bytecode: 0x557b0600c360>
+<environment: namespace:base>
+
+
What’s going on here?
+
Like everything in R, ls is the name of an object, and
+entering the name of an object by itself prints the contents of the
+object. The object x that we created earlier contains 1, 2,
+3, 4, 5:
+
+
R
+
+
+x
+
+
+
OUTPUT
+
+
[1] 1 2 3 4 5
+
+
The object ls contains the R code that makes the
+ls function work! We’ll talk more about how functions work
+and start writing our own later.
+
You can use rm to delete objects you no longer need:
+
+
R
+
+
+rm(x)
+
+
If you have lots of things in your environment and want to delete all
+of them, you can pass the results of ls to the
+rm function:
+
+
R
+
+
+rm(list = ls())
+
+
In this case we’ve combined the two. Like the order of operations,
+anything inside the innermost parentheses is evaluated first, and so
+on.
+
In this case we’ve specified that the results of ls
+should be used for the list argument in rm.
+When assigning values to arguments by name, you must use the
+= operator!
+
If instead we use <-, there will be unintended side
+effects, or you may get an error message:
+
+
R
+
+
+rm(list <- ls())
+
+
+
ERROR
+
+
Error in rm(list <- ls()): ... must contain names or character strings
+
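
The side effect is easiest to see with a function that accepts its arguments by position. In this sketch (the variable names demo_value and demo_digits are made up), the arrow first assigns in your workspace and then passes the values along by position:

```r
# With `=`, the names refer to round()'s own arguments.
round(x = 3.14159, digits = 2)

# With `<-`, R first creates variables in your workspace and then passes
# their values by position. The names here are made up for the demo.
round(demo_value <- 3.14159, demo_digits <- 2)
demo_value     # a new variable now exists in the workspace
demo_digits    # and so does this one
```

Both calls return the same rounded number, which is why this mistake can go unnoticed until a variable is unexpectedly overwritten.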
+
+
+
+
+
+
Tip: Warnings vs. Errors
+
+
+
Pay attention when R does something unexpected! Errors, like above,
+are thrown when R cannot proceed with a calculation. Warnings on the
+other hand usually mean that the function has run, but it probably
+hasn’t worked as expected.
+
In both cases, the message that R prints out usually gives you clues
+about how to fix a problem.
+
+
+
+
R Packages
+
+
+
It is possible to add functions to R by writing a package, or by
+obtaining a package written by someone else. As of this writing, there
+are over 10,000 packages available on CRAN (the Comprehensive R Archive
+Network). R and RStudio have functionality for managing packages:
+
+
You can see what packages are installed by typing
+installed.packages()
+
+
You can install packages by typing
+install.packages("packagename"), where
+packagename is the package name, in quotes.
+
You can update installed packages by typing
+update.packages()
+
+
You can remove a package with
+remove.packages("packagename")
+
+
You can make a package available for use with
+library(packagename)
+
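
These commands are often combined into an "install only if missing, then load" pattern at the top of a script. A minimal sketch. It uses "stats", which ships with R, so that it runs without a network connection. Substitute the name of the package you actually need.

```r
# "stats" ships with R, so this runs without a network connection;
# substitute the name of the package you actually need.
pkg <- "stats"

if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)              # only reached when the package is missing
}
library(pkg, character.only = TRUE)  # character.only: pkg holds the name as a string

# Confirm that the package is now attached.
paste0("package:", pkg) %in% search()
```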
+
+
Packages can also be viewed, loaded, and detached in the Packages tab
+of the lower right panel in RStudio. Clicking on this tab will display
+all of the installed packages with a checkbox next to them. If the box
+next to a package name is checked, the package is loaded and if it is
+empty, the package is not loaded. Click an empty box to load that
+package and click a checked box to detach that package.
+
Packages can be installed and updated from the Packages tab with the
+Install and Update buttons at the top of the tab.
+
+
+
+
+
+
Challenge 2
+
+
+
What will be the value of each variable after each statement in the
+following program?
The scientific process is naturally incremental, and many projects
+start life as random notes, some code, then a manuscript, and eventually
+everything is a bit mixed together.
+
+
+Managing your projects in a reproducible fashion doesn’t just make your
+science reproducible, it makes your life easier.
+
Most people tend to organize their projects like this:
+
There are many reasons why we should ALWAYS avoid this:
+
+
It is really hard to tell which version of your data is the original
+and which is the modified;
+
It gets really messy because it mixes files with various extensions
+together;
+
It probably takes you a lot of time to actually find things, and
+relate the correct figures to the exact code that has been used to
+generate them;
+
+
A good project layout will ultimately make your life easier:
+
+
It will help ensure the integrity of your data;
+
It makes it simpler to share your code with someone else (a
+lab-mate, collaborator, or supervisor);
+
It allows you to easily upload your code with your manuscript
+submission;
+
It makes it easier to pick the project back up after a break.
+
A possible solution
+
+
+
Fortunately, there are tools and packages which can help you manage
+your work effectively.
+
One of the most powerful and useful aspects of RStudio is its project
+management functionality. We’ll be using this today to create a
+self-contained, reproducible project.
+
+
+
+
+
+
Challenge 1: Creating a self-contained
+project
+
+
+
We’re going to create a new project in RStudio:
+
+
Click the “File” menu button, then “New Project”.
+
Click “New Directory”.
+
Click “New Project”.
+
Type in the name of the directory to store your project,
+e.g. “my_project”.
+
If available, select the checkbox for “Create a git
+repository.”
+
Click the “Create Project” button.
+
+
+
+
+
The simplest way to open an RStudio project once it has been created
+is to click through your file system to get to the directory where it
+was saved and double click on the .Rproj file. This will
+open RStudio and start your R session in the same directory as the
+.Rproj file. All your data, plots and scripts will now be
+relative to the project directory. RStudio projects have the added
+benefit of allowing you to open multiple projects at the same time each
+open to its own project directory. This allows you to keep multiple
+projects open without them interfering with each other.
+
+
+
+
+
+
Challenge 2: Opening an RStudio project
+through the file system
+
+
+
+
Exit RStudio.
+
Navigate to the directory where you created a project in Challenge
+1.
+
Double click on the .Rproj file in that directory.
+
+
+
+
+
Best practices for project organization
+
+
+
Although there is no “best” way to lay out a project, there are some
+general principles to adhere to that will make project management
+easier:
+
+
Treat data as read only
+
+
This is probably the most important goal of setting up a project.
+Data is typically time consuming and/or expensive to collect. Working
+with it interactively (e.g., in Excel) where it can be modified
+means you are never sure of where the data came from, or how it has been
+modified since collection. It is therefore a good idea to treat your
+data as “read-only”.
+
+
+
Data Cleaning
+
+
In many cases your data will be “dirty”: it will need significant
+preprocessing to get into a format R (or any other programming language)
+will find useful. This task is sometimes called “data munging”. Storing
+these scripts in a separate folder, and creating a second “read-only”
+data folder to hold the “cleaned” data sets can prevent confusion
+between the two sets.
+
+
+
Treat generated output as disposable
+
+
Anything generated by your scripts should be treated as disposable:
+it should all be able to be regenerated from your scripts.
+
There are lots of different ways to manage this output. Having an
+output folder with different sub-directories for each separate analysis
+makes it easier later, since many analyses are exploratory and don’t end
+up being used in the final project, and some of the analyses get shared
+between projects.
+
+
+
+
+
+
Tip: Good Enough Practices for Scientific
+Computing
+
Put each project in its own directory, which is named after the
+project.
+
Put text documents associated with the project in the
+doc directory.
+
Put raw data and metadata in the data directory, and
+files generated during cleanup and analysis in a results
+directory.
+
Put source for the project’s scripts and programs in the
+src directory, and programs brought in from elsewhere or
+compiled locally in the bin directory.
+
Name all files to reflect their content or function.
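The directory layout described in this tip can be created from R itself. This is only a sketch: the project name my_project and its location inside a temporary directory are assumptions made so the example does not touch your real files.

```r
# Hypothetical project location; a temporary directory keeps the example harmless.
project_root <- file.path(tempdir(), "my_project")

# Sub-directories named in the tip above.
subdirs <- c("data", "doc", "results", "src", "bin")

# recursive = TRUE also creates project_root itself if it does not exist yet;
# showWarnings = FALSE keeps the call quiet when a directory is already there.
for (d in subdirs) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

# List what was created.
list.files(project_root)
```

In a real project you would point project_root at the folder that holds your .Rproj file.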
+
+
+
+
+
+
+
Separate function definition and application
+
+
One of the more effective ways to work with R is to start by writing
+the code you want to run directly in a .R script, and then running the
+selected lines (either using the keyboard shortcuts in RStudio or
+clicking the “Run” button) in the interactive R console.
+
When your project is in its early stages, the initial .R script file
+usually contains many lines of directly executed code. As it matures,
+reusable chunks get pulled into their own functions. It’s a good idea to
+separate this code into two folders: one to store useful
+functions that you’ll reuse across analyses and projects, and one to
+store the analysis scripts.
+
+
+
Save the data in the data directory
+
+
Now that we have a good directory structure, we will save the
+data file in the data/ directory.
Download the file (right mouse click on the link above -> “Save
+link as” / “Save file as”, or click on the link and after the page
+loads, press Ctrl+S or choose File -> “Save
+page as”)
+
Make sure it’s saved under the name
+gapminder_data.csv
+
+
Save the file in the data/ folder within your
+project.
+
+
We will load and inspect these data later.
+
+
+
+
+
+
+
+
+
Challenge 4
+
+
+
It is useful to get some general idea about the dataset, directly
+from the command line, before loading it into R. Understanding the
+dataset better will come in handy when making decisions on how to load
+it in R. Use the command-line shell to answer the following
+questions:
+
+
What is the size of the file?
+
How many rows of data does it contain?
+
What kinds of values are stored in this file?
+
+
+
+
+
+
+
+
+
+
By running these commands in the shell:
+
+
SH
+
+
ls -lh data/gapminder_data.csv
+
+
+
OUTPUT
+
+
-rw-r--r-- 1 runner docker 80K Oct 26 09:54 data/gapminder_data.csv
The Terminal tab in the console pane provides a convenient place
+within RStudio to interact directly with the command line.
+
+
+
+
+
+
Working directory
+
+
Knowing R’s current working directory is important because when you
+need to access other files (for example, to import a data file), R will
+look for them relative to the current working directory.
+
Each time you create a new RStudio Project, it will create a new
+directory for that project. When you open an existing
+.Rproj file, it will open that project and set R’s working
+directory to the folder that file is in.
+
+
+
+
+
+
Challenge 5
+
+
+
You can check the current working directory with the
+getwd() command, or by using the menus in RStudio.
+
+
In the console, type getwd() (“wd” is short for
+“working directory”) and hit Enter.
+
In the Files pane, double click on the data folder to
+open it (or navigate to any other folder you wish). To get the Files
+pane back to the current working directory, click “More” and then select
+“Go To Working Directory”.
+
+
You can change the working directory with setwd(), or by
+using RStudio menus.
+
+
In the console, type setwd("data") and hit Enter. Type
+getwd() and hit Enter to see the new working
+directory.
+
In the menus at the top of the RStudio window, click the “Session”
+menu button, and then select “Set Working Directory” and then “Choose
+Directory”. Next, in the windows navigator that opens, navigate back to
+the project directory, and click “Open”. Note that a setwd
+command will automatically appear in the console.
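The getwd() and setwd() steps above can be combined into a short round trip. This sketch assumes a temporary directory as the target, so it works even if you have no data folder, and it restores the original working directory at the end.

```r
# Remember where we started.
original_wd <- getwd()

# setwd() invisibly returns the previous working directory,
# which makes it easy to switch back later.
previous_wd <- setwd(tempdir())

# We are now inside the temporary directory.
getwd()

# Return to where we started.
setwd(previous_wd)
getwd()
```

Inside an RStudio project it is usually simpler to leave the working directory alone and use paths relative to the project directory.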
+
+
+
+
+
+
+
+
+
+
Tip: File does not exist errors
+
+
+
When you’re attempting to reference a file in your R code and you’re
+getting errors saying the file doesn’t exist, it’s a good idea to check
+your working directory. You need to either provide an absolute path to
+the file, or you need to make sure the file is saved in the working
+directory (or a subfolder of the working directory) and provide a
+relative path.
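A quick way to diagnose such errors is to ask R where it is looking. The sketch below assumes the file from this lesson, data/gapminder_data.csv, as the relative path. It only reports what R sees and does not read the file.

```r
# Relative path used in this lesson; file.path() builds it portably.
data_path <- file.path("data", "gapminder_data.csv")

# Where is R looking from?
getwd()

# Does the file exist relative to that directory? Returns TRUE or FALSE.
file.exists(data_path)

# Show how the relative path resolves; mustWork = FALSE avoids an error
# when the file is not there.
normalizePath(data_path, mustWork = FALSE)
```

If file.exists() returns FALSE, either the working directory or the relative path needs to change.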
To be able to read R help files for functions and special
+operators.
+
To be able to use CRAN task views to identify packages to solve a
+problem.
+
To be able to seek help from your peers.
+
+
+
+
+
+
+
Reading Help Files
+
+
+
R, and every package, provide help files for functions. The general
+syntax to search for help on any function, “function_name”, that is
+in a package loaded into your namespace (your
+interactive R session) is:
+
+
R
+
+
+?function_name
+help(function_name)
+
+
For example, take a look at the help file for
+write.table(); we will be using a similar function in an
+upcoming episode.
+
+
R
+
+
+?write.table()
+
+
This will load up a help page in RStudio (or as plain text in R
+itself).
+
Each help page is broken down into sections:
+
+
Description: An extended description of what the function does.
+
Usage: The arguments of the function and their default values (which
+can be changed).
+
Arguments: An explanation of the data each argument is
+expecting.
+
Details: Any important details to be aware of.
+
Value: The data the function returns.
+
See Also: Any related functions you might find useful.
+
Examples: Some examples for how to use the function.
+
+
Different functions might have different sections, but these are the
+main ones you should be aware of.
+
Notice how related functions might call for the same help file:
+
+
R
+
+
+?write.table()
+?write.csv()
+
+
This is because these functions have very similar applicability and
+often share the same arguments as inputs to the function, so package
+authors often choose to document them together in a single help
+file.
+
+
+
+
+
+
Tip: Running Examples
+
+
+
From within the function help page, you can highlight code in the
+Examples and hit Ctrl+Return to run it in the RStudio
+console. This gives you a quick way to get a feel for how a function
+works.
+
+
+
+
+
+
+
+
+
Tip: Reading Help Files
+
+
+
One of the most daunting aspects of R is the large number of
+functions available. It would be prohibitive, if not impossible, to
+remember the correct usage for every function you use. Luckily, using
+the help files means you don’t have to remember that!
+
+
+
+
Special Operators
+
+
+
To seek help on special operators, use quotes or backticks:
+
+
R
+
+
+?"<-"
+?`<-`
+
+
Getting Help with Packages
+
+
+
Many packages come with “vignettes”: tutorials and extended example
+documentation. Without any arguments, vignette() will list
+all vignettes for all installed packages;
+vignette(package="package-name") will list all available
+vignettes for package-name, and
+vignette("vignette-name") will open the specified
+vignette.
+
If a package doesn’t have any vignettes, you can usually find help by
+typing help("package-name").
+
RStudio also has a set of excellent cheatsheets for
+many packages.
+
When You Remember Part of the Function Name
+
+
+
If you’re not sure what package a function is in or how it’s
+specifically spelled, you can do a fuzzy search:
+
+
R
+
+
+??function_name
+
+
A fuzzy search is when you search for an approximate string match.
+For example, you may remember that the function to set your working
+directory includes “set” in its name. You can do a fuzzy search to help
+you identify the function:
+
+
R
+
+
+??set
+
+
When You Have No Idea Where to Begin
+
+
+
If you don’t know what function or package you need to use, CRAN Task Views is a
+specially maintained list of packages grouped into fields. This can be a
+good starting point.
+
When Your Code Doesn’t Work: Seeking Help from Your Peers
+
+
+
If you’re having trouble using a function, 9 times out of 10, the
+question you have has already been answered on Stack Overflow. You can search
+using the [r] tag. Please make sure to see their page on how to ask a good
+question.
+
If you can’t find the answer, there are a few useful functions to
+help you ask your peers:
+
+
R
+
+
+?dput
+
+
Will dump the data you’re working with into a format that can be
+copied and pasted by others into their own R session.
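As an illustration of dput(), here is a minimal sketch using a small made-up data frame (example_df is a hypothetical name, not part of the lesson data). It prints the text representation you would paste into a question, then shows that the representation can be read back with dget().

```r
# A small, made-up data frame to share with others.
example_df <- data.frame(id = 1:3, label = c("a", "b", "c"))

# Print a text representation that can be copied into a forum post.
dput(example_df)

# The same representation can be written to a file and read back with dget().
tmp <- tempfile(fileext = ".R")
dput(example_df, file = tmp)
restored <- dget(tmp)

# Compare the restored object with the original.
all.equal(restored, example_df)
```

A helper only needs to paste the dput() output after `example_df <-` to get the same object in their own session.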
+
+
R
+
+
+sessionInfo()
+
+
+
OUTPUT
+
+
R version 4.3.1 (2023-06-16)
+Platform: x86_64-pc-linux-gnu (64-bit)
+Running under: Ubuntu 22.04.3 LTS
+
+Matrix products: default
+BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
+LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
+
+locale:
+ [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
+ [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
+ [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
+[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
+
+time zone: UTC
+tzcode source: system (glibc)
+
+attached base packages:
+[1] stats graphics grDevices utils datasets methods base
+
+loaded via a namespace (and not attached):
+[1] compiler_4.3.1 tools_4.3.1 rstudioapi_0.15.0 yaml_2.3.7
+[5] knitr_1.43 xfun_0.40 renv_1.0.3 evaluate_0.21
+
+
Will print out your current version of R, as well as any packages you
+have loaded. This can be useful for others to help reproduce and debug
+your issue.
+
+
+
+
+
+
Challenge 1
+
+
+
Look at the help page for the c function. What kind of
+vector do you expect will be created if you evaluate the following:
+
+
R
+
+
+c(1, 2, 3)
+c('d', 'e', 'f')
+c(1, 2, 'f')
+
+
+
+
+
+
+
+
+
+
The c() function creates a vector, in which all elements
+are of the same type. In the first case, the elements are numeric, in
+the second, they are characters, and in the third they are also
+characters: the numeric values are “coerced” to be characters.
+
+
+
+
+
+
+
+
+
+
Challenge 2
+
+
+
Look at the help for the paste function. You will need
+to use it later. What’s the difference between the sep and
+collapse arguments?
+
+
+
+
+
+
+
+
+
To look at the help for the paste() function, use:
+
+
R
+
+
+help("paste")
+?paste
+
+
The difference between sep and collapse is
+a little tricky. The paste function accepts any number of
+arguments, each of which can be a vector of any length. The
+sep argument specifies the string used between concatenated
+terms — by default, a space. The result is a vector as long as the
+longest argument supplied to paste. In contrast,
+collapse specifies that after concatenation the elements
+are collapsed together using the given separator, the result
+being a single string.
+
It is important to call the arguments explicitly by typing out the
+argument name, e.g. sep = ",", so the function understands to
+use the “,” as a separator and not a term to concatenate. For example:
+
+
R
+
+
+paste(c("a", "b"), "c")
+
+
+
OUTPUT
+
+
[1] "a c" "b c"
+
+
+
R
+
+
+paste(c("a", "b"), "c", ",")
+
+
+
OUTPUT
+
+
[1] "a c ," "b c ,"
+
+
+
R
+
+
+paste(c("a", "b"), "c", sep = ",")
+
+
+
OUTPUT
+
+
[1] "a,c" "b,c"
+
+
+
R
+
+
+paste(c("a", "b"), "c", collapse = "|")
+
+
+
OUTPUT
+
+
[1] "a c|b c"
+
+
+
R
+
+
+paste(c("a", "b"), "c", sep = ",", collapse = "|")
+
+
+
OUTPUT
+
+
[1] "a,c|b,c"
+
+
(For more information, scroll to the bottom of the
+?paste help page and look at the examples, or try
+example('paste').)
+
+
+
+
+
+
+
+
+
+
Challenge 3
+
+
+
Use help to find a function (and its associated parameters) that you
+could use to load data from a tabular file in which columns are
+delimited with “\t” (tab) and the decimal point is a “.” (period). This
+check for decimal separator is important, especially if you are working
+with international colleagues, because different countries have
+different conventions for the decimal point (i.e. comma vs period).
+Hint: use ??"read table" to look up functions related to
+reading in tabular data.
+
+
+
+
+
+
+
+
+
The standard R function for reading tab-delimited files with a period
+decimal separator is read.delim(). You can also do this with
+read.table(file, sep="\t") (the period is the
+default decimal separator for read.table()),
+although you may have to change the comment.char argument
+as well if your data file contains hash (#) characters.
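To see that the two approaches agree, here is a small sketch. The tab-delimited content is made up for illustration and is passed through the text argument instead of a file, so the example is self-contained.

```r
# Made-up tab-delimited content with a period as the decimal separator.
tab_text <- "name\tvalue\nalpha\t1.5\nbeta\t2.25\n"

# read.delim() defaults to sep = "\t" and dec = ".".
measurements <- read.delim(text = tab_text)
str(measurements)

# The equivalent read.table() call needs the separator and header spelled out.
same_measurements <- read.table(text = tab_text, sep = "\t", header = TRUE, dec = ".")

# Both calls should produce the same data frame.
all.equal(measurements, same_measurements)
```

With a real file, you would replace `text = tab_text` with the path to the file.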
To begin exploring data frames, and understand how they are related
+to vectors and lists.
+
To be able to ask questions from R about the type, class, and
+structure of an object.
+
To understand the information of the attributes “names”, “class”,
+and “dim”.
+
+
+
+
+
+
+
One of R’s most powerful features is its ability to deal with tabular
+data - such as you may already have in a spreadsheet or a CSV file.
+Let’s start by making a toy dataset in your data/
+directory, called feline-data.csv:
We can now save cats as a CSV file. It is good practice
+to call the argument names explicitly so the function knows what default
+values you are changing. Here we are setting
+row.names = FALSE. Recall you can use
+?write.csv to pull up the help file to check out the
+argument names and their default values.
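A minimal sketch of this step follows. It assumes the three cats whose values appear in the outputs later in this lesson, and it writes to a temporary directory (the csv_path name is hypothetical) so the example runs anywhere. In the lesson itself the target would be data/feline-data.csv.

```r
# The toy data set: values taken from the outputs shown in this lesson.
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_string = c(1, 0, 1)
)

# Hypothetical location in a temporary directory.
csv_path <- file.path(tempdir(), "feline-data.csv")

# Name the arguments explicitly; row.names = FALSE avoids writing an extra
# column of row numbers.
write.csv(x = cats, file = csv_path, row.names = FALSE)

# Read the file back in to check the result.
cats_from_file <- read.csv(file = csv_path)
cats_from_file
```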
The read.table function is used for reading in tabular
+data stored in a text file where the columns of data are separated by
+punctuation characters, as in CSV files (csv = comma-separated values).
+Tabs and commas are the most common punctuation characters used to
+separate or delimit data points in such files. For convenience R provides
+two other versions of read.table. These are:
+read.csv for files where the data are separated with commas
+and read.delim for files where the data are separated with
+tabs. Of these three functions read.csv is the most
+commonly used. If needed it is possible to override the default
+delimiting punctuation marks for both read.csv and
+read.delim.
+
+
+
+
+
+
Check your data for factors
+
+
+
In recent times, the default way R handles textual data has
+changed. Text data used to be interpreted by R automatically into a format
+called “factors”. But there is an easier format that is called
+“character”. We will hear about factors later, and what to use them for.
+For now, remember that in most cases, they are not needed and only
+complicate your life, which is why newer R versions read in text as
+“character”. Check now if your version of R has automatically created
+factors and convert them to “character” format:
+
+
Check the data types of your input by typing
+str(cats)
+
+
In the output, look at the short type codes after the colons: If
+you see only “num” and “chr”, you can continue with the lesson and skip
+this box. If you find “Factor”, continue to step 3.
+
Prevent R from automatically creating “factor” data. That can be
+done by the following code:
+options(stringsAsFactors = FALSE). Then, re-read the cats
+table for the change to take effect.
+
You must set this option every time you restart R. To not forget
+this, include it in your analysis script before you read in any data,
+for example in one of the first lines.
+
For R versions 4.0.0 and later, text data is no longer converted
+to factors by default. So you can install this or a newer version to avoid
+this problem. If you are working on an institute or company computer,
+ask your administrator to do it.
+
+
+
+
+
We can begin exploring our dataset right away, pulling out columns by
+specifying them using the $ operator:
+
+
R
+
+
+cats$weight
+
+
+
OUTPUT
+
+
[1] 2.1 5.0 3.2
+
+
+
R
+
+
+cats$coat
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
We can do other operations on the columns:
+
+
R
+
+
+## Say we discovered that the scale weighs two kg light:
+cats$weight + 2
+
+
+
OUTPUT
+
+
[1] 4.1 7.0 5.2
+
+
+
R
+
+
+paste("My cat is", cats$coat)
+
+
+
OUTPUT
+
+
[1] "My cat is calico" "My cat is black" "My cat is tabby"
+
+
But what about
+
+
R
+
+
+cats$weight + cats$coat
+
+
+
ERROR
+
+
Error in cats$weight + cats$coat: non-numeric argument to binary operator
+
+
Understanding what happened here is key to successfully analyzing
+data in R.
+
+
Data Types
+
+
If you guessed that the last command will return an error because
+2.1 plus "black" is nonsense, you’re right -
+and you already have some intuition for an important concept in
+programming called data types. We can ask what type of data
+something is:
+
+
R
+
+
+typeof(cats$weight)
+
+
+
OUTPUT
+
+
[1] "double"
+
+
There are 5 main types: double, integer,
+complex, logical and character.
+For historic reasons, double is also called
+numeric.
+
+
R
+
+
+typeof(3.14)
+
+
+
OUTPUT
+
+
[1] "double"
+
+
+
R
+
+
+typeof(1L) # The L suffix forces the number to be an integer, since by default R uses floating-point numbers
+
+
+
OUTPUT
+
+
[1] "integer"
+
+
+
R
+
+
+typeof(1+1i)
+
+
+
OUTPUT
+
+
[1] "complex"
+
+
+
R
+
+
+typeof(TRUE)
+
+
+
OUTPUT
+
+
[1] "logical"
+
+
+
R
+
+
+typeof('banana')
+
+
+
OUTPUT
+
+
[1] "character"
+
+
No matter how complicated our analyses become, all data in R is
+interpreted as one of these basic data types. This strictness has some
+really important consequences.
+
A user has added details of another cat. This information is in the
+file data/feline-data_v2.csv.
+
+
R
+
+
+file.show("data/feline-data_v2.csv")
+
+
+
R
+
+
coat,weight,likes_string
+calico,2.1,1
+black,5.0,0
+tabby,3.2,1
+tabby,2.3 or 2.4,1
+
+
Load the new cats data like before, and check what type of data we
+find in the weight column:
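The following sketch reproduces this step without needing the file on disk. It assumes the file content shown above and feeds it to read.csv() through the text argument. The name cats_v2 is hypothetical, chosen so the original cats object is not overwritten.

```r
# Content of data/feline-data_v2.csv as shown above.
csv_text <- "coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
tabby,2.3 or 2.4,1"

# Read it the same way as a file, but from a string.
cats_v2 <- read.csv(text = csv_text)

# Check the type of the weight column. In R 4.0.0 and later this is
# "character"; older versions create a factor, whose type is "integer".
typeof(cats_v2$weight)
```

Either way, the single entry “2.3 or 2.4” is enough to stop R from treating the whole column as double.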
Oh no, our weights aren’t the double type anymore! If we try to do
+the same math we did on them before, we run into trouble:
+
+
R
+
+
+cats$weight + 2
+
+
+
ERROR
+
+
Error in cats$weight + 2: non-numeric argument to binary operator
+
+
What happened? The cats data we are working with is
+something called a data frame. Data frames are one of the most
+common and versatile types of data structures we will work with
+in R. A given column in a data frame cannot be composed of different
+data types. In this case, R cannot read everything in the data frame
+column weight as a double, so the data type of the entire
+column changes to something that is suitable for everything in
+the column.
+
When R reads a csv file, it reads it in as a data frame.
+Thus, when we loaded the cats csv file, it was stored as a
+data frame. We can recognize data frames by the first row that is
+written by the str() function:
Data frames are composed of rows and columns, where each
+column has the same number of rows. Different columns in a data frame
+can be made up of different data types (this is what makes them so
+versatile), but everything in a given column needs to be the same type
+(e.g., vector, factor, or list).
+
Let’s explore more about different data structures and how they
+behave. For now, let’s remove that extra line from our cats data and
+reload it, while we investigate this behavior further:
To better understand this behavior, let’s meet another of the data
+structures: the vector.
+
+
R
+
+
+my_vector <- vector(length = 3)
+my_vector
+
+
+
OUTPUT
+
+
[1] FALSE FALSE FALSE
+
+
A vector in R is essentially an ordered list of things, with the
+special condition that everything in the vector must be the same
+basic data type. If you don’t choose the datatype, it’ll default to
+logical; or, you can declare an empty vector of whatever
+type you like.
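For instance, a vector of a chosen type can be declared with the mode argument of vector(). The variable name another_vector is only illustrative.

```r
# Declare an empty character vector of length 3.
another_vector <- vector(mode = "character", length = 3)

# Each element starts out as an empty string.
another_vector

# str() summarises the type, the index range, and the first few values.
str(another_vector)
```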
The somewhat cryptic output from this command indicates the basic
+data type found in this vector - in this case chr,
+character; an indication of the number of things in the vector -
+actually, the indexes of the vector, in this case [1:3];
+and a few examples of what’s actually in the vector - in this case empty
+character strings. If we similarly do
+
+
R
+
+
+str(cats$weight)
+
+
+
OUTPUT
+
+
num [1:3] 2.1 5 3.2
+
+
we see that cats$weight is a vector, too - the
+columns of data we load into R data.frames are all vectors, and
+that’s the root of why R forces everything in a column to be the same
+basic data type.
+
+
+
+
+
+
Discussion 1
+
+
+
Why is R so opinionated about what we put in our columns of data? How
+does this help us?
+
+
+
+
+
+
By keeping everything in a column the same, we allow ourselves to
+make simple assumptions about our data; if you can interpret one entry
+in the column as a number, then you can interpret all of them
+as numbers, so we don’t have to check every time. This consistency is
+what people mean when they talk about clean data; in the long
+run, strict consistency goes a long way to making our lives easier in
+R.
+
+
+
+
+
+
+
+
+
Coercion by combining vectors
+
+
You can also make vectors with explicit contents with the combine
+function:
+
+
R
+
+
+combine_vector <- c(2, 6, 3)
+combine_vector
+
+
+
OUTPUT
+
+
[1] 2 6 3
+
+
Given what we’ve learned so far, what do you think the following will
+produce?
+
+
R
+
+
+quiz_vector <- c(2, 6, '3')
+
+
This is something called type coercion, and it is the source
+of many surprises and the reason why we need to be aware of the basic
+data types and how R will interpret them. When R encounters a mix of
+types (here double and character) to be combined into a single vector,
+it will force them all to be the same type. Consider:
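A short sketch of coercion in action follows. The variable names other than quiz_vector are illustrative.

```r
# Mixing doubles and a character: everything becomes character.
quiz_vector <- c(2, 6, "3")
quiz_vector
typeof(quiz_vector)

# Mixing a character and a logical: the logical becomes the string "TRUE".
coercion_vector <- c("a", TRUE)
coercion_vector

# Mixing a double and a logical: TRUE becomes the number 1.
another_coercion_vector <- c(0, TRUE)
another_coercion_vector
typeof(another_coercion_vector)
```

Note the quotes in the printed quiz_vector: the numbers 2 and 6 are now the strings "2" and "6".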
The coercion rules go: logical ->
+integer -> double (“numeric”)
+-> complex -> character, where -> can
+be read as are transformed into. For example, combining
+logical and character transforms the result to
+character:
+
+
R
+
+
+c('a', TRUE)
+
+
+
OUTPUT
+
+
[1] "a" "TRUE"
+
+
A quick way to recognize character vectors is by the
+quotes that enclose them when they are printed.
+
You can try to force coercion against this flow using the
+as. functions:
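The following sketch uses illustrative variable names to convert a character vector to double and then to logical, and also shows what happens when the character vector is converted to logical directly.

```r
# Numbers stored as text.
character_vector_example <- c("0", "2", "4")
character_vector_example

# Character to double: the strings are parsed as numbers.
character_coerced_to_double <- as.double(character_vector_example)
character_coerced_to_double

# Double to logical: 0 becomes FALSE, any other number becomes TRUE.
double_coerced_to_logical <- as.logical(character_coerced_to_double)
double_coerced_to_logical

# Character straight to logical: "0", "2" and "4" are not recognised
# as TRUE/FALSE spellings, so every element becomes NA.
as.logical(character_vector_example)
```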
As you can see, some surprising things can happen when R forces one
+basic data type into another! Nitty-gritty of type coercion aside, the
+point is: if your data doesn’t look like what you thought it was going
+to look like, type coercion may well be to blame; make sure everything
+is the same type in your vectors and your columns of data.frames, or you
+will get nasty surprises!
+
But coercion can also be very useful! For example, in our
+cats data likes_string is numeric, but we know
+that the 1s and 0s actually represent TRUE and
+FALSE (a common way of representing them). We should use
+the logical datatype here, which has two states:
+TRUE or FALSE, which is exactly what our data
+represents. We can ‘coerce’ this column to be logical by
+using the as.logical function:
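A minimal sketch of that conversion follows, rebuilding the three-cat data frame from the values shown earlier so the example is self-contained.

```r
# The toy data set, with likes_string stored as the numbers 1 and 0.
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_string = c(1, 0, 1)
)

# Before: a numeric column.
cats$likes_string

# Coerce the column to logical: 1 becomes TRUE and 0 becomes FALSE.
cats$likes_string <- as.logical(cats$likes_string)

# After: a logical column.
cats$likes_string
```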
An important part of every data analysis is cleaning the input data.
+If you know that the input data is all of the same format
+(e.g. numbers), your analysis is much easier! Clean the cat data set
+from the chapter about type coercion.
+
+
Copy the code template
+
+
Create a new script in RStudio and copy and paste the following code.
+Then move on to the tasks below, which help you to fill in the gaps
+(______).
+
# Read data
+cats <- read.csv("data/feline-data_v2.csv")
+
+# 1. Print the data
+_____
+
+# 2. Show an overview of the table with all data types
+_____(cats)
+
+# 3. The "weight" column has the incorrect data type __________.
+# The correct data type is: ____________.
+
+# 4. Correct the 4th weight data point with the mean of the two given values
+cats$weight[4] <- 2.35
+# print the data again to see the effect
+cats
+
+# 5. Convert the weight to the right data type
+cats$weight <- ______________(cats$weight)
+
+# Calculate the mean to test yourself
+mean(cats$weight)
+
+# If you see the correct mean value (and not NA), you did the exercise
+# correctly!
+
+
+
Instructions for the tasks
+
+
+
1. Print the data
+
+
Execute the first statement (read.csv(...)). Then print
+the data to the console.
+
+
+
+
+
+
+
+
+
+
+
Show the content of any variable by typing its name.
+
+
Solution to Challenge 1.1
+
+
Two correct solutions:
+
cats
+print(cats)
+
+
+
+
+
+
+
+
+
+
+
2. Overview of the data types
+
+
+
The data type of your data is as important as the data itself. Use a
+function we saw earlier to print out the data types of all columns of
+the cats table.
+
+
+
+
+
+
+
+
+
In the chapter “Data types” we saw two functions that can show data
+types. One printed just a single word, the data type name. The other
+printed a short form of the data type, and the first few values. We need
+the second here.
+
+
+
+
+
+
+
+
+
+
Challenge 1 (continued)
+
+
+
+
Solution to Challenge 1.2
+
str(cats)
+
+
+
3. Which data type do we need?
+
+
The shown data type is not the right one for this data (weight of a
+cat). Which data type do we need?
+
+
Why did the read.csv() function not choose the correct
+data type?
+
Fill in the gap in the comment with the correct data type for cat
+weight!
+
+
+
+
+
+
+
+
+
+
+
Scroll up to the section about the type
+hierarchy to review the available data types
+
+
+
+
+
+
+
+
+
+
+
Weight is expressed on a continuous scale (real numbers). The R data
+type for this is “double” (also known as “numeric”).
+
The fourth row has the value “2.3 or 2.4”. That is not one number but
+two, plus an English word. Therefore, the “character” data type is
+chosen. The whole column is now text, because all values in the same
+column have to be the same data type.
+
+
+
+
+
+
+
+
+
+
+
4. Correct the problematic value
+
+
+
The code to assign a new weight value to the problematic fourth row
+is given. Think first and then execute it: What will be the data type
+after assigning a number like in this example? You can check the data
+type after executing to see if you were right.
+
+
+
+
+
+
+
+
+
Revisit the hierarchy of data types when two different data types are
+combined.
+
+
+
+
+
+
+
+
+
+
Challenge 1 (continued)
+
+
+
+
Solution to challenge 1.4
+
The data type of the column “weight” is “character”. The assigned
+data type is “double”. Combining two data types yields the data type
+that is higher in the following hierarchy:
+
logical < integer < double < complex < character
+
Therefore, the column is still of type character! We need to manually
+convert it to “double”.
+
+
+
5. Convert the column “weight” to the correct data type
+
+
Cat weights are numbers. But the column does not have this data type
+yet. Coerce the column to floating point numbers.
+
+
+
+
+
+
+
+
+
+
The functions to convert data types start with as.. You
+can look for the function further up in the manuscript or use the
+RStudio auto-complete function: Type “as.” and then press
+the TAB key.
+
+
+
+
+
+
+
+
+
+
Challenge 1 (continued)
+
+
+
+
Solution to Challenge 1.5
+
There are two functions that are synonymous for historic reasons:
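The two functions are as.double() and as.numeric(). As a sketch, the stand-alone vector below (weight, a hypothetical stand-in for the corrected cats$weight column) shows that they give the same result.

```r
# Stand-in for the corrected weight column, still stored as text.
weight <- c("2.1", "5.0", "3.2", "2.35")

# Both functions parse the strings into doubles.
as_double <- as.double(weight)
as_numeric <- as.numeric(weight)
identical(as_double, as_numeric)

# With real numbers, the mean can be computed.
mean(as_numeric)
```

In the exercise template, either function name fills the gap in step 5.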
To change a single element, use the bracket on the other side of the
+arrow:
+
+
R
+
+
+sequence_example[1] <- 30
+sequence_example
+
+
+
OUTPUT
+
+
[1] 30 21 22 23 24 25
+
+
+
+
+
+
+
Challenge 2
+
+
+
Start by making a vector with the numbers 1 through 26. Then,
+multiply the vector by 2.
+
+
+
+
+
+
+
+
+
+
R
+
+
+x <- 1:26
+x <- x * 2
+
+
+
+
+
+
+
+
Lists
+
+
Another data structure you’ll want in your bag of tricks is the
+list. A list is simpler in some ways than the other types,
+because you can put anything you want in it. Remember everything in
+the vector must be of the same basic data type, but a list can have
+different data types:
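As a sketch, a list holding four different data types can be built with the list() function. The contents below are chosen to match the structure output shown in this section.

```r
# A list mixing a double, a character, a logical and a complex number.
list_example <- list(1, "a", TRUE, 1+4i)

# Printing a list shows each element with its own index.
list_example

# Each element keeps its own type.
sapply(list_example, typeof)
```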
When printing the object structure with str(), we see
+the data types of all elements:
+
+
R
+
+
+str(list_example)
+
+
+
OUTPUT
+
+
List of 4
+ $ : num 1
+ $ : chr "a"
+ $ : logi TRUE
+ $ : cplx 1+4i
+
+
What is the use of lists? They can organize data of different
+types. For example, you can organize different tables that
+belong together, similar to spreadsheets in Excel. But there are many
+other uses, too.
+
We will see another example that will maybe surprise you in the next
+chapter.
+
To retrieve one of the elements of a list, use the double
+bracket:
+
+
R
+
+
+list_example[[2]]
+
+
+
OUTPUT
+
+
[1] "a"
+
+
The elements of lists can also have names, which can
+be given by prepending them to the values, separated by an equals
+sign:
+
+
R
+
+
+another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE)
+another_list
This results in a named list. This gives our object a new
+capability: we can access single elements in an additional
+way!
+
+
R
+
+
+another_list$title
+
+
+
OUTPUT
+
+
[1] "Numbers"
+
+
+
Names
+
+
+
With names, we can give meaning to elements. It is the first time
+that we have not only the data, but also information that explains
+it. It is metadata that can be stuck to the object
+like a label. In R, this is called an attribute. Some
+attributes enable us to do more with our object, for example, like here,
+accessing an element by a self-defined name.
+
+
Accessing vectors and lists by name
+
+
We have already seen how to generate a named list. The way to
+generate a named vector is very similar. You have seen this function
+before:
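The definition of pizza_price is not shown here. A sketch of how such a named vector could be built with c(): the three names and the first price come from the output in this section, while the other two prices are placeholders chosen for illustration.

```r
# Named numeric vector built with c(), using name = value pairs.
# Only the pizzasubito price appears in the lesson output;
# the other two values are placeholders.
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)
pizza_price
```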
The way to retrieve elements is different, though:
+
+
R
+
+
+pizza_price["pizzasubito"]
+
+
+
OUTPUT
+
+
pizzasubito
+5.64
+
+
The approach used for the list does not work:
+
+
R
+
+
+pizza_price$pizzafresh
+
+
+
ERROR
+
+
Error in pizza_price$pizzafresh: $ operator is invalid for atomic vectors
+
+
It pays to remember this error message, because you will meet it
+in your own analyses. It means that you tried to access an
+element as if it were in a list, when it is actually in a vector.
+
+
+
Accessing and changing names
+
+
If you are only interested in the names, use the names()
+function:
+
+
R
+
+
+names(pizza_price)
+
+
+
OUTPUT
+
+
[1] "pizzasubito" "pizzafresh" "callapizza"
+
+
We have seen how to access and change single elements of a vector.
+The same is possible for names:
What is the data type of the names of pizza_price? You
+can find out using the str() or typeof()
+functions.
+
+
+
+
+
+
+
+
+
+
You get the names of an object by wrapping the object name inside
+names(...). Similarly, you get the data type of the names
+by again wrapping the whole code in typeof(...):
+
typeof(names(pizza_price))
+
Alternatively, use a new variable if this is easier for you to
+read:
+
n<-names(pizza_price)
+typeof(n)
+
+
+
+
+
+
+
+
+
+
Challenge 4
+
+
+
Instead of just changing some of the names a vector/list already has,
+you can also set all names of an object by writing code like (replace
+ALL CAPS text):
+
names(OBJECT)<-CHARACTER_VECTOR
+
Create a vector that gives the number for each letter in the
+alphabet!
+
+
Generate a vector called letter_no with the sequence of
+numbers from 1 to 26!
+
R has a built-in object called LETTERS. It is a
+26-character vector, from A to Z. Set the names of the number sequence
+to these 26 letters.
+
Test yourself by calling letter_no["B"], which should
+give you the number 2!
+
+
+
+
+
+
+
+
+
+
letter_no<-1:26# or seq(1,26)
+names(letter_no)<-LETTERS
+letter_no["B"]
+
+
+
+
+
+
Data frames
+
+
+
We met data frames at the very beginning of this lesson; they
+represent a table of data. We didn’t go into much detail with
+our example cat data frame:
We can now understand something a bit surprising in our data.frame;
+what happens if we run:
+
+
R
+
+
+typeof(cats)
+
+
+
OUTPUT
+
+
[1] "list"
+
+
We see that data.frames look like lists ‘under the hood’. Think back
+to what we said lists can be used for:
+
+
Lists organize data of different types
+
+
Columns of a data frame are vectors of different types, that are
+organized by belonging to the same table.
+
A data.frame is really a list of vectors. It is a special list in
+which all the vectors must have the same length.
+
How is this “special”-ness written into the object, so that R does
+not treat it like any other list, but as a table?
+
+
R
+
+
+class(cats)
+
+
+
OUTPUT
+
+
[1] "data.frame"
+
+
A class, just like names, is an attribute attached
+to the object. It tells us what this object means for humans.
+
You might wonder: Why do we need another
+what-type-of-object-is-this-function? We already have
+typeof()? That function tells us how the object is
+constructed in the computer. The class is
+the meaning of the object for humans. Consequently,
+what typeof() returns is fixed in R (mainly the
+five data types), whereas the output of class() is
+diverse and extendable by R packages.
+
In our cats example, we have a character, a double and a
+logical variable. As we have seen already, each column of a data.frame is
+a vector.
+
+
R
+
+
+cats$coat
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
+
R
+
+
+cats[,1]
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
+
R
+
+
+typeof(cats[,1])
+
+
+
OUTPUT
+
+
[1] "character"
+
+
+
R
+
+
+str(cats[,1])
+
+
+
OUTPUT
+
+
chr [1:3] "calico" "black" "tabby"
+
+
Each row is an observation of different variables, itself a
+data.frame, and thus can be composed of elements of different types.
There are several subtly different ways to call variables,
+observations and elements from data.frames:
+
+
cats[1]
+
cats[[1]]
+
cats$coat
+
cats["coat"]
+
cats[1, 1]
+
cats[, 1]
+
cats[1, ]
+
+
Try out these examples and explain what is returned by each one.
+
Hint: Use the function typeof() to examine what
+is returned in each case.
+
+
+
+
+
+
+
+
+
+
R
+
+
+cats[1]
+
+
+
OUTPUT
+
+
coat
+1 calico
+2 black
+3 tabby
+
+
We can think of a data frame as a list of vectors. The single bracket
+[1] returns the first slice of the list, as another list.
+In this case it is the first column of the data frame.
+
+
R
+
+
+cats[[1]]
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
The double bracket [[1]] returns the contents of the list
+item. In this case it is the contents of the first column, a
+vector of type character.
+
+
R
+
+
+cats$coat
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
This example uses the $ character to address items by
+name. coat is the first column of the data frame, again a
+vector of type character.
+
+
R
+
+
+cats["coat"]
+
+
+
OUTPUT
+
+
coat
+1 calico
+2 black
+3 tabby
+
+
Here we are using a single bracket ["coat"], replacing the
+index number with the column name. Like example 1, the returned object
+is a list.
+
+
R
+
+
+cats[1, 1]
+
+
+
OUTPUT
+
+
[1] "calico"
+
+
This example uses a single bracket, but this time we provide row and
+column coordinates. The returned object is the value in row 1, column 1.
+The object is a vector of type character.
+
+
R
+
+
+cats[, 1]
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
Like the previous example, we use single brackets and provide row and
+column coordinates. The row coordinate is not specified, so R interprets
+this missing value as all the elements in this column and
+returns them as a vector.
+
+
R
+
+
+cats[1, ]
+
+
+
OUTPUT
+
+
coat weight likes_string
+1 calico 2.1 TRUE
+
+
Again we use the single bracket with row and column coordinates. The
+column coordinate is not specified. The return value is a one-row data
+frame, that is, a list containing all the values in the first row.
+
+
+
+
+
+
+
+
+
+
Tip: Renaming data frame columns
+
+
+
Data frames have column names, which can be accessed with the
+names() function.
+
+
R
+
+
+names(cats)
+
+
+
OUTPUT
+
+
[1] "coat" "weight" "likes_string"
+
+
If you want to rename the second column of cats, you can
+assign a new name to the second element of names(cats).
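A minimal sketch of that renaming step. The cats data frame is rebuilt here from the values shown in this lesson so that the example is self-contained, and the new name weight_kg is only an illustrative choice.

```r
# Stand-in for the lesson's cats data frame.
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(TRUE, FALSE, TRUE))

# names(cats) is a character vector, so a single element can be replaced.
names(cats)[2] <- "weight_kg"
names(cats)
```

Only the second name changes; the data in the column is untouched.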
Because a matrix is a vector with added dimension attributes,
+length gives you the total number of elements in the
+matrix.
+
+
+
+
+
+
+
+
+
+
Challenge 7
+
+
+
Make another matrix, this time containing the numbers 1:50, with 5
+columns and 10 rows. Did the matrix function fill your
+matrix by column, or by row, as its default behaviour? See if you can
+figure out how to change this. (hint: read the documentation for
+matrix!)
+
+
+
+
+
+
+
+
+
+
+
R
+
+
+x<-matrix(1:50, ncol=5, nrow=10)
+x<-matrix(1:50, ncol=5, nrow=10, byrow =TRUE)# to fill by row
+
+
+
+
+
+
+
+
+
+
+
Challenge 8
+
+
+
Create a list of length two containing a character vector for each of
+the sections in this part of the workshop:
+
+
Data types
+
Data structures
+
+
Populate each character vector with the names of the data types and
+data structures we’ve seen so far.
Note: it’s nice to make a list in big writing on the board or taped
+to the wall listing all of these types and structures - leave it up for
+the rest of the workshop to remind people of the importance of these
+basics.
+
+
+
+
+
+
+
+
+
+
Challenge 9
+
+
+
Consider the R output of the matrix below:
+
+
OUTPUT
+
+
[,1] [,2]
+[1,] 4 1
+[2,] 9 5
+[3,] 10 7
+
+
What was the correct command used to write this matrix? Examine each
+command and try to figure out the correct one before typing them. Think
+about what matrices the other commands will produce.
Display basic properties of data frames including size and class of
+the columns, names, and first few rows.
+
+
+
+
+
+
+
At this point, you’ve seen it all: in the last lesson, we toured all
+the basic data types and data structures in R. Everything you do will be
+a manipulation of those tools. But most of the time, the star of the
+show is the data frame—the table that we created by loading information
+from a csv file. In this lesson, we’ll learn a few more things about
+working with data frames.
+
Adding columns and rows in data frames
+
+
+
We already learned that the columns of a data frame are vectors, so
+that our data are consistent in type throughout the columns. As such, if
+we want to add a new column, we can start by making a new vector:
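One way this step could look, assuming a three-row cats data frame like the one in this lesson (rebuilt here so the sketch runs on its own) and the age values that appear in the output further down:

```r
# Stand-in for the lesson's cats data frame.
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

age <- c(2, 3, 5)          # one value per existing row
cats <- cbind(cats, age)   # the new column is named after the variable
cats
```

The vector must have exactly as many elements as the data frame has rows, otherwise cbind() stops with an error.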
coat weight likes_string age
+1 calico 2.1 1 2
+2 black 5.0 0 3
+3 tabby 3.2 1 5
+
+
Notice the comma with nothing after it to indicate that we want to
+drop the entire fourth row.
+
Note: we could also remove several rows at once by putting the row
+numbers inside of a vector, for example:
+cats[c(-3,-4), ]
+
Removing columns
+
+
+
We can also remove columns in our data frame. What if we want to
+remove the column “age”? We can remove it in two ways: by its
+position (index number) or by its name.
Notice the comma with nothing before it, indicating we want to keep
+all of the rows.
+
Alternatively, we can drop the column by using its name and the
+%in% operator. The %in% operator goes through
+each element of its left argument, in this case the names of
+cats, and asks, “Does this element occur in the second
+argument?”
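A sketch of dropping a column by name with %in%, assuming a cats data frame that already has an age column (rebuilt here so the example is self-contained). The names drop and cats_no_age are illustrative.

```r
# Stand-in for the lesson's cats data frame, including the age column.
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1),
                   age = c(2, 3, 5))

drop <- names(cats) %in% c("age")   # logical vector: TRUE only for "age"
cats_no_age <- cats[, !drop]        # keep every column that is not flagged
names(cats_no_age)
```

Because the selection is made by name, it keeps working even if the column order changes later.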
The key to remember when adding data to a data frame is that
+columns are vectors and rows are lists. We can also glue two
+data frames together with rbind:
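The following sketch illustrates both ideas, under the assumption of a three-row cats data frame like the one in this lesson. The tortoiseshell row and the name doubled are made up for the example.

```r
# Stand-in for the lesson's cats data frame. stringsAsFactors = FALSE keeps
# coat as character in R versions before 4.0.0 as well.
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1),
                   stringsAsFactors = FALSE)

# A row is a list: one value per column, in column order.
cats <- rbind(cats, list("tortoiseshell", 3.3, 1))

# Two data frames with the same columns can be stacked the same way.
doubled <- rbind(cats, cats)
dim(doubled)
```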
You can create a new data frame right from within R with the
+following syntax:
+
+
R
+
+
+df<-data.frame(id =c("a", "b", "c"),
+ x =1:3,
+ y =c(TRUE, TRUE, FALSE))
+
+
Make a data frame that holds the following information for
+yourself:
+
+
first name
+
last name
+
lucky number
+
+
Then use rbind to add an entry for the people sitting
+beside you. Finally, use cbind to add a column with each
+person’s answer to the question, “Is it time for coffee break?”
So far, you have seen the basics of manipulating data frames with our
+cat data; now let’s use those skills to digest a more realistic dataset.
+Let’s read in the gapminder dataset that we downloaded
+previously:
+
+
R
+
+
+gapminder<-read.csv("data/gapminder_data.csv")
+
+
+
+
+
+
+
Miscellaneous Tips
+
+
+
+
Another type of file you might encounter is the tab-separated value
+file (.tsv). To specify a tab as the separator, pass sep = "\t" to
+read.csv(), or use read.delim().
+
Files can also be downloaded directly from the Internet into a
+local folder of your choice onto your computer using the
+download.file function. The read.csv function
+can then be executed to read the downloaded file from the download
+location, for example,
Alternatively, you can also read in files directly into R from the
+Internet by replacing the file paths with a web address in
+read.csv. One should note that in doing this no local copy
+of the csv file is first saved onto your computer. For example,
You can read directly from excel spreadsheets without converting
+them to plain text first by using the readxl
+package.
+
The argument “stringsAsFactors” tells R whether to
+read strings as factors or as character strings. In R version
+4.0.0 and later, all strings are read in as characters by default, but in
+earlier versions of R, strings are read in as factors by default. For
+more information, see the call-out in the
+previous episode.
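As an illustration of the tab-separated tip above, the sketch below writes a tiny made-up .tsv file to a temporary location and reads it back in two ways. The file contents and variable names are hypothetical.

```r
# Write a small tab-separated file to a temporary path (made-up data).
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("id\tx", "a\t1", "b\t2"), tsv_path)

from_csv   <- read.csv(tsv_path, sep = "\t")   # read.csv with an explicit tab separator
from_delim <- read.delim(tsv_path)             # read.delim uses a tab separator by default

identical(from_csv, from_delim)
```

Both calls should produce the same data frame, so the choice between them is mostly a matter of readability.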
+
+
+
+
+
Let’s investigate gapminder a bit; the first thing we should always
+do is check out what the data looks like with str:
+
+
R
+
+
+str(gapminder)
+
+
+
OUTPUT
+
+
'data.frame': 1704 obs. of 6 variables:
+ $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp : num 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num 779 821 853 836 740 ...
+
+
An additional method for examining the structure of gapminder is to
+use the summary function. This function can be used on
+various objects in R. For data frames, summary yields a
+numeric, tabular, or descriptive summary of each column. Numeric or
+integer columns are described by descriptive statistics (quartiles
+and mean), and character columns by their length, class, and mode.
+
+
R
+
+
+summary(gapminder)
+
+
+
OUTPUT
+
+
country year pop continent
+ Length:1704 Min. :1952 Min. :6.001e+04 Length:1704
+ Class :character 1st Qu.:1966 1st Qu.:2.794e+06 Class :character
+ Mode :character Median :1980 Median :7.024e+06 Mode :character
+ Mean :1980 Mean :2.960e+07
+ 3rd Qu.:1993 3rd Qu.:1.959e+07
+ Max. :2007 Max. :1.319e+09
+ lifeExp gdpPercap
+ Min. :23.60 Min. : 241.2
+ 1st Qu.:48.20 1st Qu.: 1202.1
+ Median :60.71 Median : 3531.8
+ Mean :59.47 Mean : 7215.3
+ 3rd Qu.:70.85 3rd Qu.: 9325.5
+ Max. :82.60 Max. :113523.1
+
+
Along with the str and summary functions,
+we can examine individual columns of the data frame with our
+typeof function:
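A sketch of that check, using a small stand-in data frame (called gap_small here, a hypothetical name) built from the first two gapminder rows shown in this lesson, so that it runs without the CSV file:

```r
# Two-row stand-in with the same column types as the first three gapminder columns.
gap_small <- data.frame(country = c("Afghanistan", "Afghanistan"),
                        year = c(1952L, 1957L),
                        pop = c(8425333, 9240934),
                        stringsAsFactors = FALSE)

typeof(gap_small$year)      # stored as integer
typeof(gap_small$pop)       # stored as double
typeof(gap_small$country)   # stored as character
```

With the full dataset loaded, the same calls on gapminder$year and the other columns should agree with the types reported by str(gapminder).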
We can also interrogate the data frame for information about its
+dimensions; remembering that str(gapminder) said there were
+1704 observations of 6 variables in gapminder, what do you think the
+following will produce, and why?
+
+
R
+
+
+length(gapminder)
+
+
+
OUTPUT
+
+
[1] 6
+
+
A fair guess would have been to say that the length of a data frame
+would be the number of rows it has (1704), but this is not the case;
+remember, a data frame is a list of vectors and factors:
+
+
R
+
+
+typeof(gapminder)
+
+
+
OUTPUT
+
+
[1] "list"
+
+
When length gave us 6, it’s because gapminder is built
+out of a list of 6 columns. To get the number of rows and columns in our
+dataset, try:
+
+
R
+
+
+nrow(gapminder)
+
+
+
OUTPUT
+
+
[1] 1704
+
+
+
R
+
+
+ncol(gapminder)
+
+
+
OUTPUT
+
+
[1] 6
+
+
Or, both at once:
+
+
R
+
+
+dim(gapminder)
+
+
+
OUTPUT
+
+
[1] 1704 6
+
+
We’ll also likely want to know what the titles of all the columns
+are, so we can ask for them later:
At this stage, it’s important to ask ourselves if the structure R is
+reporting matches our intuition or expectations; do the basic data types
+reported for each column make sense? If not, we need to sort any
+problems out now before they turn into bad surprises down the road,
+using what we’ve learned about how R interprets data, and the importance
+of strict consistency in how we record our data.
+
Once we’re happy that the data types and structures seem reasonable,
+it’s time to start digging into our data proper. Check out the first few
+lines:
+
+
R
+
+
+head(gapminder)
+
+
+
OUTPUT
+
+
country year pop continent lifeExp gdpPercap
+1 Afghanistan 1952 8425333 Asia 28.801 779.4453
+2 Afghanistan 1957 9240934 Asia 30.332 820.8530
+3 Afghanistan 1962 10267083 Asia 31.997 853.1007
+4 Afghanistan 1967 11537966 Asia 34.020 836.1971
+5 Afghanistan 1972 13079460 Asia 36.088 739.9811
+6 Afghanistan 1977 14880372 Asia 38.438 786.1134
+
+
+
+
+
+
+
Challenge 2
+
+
+
It’s good practice to also check the last few lines of your data and
+some in the middle. How would you do this?
+
Searching for ones specifically in the middle isn’t too hard, but we
+could ask for a few lines at random. How would you code this?
+
+
+
+
+
+
+
+
+
Checking the last few lines is relatively simple, as R already has a
+function for this:
+
+
R
+
+
+tail(gapminder)
+tail(gapminder, n =15)
+
+
What about a few arbitrary rows just in case something is odd in the
+middle?
+
+
Tip: There are several ways to achieve this.
+
+
The solution here presents one form of using nested functions, i.e. a
+function passed as an argument to another function. This might sound
+like a new concept, but you are already using it! Remember
+my_dataframe[rows, cols] will print to screen your data frame with the
+number of rows and columns you asked for (although you might have asked
+for a range or named columns for example). How would you get the last
+row if you don’t know how many rows your data frame has? R has a
+function for this. What about getting a (pseudorandom) sample? R also
+has a function for this.
+
+
R
+
+
+gapminder[sample(nrow(gapminder), 5), ]
+
+
+
+
+
+
+
To make sure our analysis is reproducible, we should put the code
+into a script file so we can come back to it later.
+
+
+
+
+
+
Challenge 3
+
+
+
Go to file -> new file -> R script, and write an R script to
+load in the gapminder dataset. Put it in the scripts/
+directory and add it to version control.
+
Run the script using the source function, using the file
+path as its argument (or by pressing the “source” button in
+RStudio).
+
+
+
+
+
+
+
+
+
The source function can be used to use a script within a
+script. Assume you would like to load the same type of file over and
+over again and therefore you need to specify the arguments to fit the
+needs of your file. Instead of writing the necessary argument again and
+again you could just write it once and save it as a script. Then, you
+can use source("Your_Script_containing_the_load_function")
+in a new script to use the function of that script without writing
+everything again. Check out ?source to find out more.
To run the script and load the data into the gapminder
+variable:
+
+
R
+
+
+source(file ="scripts/load-gapminder.R")
+
+
+
+
+
+
+
+
+
+
+
Challenge 4
+
+
+
Read the output of str(gapminder) again; this time, use
+what you’ve learned about lists and vectors, as well as the output of
+functions like colnames and dim to explain
+what everything that str prints out for gapminder means. If
+there are any parts you can’t interpret, discuss with your
+neighbors!
+
+
+
+
+
+
+
+
+
The object gapminder is a data frame with columns
+
+
+country and continent are character
+strings.
+
+year is an integer vector.
+
+pop, lifeExp, and gdpPercap
+are numeric vectors.
+
+
+
+
+
+
+
+
+
+
+
Keypoints
+
+
+
+
Use cbind() to add a new column to a data frame.
+
Use rbind() to add a new row to a data frame.
+
Remove rows from a data frame.
+
Use str(), summary(), nrow(),
+ncol(), dim(), colnames(),
+rownames(), head(), and typeof()
+to understand the structure of a data frame.
+
Read in a csv file using read.csv().
+
Understand what length() of a data frame
+represents.
In R, simple vectors containing character strings, numbers, or
+logical values are called atomic vectors because they can’t be
+further simplified.
+
+
+
+
So now that we’ve created a dummy vector to play with, how do we get
+at its contents?
+
Accessing elements using their indices
+
+
+
To extract elements of a vector we can give their corresponding
+index, starting from one:
+
+
R
+
+
+x[1]
+
+
+
OUTPUT
+
+
a
+5.4
+
+
+
R
+
+
+x[4]
+
+
+
OUTPUT
+
+
d
+4.8
+
+
It may look different, but the square brackets operator is a
+function. For vectors (and matrices), it means “get me the nth
+element”.
+
We can ask for multiple elements at once:
+
+
R
+
+
+x[c(1, 3)]
+
+
+
OUTPUT
+
+
a c
+5.4 7.1
+
+
Or slices of the vector:
+
+
R
+
+
+x[1:4]
+
+
+
OUTPUT
+
+
a b c d
+5.4 6.2 7.1 4.8
+
+
The : operator creates a sequence of numbers from the
+left element to the right.
+
+
R
+
+
+1:4
+
+
+
OUTPUT
+
+
[1] 1 2 3 4
+
+
+
R
+
+
+c(1, 2, 3, 4)
+
+
+
OUTPUT
+
+
[1] 1 2 3 4
+
+
We can ask for the same element multiple times:
+
+
R
+
+
+x[c(1,1,3)]
+
+
+
OUTPUT
+
+
a a c
+5.4 5.4 7.1
+
+
If we ask for an index beyond the length of the vector, R will return
+a missing value:
+
+
R
+
+
+x[6]
+
+
+
OUTPUT
+
+
<NA>
+ NA
+
+
This is a vector of length one containing an NA, whose
+name is also NA.
+
If we ask for the 0th element, we get an empty vector:
+
+
R
+
+
+x[0]
+
+
+
OUTPUT
+
+
named numeric(0)
+
+
+
+
+
+
+
Vector numbering in R starts at 1
+
+
+
In many programming languages (C and Python, for example), the first
+element of a vector has an index of 0. In R, the first element is 1.
+
+
+
+
Skipping and removing elements
+
+
+
If we use a negative number as the index of a vector, R will return
+every element except for the one specified:
+
+
R
+
+
+x[-2]
+
+
+
OUTPUT
+
+
a c d e
+5.4 7.1 4.8 7.5
+
+
We can skip multiple elements:
+
+
R
+
+
+x[c(-1, -5)]# or x[-c(1,5)]
+
+
+
OUTPUT
+
+
b c d
+6.2 7.1 4.8
+
+
+
+
+
+
+
Tip: Order of operations
+
+
+
A common trip up for novices occurs when trying to skip slices of a
+vector. It’s natural to try to negate a sequence like so:
+
+
R
+
+
+x[-1:3]
+
+
This gives a somewhat cryptic error:
+
+
ERROR
+
+
Error in x[-1:3]: only 0's may be mixed with negative subscripts
+
+
But remember the order of operations. : is really a
+function. It takes its first argument as -1, and its second as 3, so
+generates the sequence of numbers: c(-1, 0, 1, 2, 3).
+
The correct solution is to wrap that function call in brackets, so
+that the - operator applies to the result:
+
+
R
+
+
+x[-(1:3)]
+
+
+
OUTPUT
+
+
d e
+4.8 7.5
+
+
+
+
+
To remove elements from a vector, we need to assign the result back
+into the variable:
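A minimal sketch of that assignment, using the same named vector x that appears elsewhere in this section:

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

x[-4]        # returns a copy without the fourth element; x itself is unchanged
x <- x[-4]   # assigning the result back is what actually removes it from x
x
```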
Come up with at least 2 different commands that will produce the
+following output:
+
+
OUTPUT
+
+
b c d
+6.2 7.1 4.8
+
+
After you find 2 different commands, compare notes with your
+neighbour. Did you have different strategies?
+
+
+
+
+
+
+
+
+
+
R
+
+
+x[2:4]
+
+
+
OUTPUT
+
+
b c d
+6.2 7.1 4.8
+
+
+
R
+
+
+x[-c(1,5)]
+
+
+
OUTPUT
+
+
b c d
+6.2 7.1 4.8
+
+
+
R
+
+
+x[c(2,3,4)]
+
+
+
OUTPUT
+
+
b c d
+6.2 7.1 4.8
+
+
+
+
+
+
Subsetting by name
+
+
+
We can extract elements by using their name, instead of extracting by
+index:
+
+
R
+
+
+x<-c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5)# we can name a vector 'on the fly'
+x[c("a", "c")]
+
+
+
OUTPUT
+
+
a c
+5.4 7.1
+
+
This is usually a much more reliable way to subset objects: the
+position of various elements can often change when chaining together
+subsetting operations, but the names will always remain the same!
+
Subsetting through other logical operations
+
+
+
We can also use any logical vector to subset:
+
+
R
+
+
+x[c(FALSE, FALSE, TRUE, FALSE, TRUE)]
+
+
+
OUTPUT
+
+
c e
+7.1 7.5
+
+
Since comparison operators (e.g. >,
+<, ==) evaluate to logical vectors, we can
+also use them to succinctly subset vectors: the following statement
+gives the same result as the previous one.
+
+
R
+
+
+x[x>7]
+
+
+
OUTPUT
+
+
c e
+7.1 7.5
+
+
Breaking it down, this statement first evaluates x>7,
+generating a logical vector
+c(FALSE, FALSE, TRUE, FALSE, TRUE), and then selects the
+elements of x corresponding to the TRUE
+values.
+
We can use == to mimic the previous method of indexing
+by name (remember you have to use == rather than
+= for comparisons):
+
+
R
+
+
+x[names(x)=="a"]
+
+
+
OUTPUT
+
+
a
+5.4
+
+
+
+
+
+
+
Tip: Combining logical conditions
+
+
+
We often want to combine multiple logical criteria. For example, we
+might want to find all the countries that are located in Asia
+or Europe and have life expectancies
+within a certain range. Several operations for combining logical vectors
+exist in R:
+
+
+&, the “logical AND” operator: returns
+TRUE if both the left and right are TRUE.
+
+|, the “logical OR” operator: returns
+TRUE, if either the left or right (or both) are
+TRUE.
+
+
You may sometimes see && and ||
+instead of & and |. These two-character
+operators work on single values only: older versions of R looked at the
+first element of each vector and ignored the rest, and R 4.3.0 and later
+raise an error if either side has length greater than one. In general you
+should not use the two-character operators in data analysis; save them
+for programming, i.e. deciding whether to execute a statement.
+
+
+!, the “logical NOT” operator: converts
+TRUE to FALSE and FALSE to
+TRUE. It can negate a single logical condition (e.g.
+!TRUE becomes FALSE), or a whole vector of
+conditions (e.g. !c(TRUE, FALSE) becomes
+c(FALSE, TRUE)).
+
+
Additionally, you can compare the elements within a single vector
+using the all function (which returns TRUE if
+every element of the vector is TRUE) and the
+any function (which returns TRUE if one or
+more elements of the vector are TRUE).
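The operators and functions from this tip can be tried out on the named vector x used in this section. The thresholds below are arbitrary and chosen only for illustration.

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

x[x > 5 & x < 7]   # AND: elements strictly between 5 and 7
x[x < 5 | x > 7]   # OR: elements below 5 or above 7
x[!(x > 7)]        # NOT: elements that are not above 7

any(x > 7)         # is at least one element above 7?
all(x > 7)         # are all elements above 7?
```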
Write a subsetting command to return the values in x that are greater
+than 4 and less than 7.
+
+
+
+
+
+
+
+
+
+
R
+
+
+x_subset<-x[x<7&x>4]
+print(x_subset)
+
+
+
OUTPUT
+
+
a b d
+5.4 6.2 4.8
+
+
+
+
+
+
+
+
+
+
+
Tip: Non-unique names
+
+
+
You should be aware that it is possible for multiple elements in a
+vector to have the same name. (For a data frame, columns can have the
+same name — although R tries to avoid this — but row names must be
+unique.) Consider these examples:
+
+
R
+
+
+x<-1:3
+x
+
+
+
OUTPUT
+
+
[1] 1 2 3
+
+
+
R
+
+
+names(x)<-c('a', 'a', 'a')
+x
+
+
+
OUTPUT
+
+
a a a
+1 2 3
+
+
+
R
+
+
+x['a']# only returns first value
+
+
+
OUTPUT
+
+
a
+1
+
+
+
R
+
+
+x[names(x)=='a']# returns all three values
+
+
+
OUTPUT
+
+
a a a
+1 2 3
+
+
+
+
+
+
+
+
+
+
Tip: Getting help for operators
+
+
+
Remember you can search for help on operators by wrapping them in
+quotes: help("%in%") or ?"%in%".
+
+
+
+
Skipping named elements
+
+
+
Skipping or removing named elements is a little harder. If we try to
+skip one named element by negating the string, R complains (slightly
+obscurely) that it doesn’t know how to take the negative of a
+string:
+
+
R
+
+
+x<-c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5)# we start again by naming a vector 'on the fly'
+x[-"a"]
+
+
+
ERROR
+
+
Error in -"a": invalid argument to unary operator
+
+
However, we can use the != (not-equals) operator to
+construct a logical vector that will do what we want:
+
+
R
+
+
+x[names(x)!="a"]
+
+
+
OUTPUT
+
+
b c d e
+6.2 7.1 4.8 7.5
+
+
Skipping multiple named indices is a little bit harder still. Suppose
+we want to drop the "a" and "c" elements, so
+we try this:
+
+
R
+
+
+x[names(x)!=c("a","c")]
+
+
+
WARNING
+
+
Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length
+
+
+
OUTPUT
+
+
b c d e
+6.2 7.1 4.8 7.5
+
+
R did something, but it gave us a warning that we ought to
+pay attention to - and it apparently gave us the wrong answer
+(the "c" element is still included in the vector)!
+
So what does != actually do in this case? That’s an
+excellent question.
+
+
Recycling
+
+
Let’s take a look at the comparison component of this code:
+
+
R
+
+
+names(x)!=c("a", "c")
+
+
+
WARNING
+
+
Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length
+
+
+
OUTPUT
+
+
[1] FALSE TRUE TRUE TRUE TRUE
+
+
Why does R give TRUE as the third element of this
+vector, when names(x)[3] != "c" is obviously false? When
+you use !=, R tries to compare each element of the left
+argument with the corresponding element of its right argument. What
+happens when you compare vectors of different lengths?
+
When one vector is shorter than the other, it gets
+recycled:
+
In this case R repeats c("a", "c") as
+many times as necessary to match names(x), i.e. we get
+c("a","c","a","c","a"). Since the recycled "a"
+doesn’t match the third element of names(x), the value of
+!= is TRUE. Because in this case the longer
+vector length (5) isn’t a multiple of the shorter vector length (2), R
+printed a warning message. If we had been unlucky and
+names(x) had contained six elements, R would
+silently have done the wrong thing (i.e., not what we intended
+it to do). This recycling rule can introduce subtle, hard-to-find
+bugs!
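The silent case can be reproduced with a six-element vector. In this sketch, y, keep_wrong, and keep_right are hypothetical names introduced for the example.

```r
y <- c(a = 1, b = 2, c = 3, d = 4, e = 5, f = 6)

# c("a", "c") is recycled to c("a", "c", "a", "c", "a", "c").
# Six is a multiple of two, so R gives no warning here.
keep_wrong <- names(y) != c("a", "c")

# %in% checks each name against every element of the right-hand vector.
keep_right <- !(names(y) %in% c("a", "c"))

y[keep_wrong]   # "c" is still present
y[keep_right]   # both "a" and "c" are gone
```

The first result looks plausible at a glance, which is what makes this kind of bug hard to spot.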
+
The way to get R to do what we really want (match each
+element of the left argument with all of the elements of the
+right argument) is to use the %in% operator. The
+%in% operator goes through each element of its left
+argument, in this case the names of x, and asks, “Does this
+element occur in the second argument?”. Here, since we want to
+exclude values, we also need a ! operator to
+change “in” to “not in”:
+
+
R
+
+
+x[!names(x)%in%c("a","c")]
+
+
+
OUTPUT
+
+
b d e
+6.2 4.8 7.5
+
+
+
+
+
+
+
Challenge 3
+
+
+
Selecting elements of a vector that match any of a list of components
+is a very common data analysis task. For example, the gapminder data set
+contains country and continent variables, but
+no information between these two scales. Suppose we want to pull out
+information from southeast Asia: how do we set up an operation to
+produce a logical vector that is TRUE for all of the
+countries in southeast Asia and FALSE otherwise?
+
Suppose you have these data:
+
+
R
+
+
+seAsia<-c("Myanmar","Thailand","Cambodia","Vietnam","Laos")
+## read in the gapminder data that we downloaded in episode 2
+gapminder<-read.csv("data/gapminder_data.csv", header=TRUE)
+## extract the `country` column from a data frame (we'll see this later);
+## convert from a factor to a character;
+## and get just the non-repeated elements
+countries<-unique(as.character(gapminder$country))
+
+
There’s a wrong way (using only ==), which will give you
+a warning; a clunky way (using the logical operators == and
+|); and an elegant way (using %in%). See
+whether you can come up with all three and explain how they (don’t)
+work.
+
+
+
+
+
+
+
+
+
+
The wrong way to do this problem is
+countries==seAsia. This gives a warning
+("In countries == seAsia : longer object length is not a multiple of shorter object length")
+and the wrong answer (a vector of all FALSE values),
+because none of the recycled values of seAsia happen to
+line up correctly with matching values in countries.
+
The clunky (but technically correct) way to do this
+problem is
(or countries==seAsia[1] | countries==seAsia[2] | ...).
+This gives the correct values, but hopefully you can see how awkward it
+is (what if we wanted to select countries from a much longer list?).
+
+
The best way to do this problem is
+countries %in% seAsia, which is both correct and easy to
+type (and read).
+
+
+
+
+
+
+
Handling special values
+
+
+
At some point you will encounter functions in R that cannot handle
+missing, infinite, or undefined data.
+
There are a number of special functions you can use to filter out
+this data:
+
+
+is.na will flag all positions in a vector, matrix, or
+data.frame containing NA (or NaN)
+
Likewise, is.nan and is.infinite will do
+the same for NaN and Inf.
+
+is.finite will flag all positions in a vector,
+matrix, or data.frame that do not contain NA,
+NaN or Inf.
+
+na.omit will filter out all missing values from a
+vector
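The functions listed above can be compared side by side on a small vector. The vector v below is made up for illustration.

```r
v <- c(1, NA, NaN, Inf, -Inf, 4)

is.na(v)          # TRUE for NA and for NaN
is.nan(v)         # TRUE only for NaN
is.infinite(v)    # TRUE for Inf and -Inf
is.finite(v)      # TRUE only for ordinary numbers

v[is.finite(v)]   # keep only the finite values
na.omit(v)        # drops NA and NaN, but keeps Inf and -Inf
```

Each is.* function returns a logical vector of the same length as v, so the result can be used directly for subsetting.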
+
Factor subsetting
+
+
+
Now that we’ve explored the different ways to subset vectors, how do
+we subset the other data structures?
+
Factor subsetting works the same way as vector subsetting.
Unlike vectors, if we try to access a row or column outside of the
+matrix, R will throw an error:
+
+
R
+
+
+m[, c(3,6)]
+
+
+
ERROR
+
+
Error in m[, c(3, 6)]: subscript out of bounds
+
+
+
+
+
+
+
Tip: Higher dimensional arrays
+
+
+
When dealing with multi-dimensional arrays, each argument to
+[ corresponds to a dimension. For example, for a 3D array, the
+three arguments correspond to the rows, columns, and depth
+dimension.
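A short sketch with a 3D array. The array arr and its dimensions are chosen only for illustration.

```r
arr <- array(1:24, dim = c(2, 3, 4))   # 2 rows, 3 columns, 4 layers

arr[1, 2, 3]      # a single element: row 1, column 2, layer 3
arr[, , 1]        # blank row and column arguments select the whole first layer
dim(arr[, , 1])   # the result is an ordinary 2 x 3 matrix
```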
+
+
+
+
Because matrices are vectors, we can also subset using only one
+argument:
+
+
R
+
+
+m[5]
+
+
+
OUTPUT
+
+
[1] 0.3295078
+
+
This usually isn’t useful, and is often confusing to read. However, it is
+useful to note that matrices are laid out in column-major
+format by default. That is, the elements of the vector are arranged
+column-wise:
+
+
R
+
+
+matrix(1:6, nrow=2, ncol=3)
+
+
+
OUTPUT
+
+
[,1] [,2] [,3]
+[1,] 1 3 5
+[2,] 2 4 6
+
+
If you wish to populate the matrix by row, use
+byrow=TRUE:
+
+
R
+
+
+matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
+
+
+
OUTPUT
+
+
[,1] [,2] [,3]
+[1,] 1 2 3
+[2,] 4 5 6
+
+
Matrices can also be subsetted using their rownames and column names
+instead of their row and column indices.
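A sketch of name-based matrix subsetting. The matrix m_named and its row and column names are hypothetical.

```r
m_named <- matrix(1:6, nrow = 2, ncol = 3,
                  dimnames = list(c("r1", "r2"), c("a", "b", "c")))

m_named["r2", "b"]        # a single element, addressed by row and column name
m_named[, c("a", "c")]    # all rows, two columns selected by name
```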
Which of the following commands will extract the values 11 and
+14?
+
+
A. m[2,4,2,5]
+
B. m[2:5]
+
C. m[4:5,2]
+
D. m[2,c(4,5)]
+
+
+
+
+
+
+
+
+
D
+
+
+
+
+
List subsetting
+
+
+
Now we’ll introduce some new subsetting operators. There are three
+functions used to subset lists. We’ve already seen these when learning
+about atomic vectors and matrices: [, [[, and
+$.
+
Using [ will always return a list. If you want to
+subset a list, but not extract an element, then you
+will likely use [.
+
+
R
+
+
+xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))
+xlist[1]
+
+
+
OUTPUT
+
+
$a
+[1] "Software Carpentry"
+
+
This returns a list with one element.
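To make the difference between the three operators concrete, here is a small sketch using a made-up list `lst` (a stand-in, not the lesson's `xlist`):

```r
# Hypothetical two-element list
lst <- list(a = "Software Carpentry", b = 1:10)

class(lst[1])     # "list": [ keeps the list wrapper
class(lst[[1]])   # "character": [[ extracts the element itself
lst$b[3]          # $ extracts by name; the result is subset like any vector
```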
+
We can subset elements of a list exactly the same way as atomic
+vectors using [. Comparison operations, however, won’t work
+because they’re not recursive: they will try to condition on the data
+structures in each element of the list, not the individual elements
+within those data structures.
+xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))
+
+
Using your knowledge of both list and vector subsetting, extract the
+number 2 from xlist. Hint: the number 2 is contained within the “b” item
+in the list.
+
+
+
+
+
+
+
+
+
+
R
+
+
+xlist$b[2]
+
+
+
OUTPUT
+
+
[1] 2
+
+
+
R
+
+
+xlist[[2]][2]
+
+
+
OUTPUT
+
+
[1] 2
+
+
+
R
+
+
+xlist[["b"]][2]
+
+
+
OUTPUT
+
+
[1] 2
+
+
+
+
+
+
+
+
+
+
+
Challenge 6
+
+
+
Given a linear model:
+
+
R
+
+
+mod <- aov(pop ~ lifeExp, data = gapminder)
+
+
Extract the residual degrees of freedom (hint:
+attributes() will help you)
+
+
+
+
+
+
+
+
+
+
R
+
+
+attributes(mod) ## `df.residual` is one of the names of `mod`
+
+
+
R
+
+
+mod$df.residual
+
+
+
+
+
+
Data frames
+
+
+
Remember that data frames are lists under the hood, so similar
+rules apply. However, they are also two-dimensional objects:
+
[ with one argument will act the same way as for lists,
+where each list element corresponds to a column. The resulting object
+will be a data frame:
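As a sketch of this behaviour, the block below uses a tiny made-up data frame `df` in place of gapminder:

```r
# Hypothetical stand-in for gapminder
df <- data.frame(country = c("A", "B", "C"),
                 year    = c(1952, 1957, 1962),
                 pop     = c(100, 200, 300))

one_col <- df[3]   # one argument: selects the third list element (column)
class(one_col)     # still a data frame
names(one_col)
```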
Similarly, [[ will act to extract a single
+column:
+
+
R
+
+
+head(gapminder[["lifeExp"]])
+
+
+
OUTPUT
+
+
[1] 28.801 30.332 31.997 34.020 36.088 38.438
+
+
And $ provides a convenient shorthand to extract columns
+by name:
+
+
R
+
+
+head(gapminder$year)
+
+
+
OUTPUT
+
+
[1] 1952 1957 1962 1967 1972 1977
+
+
With two arguments, [ behaves the same way as for
+matrices:
+
+
R
+
+
+gapminder[1:3,]
+
+
+
OUTPUT
+
+
country year pop continent lifeExp gdpPercap
+1 Afghanistan 1952 8425333 Asia 28.801 779.4453
+2 Afghanistan 1957 9240934 Asia 30.332 820.8530
+3 Afghanistan 1962 10267083 Asia 31.997 853.1007
+
+
If we subset a single row, the result will be a data frame (because
+the elements are mixed types):
+
+
R
+
+
+gapminder[3,]
+
+
+
OUTPUT
+
+
country year pop continent lifeExp gdpPercap
+3 Afghanistan 1962 10267083 Asia 31.997 853.1007
+
+
But for a single column the result will be a vector (this can be
+changed with the third argument, drop = FALSE).
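The effect of `drop = FALSE` can be sketched with a small made-up data frame (`df`, `as_vector`, and `as_frame` are illustrative names):

```r
# Hypothetical two-column data frame
df <- data.frame(year = c(1952, 1957, 1962), pop = c(100, 200, 300))

as_vector <- df[, "pop"]                 # single column: simplified to a vector
as_frame  <- df[, "pop", drop = FALSE]   # drop = FALSE keeps the data frame

is.numeric(as_vector)
is.data.frame(as_frame)
```

Keeping the data frame structure can be helpful when later code expects columns to be accessed by name.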
+
+
+
+
+
+
Challenge 7
+
+
+
Fix each of the following common data frame subsetting errors:
+
+
Extract observations collected for the year 1957
+
+
+
R
+
+
gapminder[gapminder$year =1957,]
+
+
+
Extract all columns except 1 through to 4
+
+
+
R
+
+
+gapminder[,-1:4]
+
+
+
Extract the rows where the life expectancy is longer than 80
+years
+
+
+
R
+
+
+gapminder[gapminder$lifeExp>80]
+
+
+
Extract the first row, and the fourth and fifth columns
+(continent and lifeExp).
+
+
+
R
+
+
+gapminder[1, 4, 5]
+
+
+
Advanced: extract rows that contain information for the years 2002
+and 2007
+
+
+
R
+
+
+gapminder[gapminder$year==2002|2007,]
+
+
+
+
+
+
+
+
+
+
Fix each of the following common data frame subsetting errors:
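One possible set of corrections is sketched below. So that the block runs on its own, it uses a small made-up data frame `gm` with the same column order as gapminder; the values are invented for illustration.

```r
# Hypothetical stand-in with the same column order as gapminder
gm <- data.frame(
  country   = c("A", "A", "B", "B"),
  year      = c(1957, 2002, 2002, 2007),
  pop       = c(1e6, 2e6, 3e6, 4e6),
  continent = c("Asia", "Asia", "Europe", "Europe"),
  lifeExp   = c(45, 81, 79, 82),
  gdpPercap = c(800, 900, 30000, 32000)
)

gm[gm$year == 1957, ]              # 1. `==` tests equality; `=` assigns
gm[, -c(1:4)]                      # 2. negate the whole range, not just the 1
gm[gm$lifeExp > 80, ]              # 3. the comma marks this as a row condition
gm[1, c(4, 5)]                     # 4. combine the column indices with c()
gm[gm$year %in% c(2002, 2007), ]   # 5. test each year against both values
```

For the last case, `gm$year == 2002 | gm$year == 2007` would work as well; `%in%` is simply shorter.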
Write conditional statements with if...else statements
+and ifelse().
+
Write and understand for() loops.
+
+
+
+
+
+
+
Often when we’re coding we want to control the flow of our actions.
+This can be done by setting actions to occur only if a condition or a
+set of conditions are met. Alternatively, we can also set an action to
+occur a particular number of times.
+
There are several ways you can control flow in R. For conditional
+statements, the most commonly used approaches are the constructs:
+
+
R
+
+
# if
+if (condition is true) {
+ perform action
+}
+
+# if ... else
+if (condition is true) {
+ perform action
+} else { # that is, if the condition is false,
+ perform alternative action
+}
+
+
Say, for example, that we want R to print a message if a variable
+x has a particular value:
+
+
R
+
+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+}
+
+x
+
+
+
OUTPUT
+
+
[1] 8
+
+
The print statement does not appear in the console because x is not
+greater than or equal to 10. To print a different message for numbers less than 10,
+we can add an else statement.
+
+
R
+
+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else {
+  print("x is less than 10")
+}
+
+
+
OUTPUT
+
+
[1] "x is less than 10"
+
+
You can also test multiple conditions by using
+else if.
+
+
R
+
+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else if (x > 5) {
+  print("x is greater than 5, but less than 10")
+} else {
+  print("x is less than 5")
+}
+
+
+
OUTPUT
+
+
[1] "x is greater than 5, but less than 10"
+
+
Important: when R evaluates the condition inside
+if() statements, it is looking for a logical element, i.e.,
+TRUE or FALSE. This can cause some headaches
+for beginners. For example:
+
+
R
+
+
+x <- 4 == 3
+if (x) {
+  "4 equals 3"
+} else {
+  "4 does not equal 3"
+}
+
+
+
OUTPUT
+
+
[1] "4 does not equal 3"
+
+
As we can see, the “not equal” message was printed because the vector x
+is FALSE:
+
+
R
+
+
+x <- 4 == 3
+x
+
+
+
OUTPUT
+
+
[1] FALSE
+
+
+
+
+
+
+
Challenge 1
+
+
+
Use an if() statement to print a suitable message
+reporting whether there are any records from 2002 in the
+gapminder dataset. Now do the same for 2012.
+
+
+
+
+
+
+
+
+
We will first see a solution to Challenge 1 which does not use the
+any() function. We first use a logical vector describing
+which elements of gapminder$year are equal to
+2002 to select the matching rows:
+
+
R
+
+
+gapminder[(gapminder$year == 2002), ]
+
+
Then, we count the number of rows of the data.frame
+gapminder that correspond to the year 2002:
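The counting step can be sketched as follows. The data frame `gm` is a made-up stand-in for gapminder so the block runs on its own; with the real data you would use `gapminder` in its place.

```r
# Hypothetical stand-in for gapminder
gm <- data.frame(year = c(1997, 2002, 2002, 2007),
                 pop  = c(10, 20, 30, 40))

# Number of rows whose year is 2002
rows2002_number <- nrow(gm[(gm$year == 2002), ])
rows2002_number
```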
The presence of any record for the year 2002 is equivalent to the
+request that rows2002_number is one or more:
+
+
R
+
+
+rows2002_number >= 1
+
+
Putting all together, we obtain:
+
+
R
+
+
+if (nrow(gapminder[(gapminder$year == 2002), ]) >= 1) {
+  print("Record(s) for the year 2002 found.")
+}
+
+
All this can be done more quickly with any(). The
+logical condition can be expressed as:
+
+
R
+
+
+if (any(gapminder$year == 2002)) {
+  print("Record(s) for the year 2002 found.")
+}
+
+
+
+
+
+
Did anyone get an error message like this?
+
+
ERROR
+
+
Error in if (gapminder$year == 2012) {: the condition has length > 1
+
+
The if() function only accepts singular (of length 1)
+inputs, and therefore returns an error when you use it with a vector.
+(Versions of R before 4.2.0 issued a warning instead and evaluated
+only the first element of the vector.) Therefore, to use the
+if() function, you need to make sure your input is singular
+(of length 1).
+
+
+
+
+
+
Tip: Built in ifelse()
+function
+
+
+
R accepts both if() and
+else if() statements structured as outlined above, but also
+statements using R’s built-in ifelse()
+function. This function accepts both singular and vector inputs and is
+structured as follows:
+
+
R
+
+
# ifelse function
+ifelse(condition is true, perform action, perform alternative action)
+
+
where the first argument is the condition or a set of conditions to
+be met, the second argument is the statement that is evaluated when the
+condition is TRUE, and the third argument is the statement
+that is evaluated when the condition is FALSE.
+
+
R
+
+
+y <- -3
+ifelse(y < 0, "y is a negative number", "y is either positive or zero")
+
+
+
OUTPUT
+
+
[1] "y is a negative number"
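Because ifelse() is vectorized, the same call also works element by element on a longer vector. A brief sketch with made-up values (`signs` is an illustrative name):

```r
y <- c(-3, 0, 4)   # hypothetical input vector
signs <- ifelse(y < 0, "negative", "positive or zero")
signs   # one result per element of y
```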
+
+
+
+
+
+
+
+
+
+
Tip: any() and
+all()
+
+
+
The any() function will return TRUE if at
+least one TRUE value is found within a vector, otherwise it
+will return FALSE. This can be used in a similar way to the
+%in% operator. The function all(), as the name
+suggests, will only return TRUE if all values in the vector
+are TRUE.
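A short sketch of both functions, using a made-up vector of years rather than the gapminder data:

```r
years <- c(1997, 2002, 2007)   # hypothetical values

any(years == 2002)   # at least one element matches
any(years == 2012)   # no element matches
all(years > 1990)    # every element satisfies the condition
all(years > 2000)    # 1997 does not

# any() collapses a vector comparison to a single value, which if() requires
if (any(years == 2012)) {
  msg <- "Record(s) for the year 2012 found."
} else {
  msg <- "No records for the year 2012."
}
msg
```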
+
+
+
+
Repeating operations
+
+
+
If you want to iterate over a set of values, when the order of
+iteration is important, and perform the same operation on each, a
+for() loop will do the job. We saw for() loops
+in the shell
+lessons earlier. This is the most flexible of looping operations,
+but therefore also the hardest to use correctly. In general, the advice
+of many R users would be to learn about for()
+loops, but to avoid using for() loops unless the order of
+iteration is important: i.e. the calculation at each iteration depends
+on the results of previous iterations. If the order of iteration is not
+important, then you should learn about vectorized alternatives, such as
+the purrr package, as they pay off in computational
+efficiency.
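As a minimal sketch (the ranges here are chosen arbitrarily for illustration), a for() loop names a loop variable and the values it should take. Loops can also be nested so that an inner loop runs in full for each step of the outer one:

```r
# Basic form: the loop variable takes each value in turn
for (i in 1:3) {
  print(i)
}

# Nested loops: the inner index j runs through all of its values
# before the outer index i moves on
for (i in 1:2) {
  for (j in c("a", "b", "c")) {
    print(paste(i, j))
  }
}
```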
We notice in the output that when the first index (i) is
+set to 1, the second index (j) iterates through its full
+set of indices. Once the indices of j have been iterated
+through, then i is incremented. This process continues
+until the last index has been used for each for() loop.
+
Rather than printing the results, we could write the loop output to a
+new object.
This approach can be useful, but ‘growing your results’ (building the
+result object incrementally) is computationally inefficient, so avoid it
+when you are iterating through a lot of values.
+
+
+
+
+
+
Tip: don’t grow your results
+
+
+
One of the biggest things that trips up novices and experienced R
+users alike is building a results object (vector, list, matrix, data
+frame) as your for loop progresses. R typically has to copy the whole
+object each time it grows, so your calculations can very quickly slow to a crawl. It’s much
+better to define an empty results object beforehand with the appropriate
+dimensions, rather than initializing an empty object without dimensions.
+So if you know the end result will be stored in a matrix like above,
+create an empty matrix with 5 rows and 5 columns, then at each iteration
+store the results in the appropriate location.
+
+
+
+
A better way is to define your (empty) output object before filling
+in the values. For this example, it looks more involved, but is still
+more efficient.
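A sketch of the preallocation pattern is shown below. The name `j_vector` and the 5 by 5 size are assumptions chosen to match the matrix described in the tip:

```r
# Preallocate a 5 x 5 matrix, then fill it in place
output_matrix <- matrix(nrow = 5, ncol = 5)
j_vector <- c("a", "b", "c", "d", "e")

for (i in 1:5) {
  for (j in 1:5) {
    temp_j_value <- j_vector[j]
    temp_output <- paste(i, temp_j_value)
    output_matrix[i, j] <- temp_output   # store at a known position
  }
}

# Flatten the filled matrix into a vector
output_vector2 <- as.vector(output_matrix)
output_vector2
```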
Sometimes you will find yourself needing to repeat an operation as
+long as a certain condition is met. You can do this with a
+while() loop.
+
+
R
+
+
while(this condition is true){
+ do a thing
+}
+
+
R will interpret a condition being met as “TRUE”.
+
As an example, here’s a while loop that generates random numbers from
+a uniform distribution (the runif() function) between 0 and
+1 until it gets one that’s less than 0.1.
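A minimal sketch of such a loop follows. The starting value `z <- 1` and the seed are arbitrary choices made here so that the loop is entered at least once and the run is repeatable:

```r
set.seed(42)   # arbitrary seed, only so the run is repeatable

z <- 1
while (z > 0.1) {
  z <- runif(1)   # draw one number from a uniform distribution on (0, 1)
  cat(z, "\n")
}
```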
while() loops will not always be appropriate. You have
+to be particularly careful that you don’t end up stuck in an infinite
+loop because your condition is always met and hence the while statement
+never terminates.
+
+
+
+
+
+
+
+
+
Challenge 2
+
+
+
Compare the objects output_vector and
+output_vector2. Are they the same? If not, why not? How
+would you change the last block of code to make
+output_vector2 the same as output_vector?
+
+
+
+
+
+
+
+
+
We can check whether the two vectors are identical using the
+all() function:
+
+
R
+
+
+all(output_vector == output_vector2)
+
+
However, all the elements of output_vector can be found
+in output_vector2:
+
+
R
+
+
+all(output_vector %in% output_vector2)
+
+
and vice versa:
+
+
R
+
+
+all(output_vector2 %in% output_vector)
+
+
Therefore, the elements in output_vector and
+output_vector2 are just sorted in a different order. This
+is because as.vector() outputs the elements of an input
+matrix by going down its columns. Taking a look at
+output_matrix, we can see that we want its elements by
+rows. The solution is to transpose the output_matrix. We
+can do it either by calling the transpose function t() or
+by inputting the elements in the right order. The first solution
+requires changing the original
+
+
R
+
+
+output_vector2 <- as.vector(output_matrix)
+
+
into
+
+
R
+
+
+output_vector2 <- as.vector(t(output_matrix))
+
+
The second solution requires changing
+
+
R
+
+
+output_matrix[i, j] <- temp_output
+
+
into
+
+
R
+
+
+output_matrix[j, i] <- temp_output
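The column-wise behaviour of as.vector() is easy to check on a small example. This sketch uses a made-up matrix `mat` rather than the challenge's output_matrix:

```r
# Hypothetical 2 x 3 matrix filled by row
mat <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

by_column <- as.vector(mat)      # reads down each column in turn
by_row    <- as.vector(t(mat))   # transposing first gives row order

by_column
by_row
```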
+
+
+
+
+
+
+
+
+
+
+
Challenge 3
+
+
+
Write a script that loops through the gapminder data by
+continent and prints out whether the mean life expectancy is smaller or
+larger than 50 years.
+
+
+
+
+
+
+
+
+
Step 1: We want to make sure we can extract all the
+unique values of the continent vector
Step 2: We also need to loop over each of these
+continents and calculate the average life expectancy for each
+subset of data. We can do that as follows:
+
+
Loop over each of the unique values of ‘continent’
+
For each value of continent, create a temporary variable storing
+that subset
+
Return the calculated life expectancy to the user by printing the
+output:
Step 3: The exercise only wants the output printed
+if the average life expectancy is less than 50 or greater than 50. So we
+need to add an if() condition before printing, which
+evaluates whether the calculated average life expectancy is above or
+below a threshold, and prints an output conditional on the result. We
+need to amend (3) from above:
+
3a. If the calculated life expectancy is less than some threshold (50
+years), return the continent and a statement that life expectancy is
+less than threshold, otherwise return the continent and a statement that
+life expectancy is greater than threshold:
+
+
R
+
+
+thresholdValue <- 50
+
+for (iContinent in unique(gapminder$continent)) {
+  tmp <- mean(gapminder[gapminder$continent == iContinent, "lifeExp"])
+
+  if (tmp < thresholdValue) {
+    cat("Average Life Expectancy in", iContinent, "is less than", thresholdValue, "\n")
+  } else {
+    cat("Average Life Expectancy in", iContinent, "is greater than", thresholdValue, "\n")
+  } # end if else condition
+  rm(tmp)
+} # end for loop
+
+
+
+
+
+
+
+
+
+
+
Challenge 4
+
+
+
Modify the script from Challenge 3 to loop over each country. This
+time print out whether the life expectancy is smaller than 50, between
+50 and 70, or greater than 70.
+
+
+
+
+
+
+
+
+
We modify our solution to Challenge 3 by now adding two thresholds,
+lowerThreshold and upperThreshold and
+extending our if-else statements:
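One way this could look is sketched below. To keep the block self-contained it loops over a small made-up data frame `gm`; with the real data you would substitute `gapminder`.

```r
# Hypothetical stand-in for gapminder
gm <- data.frame(
  country = c("A", "A", "B", "B", "C", "C"),
  lifeExp = c(40, 44, 60, 66, 75, 79)
)

lowerThreshold <- 50
upperThreshold <- 70

for (iCountry in unique(gm$country)) {
  tmp <- mean(gm[gm$country == iCountry, "lifeExp"])

  if (tmp < lowerThreshold) {
    cat("Average Life Expectancy in", iCountry, "is less than", lowerThreshold, "\n")
  } else if (tmp < upperThreshold) {
    cat("Average Life Expectancy in", iCountry, "is between", lowerThreshold, "and", upperThreshold, "\n")
  } else {
    cat("Average Life Expectancy in", iCountry, "is greater than", upperThreshold, "\n")
  }
}
```

Because the first branch has already ruled out values below the lower threshold, the `else if` only needs to test the upper one.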
Write a script that loops over each country in the
+gapminder dataset, tests whether the country starts with a
+‘B’, and graphs life expectancy against time as a line graph if the mean
+life expectancy is under 50 years.
+
+
+
+
+
+
+
+
+
We will use the grep() command that was introduced in
+the Unix
+Shell lesson to find countries that start with “B.” Let’s understand
+how to do this first. Following from the Unix shell section we may be
+tempted to try the following:
+
+
R
+
+
+grep("^B", unique(gapminder$country))
+
+
But when we evaluate this command it returns the indices of the
+factor variable country that start with “B.” To get the
+values, we must add the value=TRUE option to the
+grep() command:
+
+
R
+
+
+grep("^B", unique(gapminder$country), value = TRUE)
+
+
We will now store these countries in a variable called
+candidateCountries, and then loop over each entry in the variable.
+Inside the loop, we evaluate the average life expectancy for each
+country, and if the average life expectancy is less than 50 we use
+base-plot to plot the evolution of average life expectancy using
+with() and subset():
+
+
R
+
+
+thresholdValue <- 50
+candidateCountries <- grep("^B", unique(gapminder$country), value = TRUE)
+
+for (iCountry in candidateCountries) {
+  tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"])
+
+  if (tmp < thresholdValue) {
+    cat("Average Life Expectancy in", iCountry, "is less than", thresholdValue, "plotting life expectancy graph... \n")
+
+    with(subset(gapminder, country == iCountry),
+      plot(year, lifeExp,
+        type = "o",
+        main = paste("Life Expectancy in", iCountry, "over time"),
+        ylab = "Life Expectancy",
+        xlab = "Year"
+      ) # end plot
+    ) # end with
+  } # end if
+  rm(tmp)
+} # end for loop
Today we’ll be learning about the ggplot2 package, because it is the
+most effective for creating publication-quality graphics.
+
ggplot2 is built on the grammar of graphics, the idea that any plot
+can be built from the same set of components: a data
+set, mapping aesthetics, and graphical
+layers:
+
+
Data sets are the data that you, the user,
+provide.
+
Mapping aesthetics are what connect the data to
+the graphics. They tell ggplot2 how to use your data to affect how the
+graph looks, such as changing what is plotted on the X or Y axis, or the
+size or color of different data points.
+
Layers are the actual graphical output from
+ggplot2. Layers determine what kinds of plot are shown (scatterplot,
+histogram, etc.), the coordinate system used (rectangular, polar,
+others), and other important aspects of the plot. The idea of layers of
+graphics may be familiar to you if you have used image editing programs
+like Photoshop, Illustrator, or Inkscape.
+
+
Let’s start off building an example using the gapminder data from
+earlier. The most basic function is ggplot, which lets R
+know that we’re creating a new plot. Any of the arguments we give the
+ggplot function are the global options for the
+plot: they apply to all layers on the plot.
+
+
R
+
+
+library("ggplot2")
+ggplot(data = gapminder)
+
+
Here we called ggplot and told it what data we want to
+show on our figure. This is not enough information for
+ggplot to actually draw anything. It only creates a blank
+slate for other elements to be added to.
+
Now we’re going to add in the mapping aesthetics
+using the aes function. aes tells
+ggplot how variables in the data map to
+aesthetic properties of the figure, such as which columns of
+the data should be used for the x and
+y locations.
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
+
+
Here we told ggplot we want to plot the “gdpPercap”
+column of the gapminder data frame on the x-axis, and the “lifeExp”
+column on the y-axis. Notice that we didn’t need to explicitly pass
+aes these columns
+(e.g. x = gapminder[, "gdpPercap"]). This is because
+ggplot is smart enough to know to look in the
+data for that column!
+
The final part of making our plot is to tell ggplot how
+we want to visually represent the data. We do this by adding a new
+layer to the plot using one of the
+geom functions.
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point()
+
+
Here we used geom_point, which tells ggplot
+we want to visually represent the relationship between
+x and y as a scatterplot of
+points.
+
+
+
+
+
+
Challenge 1
+
+
+
Modify the example so that the figure shows how life expectancy has
+changed over time:
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point()
+
+
Hint: the gapminder dataset has a column called “year”, which should
+appear on the x-axis.
+
+
+
+
+
+
+
+
+
Here is one possible solution:
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) + geom_point()
+
+
+
+
+
+
+
+
+
+
+
+
Challenge 2
+
+
+
In the previous examples and challenge we’ve used the
+aes function to tell the scatterplot geom
+about the x and y locations of each
+point. Another aesthetic property we can modify is the point
+color. Modify the code from the previous challenge to
+color the points by the “continent” column. What trends
+do you see in the data? Are they what you expected?
+
+
+
+
+
+
+
+
+
The solution presented below adds color=continent to the
+call of the aes function. The general trend seems to
+indicate an increased life expectancy over the years. On continents with
+stronger economies we find a longer life expectancy.
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, color = continent)) +
+  geom_point()
+
+
+
+
+
+
+
Layers
+
+
+
Using a scatterplot probably isn’t the best for visualizing change
+over time. Instead, let’s tell ggplot to visualize the data
+as a line plot:
Instead of adding a geom_point layer, we’ve added a
+geom_line layer.
+
However, the result doesn’t look quite as we might have expected: it
+seems to be jumping around a lot in each continent. Let’s try to
+separate the data by country, plotting one line for each country:
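One way to get a line per country is the `group` aesthetic. The sketch below assumes ggplot2 is installed and uses a tiny made-up data frame `toy` in place of gapminder:

```r
library(ggplot2)

# Hypothetical stand-in for gapminder: two countries on one continent
toy <- data.frame(
  country   = rep(c("A", "B"), each = 3),
  continent = "Asia",
  year      = rep(c(1952, 1957, 1962), times = 2),
  lifeExp   = c(40, 42, 45, 60, 63, 65)
)

# group = country draws one line per country; color still follows continent
p <- ggplot(data = toy,
            mapping = aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line()
p
```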
It’s important to note that each layer is drawn on top of the
+previous layer. In this example, the points have been drawn on top
+of the lines. Here’s a demonstration:
In this example, the aesthetic mapping of
+color has been moved from the global plot options in
+ggplot to the geom_line layer so it no longer
+applies to the points. Now we can clearly see that the points are drawn
+on top of the lines.
+
+
+
+
+
+
Tip: Setting an aesthetic to a value instead
+of a mapping
+
+
+
So far, we’ve seen how to use an aesthetic (such as
+color) as a mapping to a variable in the data.
+For example, when we use
+geom_line(mapping = aes(color=continent)), ggplot will give
+a different color to each continent. But what if we want to change the
+color of all lines to blue? You may think that
+geom_line(mapping = aes(color="blue")) should work, but it
+doesn’t. Since we don’t want to create a mapping to a specific variable,
+we can move the color specification outside of the aes()
+function, like this: geom_line(color="blue").
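The difference between setting and mapping can be sketched side by side. This assumes ggplot2 is installed; `toy`, `p_blue`, and `p_mapped` are made-up names for illustration:

```r
library(ggplot2)

# Hypothetical data: one series of life expectancy values
toy <- data.frame(year = c(1952, 1957, 1962), lifeExp = c(40, 42, 45))

# Setting: color is a fixed value outside aes(), so the line is drawn in blue
p_blue <- ggplot(toy, aes(x = year, y = lifeExp)) +
  geom_line(color = "blue")

# Mapping: inside aes(), "blue" is treated as a data value, so ggplot2
# picks a color from its default palette and adds a legend entry "blue"
p_mapped <- ggplot(toy, aes(x = year, y = lifeExp)) +
  geom_line(mapping = aes(color = "blue"))
```

Printing `p_blue` and `p_mapped` in turn shows the two results.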
+
+
+
+
+
+
+
+
+
Challenge 3
+
+
+
Switch the order of the point and line layers from the previous
+example. What happened?
ggplot2 also makes it easy to overlay statistical models over the
+data. To demonstrate we’ll go back to our first example:
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point()
+
+
Currently it’s hard to see the relationship between the points due to
+some strong outliers in GDP per capita. We can change the scale of units
+on the x axis using the scale functions. These control the
+mapping between the data values and visual values of an aesthetic. We
+can also modify the transparency of the points, using the alpha
+argument, which is especially helpful when you have a large amount of
+data which is very clustered.
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10()
+
+
The scale_x_log10 function applied a log transformation to
+the x-axis scale of the plot, so that each factor of 10 is evenly
+spaced from left to right. For example, a GDP per capita of 1,000 is the
+same horizontal distance away from a value of 10,000 as the 10,000 value
+is from 100,000. This helps to visualize the spread of the data along
+the x-axis.
+
+
+
+
+
+
Tip Reminder: Setting an aesthetic to a value
+instead of a mapping
+
+
+
Notice that we used geom_point(alpha = 0.5). As the
+previous tip mentioned, using a setting outside of the
+aes() function will cause this value to be used for all
+points, which is what we want in this case. But just like any other
+aesthetic setting, alpha can also be mapped to a variable in
+the data. For example, we can give a different transparency to each
+continent with
+geom_point(mapping = aes(alpha = continent)).
+
+
+
+
We can fit a simple relationship to the data by adding another layer,
+geom_smooth:
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method = "lm")
+
+
+
OUTPUT
+
+
`geom_smooth()` using formula = 'y ~ x'
+
+
We can make the line thicker by setting the
+size aesthetic in the geom_smooth
+layer:
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method = "lm", size = 1.5)
+
+
+
WARNING
+
+
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
+ℹ Please use `linewidth` instead.
+This warning is displayed once every 8 hours.
+Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
+generated.
+
+
+
OUTPUT
+
+
`geom_smooth()` using formula = 'y ~ x'
+
+
There are two ways an aesthetic can be specified. Here we
+set the size aesthetic by passing it as an
+argument to geom_smooth. Previously in the lesson we’ve
+used the aes function to define a mapping between
+data variables and their visual representation.
+
+
+
+
+
+
Challenge 4a
+
+
+
Modify the color and size of the points on the point layer in the
+previous example.
+
Hint: do not use the aes function.
+
+
+
+
+
+
+
+
+
Here is a possible solution: Notice that the color argument
+is supplied outside of the aes() function. This means that
+it applies to all data points on the graph and is not related to a
+specific variable.
Modify your solution to Challenge 4a so that the points are now a
+different shape and are colored by continent with new trendlines. Hint:
+The color argument can be used inside the aesthetic.
+
+
+
+
+
+
+
+
+
Here is a possible solution: Notice that supplying the
+color argument inside the aes() function
+enables you to connect it to a certain variable. The shape
+argument, as you can see, modifies all data points the same way (it is
+outside the aes() call), while the color
+argument, which is placed inside the aes() call, modifies a
+point’s color based on its continent value.
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
+  geom_point(size = 3, shape = 17) + scale_x_log10() +
+  geom_smooth(method = "lm", size = 1.5)
+
+
+
OUTPUT
+
+
`geom_smooth()` using formula = 'y ~ x'
+
+
+
+
+
+
+
Multi-panel figures
+
+
+
Earlier we visualized the change in life expectancy over time across
+all countries in one plot. Alternatively, we can split this out over
+multiple panels by adding a layer of facet panels.
+
+
+
+
+
+
Tip
+
+
+
We start by making a subset of data including only countries located
+in the Americas. This includes 25 countries, which will begin to clutter
+the figure. Note that we apply a “theme” definition to rotate the x-axis
+labels to maintain readability. Nearly everything in ggplot2 is
+customizable.
The facet_wrap layer took a “formula” as its argument,
+denoted by the tilde (~). This tells R to draw a panel for each unique
+value in the country column of the gapminder dataset.
+
Modifying text
+
+
+
To clean this figure up for a publication we need to change some of
+the text elements. The x-axis is too cluttered, and the y axis should
+read “Life expectancy”, rather than the column name in the data
+frame.
+
We can do this by adding a couple of different layers. The
+theme layer controls the axis text, and overall text
+size. Labels for the axes, plot title and any legend can be set using
+the labs function. Legend titles are set using the same
+names we used in the aes specification. Thus below the
+color legend title is set using color = "Continent", while
+the title of a fill legend would be set using
+fill = "MyTitle".
+
+
R
+
+
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color = continent)) +
+  geom_line() + facet_wrap(~country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+
+
Exporting the plot
+
+
+
The ggsave() function allows you to export a plot
+created with ggplot. You can specify the dimension and resolution of
+your plot by adjusting the appropriate arguments (width,
+height and dpi) to create high quality
+graphics for publication. In order to save the plot from above, we first
+assign it to a variable lifeExp_plot, then tell
+ggsave to save that plot in png format to a
+directory called results. (Make sure you have a
+results/ folder in your working directory.)
+
+
R
+
+
+lifeExp_plot <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color = continent)) +
+  geom_line() + facet_wrap(~country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+
+ggsave(filename = "results/lifeExp.png", plot = lifeExp_plot, width = 12, height = 10, dpi = 300, units = "cm")
+
+
There are two nice things about ggsave. First, it
+defaults to the last plot, so if you omit the plot argument
+it will automatically save the last plot you created with
+ggplot. Secondly, it tries to determine the format you want
+to save your plot in from the file extension you provide for the
+filename (for example .png or .pdf). If you
+need to, you can specify the format explicitly in the
+device argument.
+
This is a taste of what you can do with ggplot2. RStudio provides a
+really useful cheat
+sheet of the different layers available, and more extensive
+documentation is available on the ggplot2 website. All
+RStudio cheat sheets are also available online. Finally,
+if you have no idea how to change something, a quick Google search will
+usually send you to a relevant question and answer on Stack Overflow
+with reusable code to modify!
+
+
+
+
+
+
Challenge 5
+
+
+
Generate boxplots to compare life expectancy between the different
+continents during the available years.
+
Advanced:
+
+
Rename y axis as Life Expectancy.
+
Remove x axis labels.
+
+
+
+
+
+
+
+
+
+
Here is a possible solution: xlab() and ylab()
+set labels for the x and y axes, respectively. The axis title, text and
+ticks are attributes of the theme and must be modified within a
+theme() call.
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = continent, y = lifeExp, fill = continent)) +
+  geom_boxplot() + facet_wrap(~year) +
+  ylab("Life Expectancy") +
+  theme(axis.title.x = element_blank(),
+        axis.text.x = element_blank(),
+        axis.ticks.x = element_blank())
+
+
+
+
+
+
+
+
+
+
+
+
Keypoints
+
+
+
+
Use ggplot2 to create plots.
+
Think about graphics in layers: aesthetics, geometry, statistics,
+scale transformation, and grouping.
How can I operate on all the elements of a vector at once?
+
+
+
+
+
+
+
+
Objectives
+
+
To understand vectorized operations in R.
+
+
+
+
+
+
+
Most of R’s functions are vectorized, meaning that the function will
+operate on all elements of a vector without needing to loop through and
+act on each element one at a time. This makes writing code more concise,
+easy to read, and less error prone.
+
+
R
+
+
+x <- 1:4
+x * 2
+
+
+
OUTPUT
+
+
[1] 2 4 6 8
+
+
The multiplication happened to each element of the vector.
+
We can also add two vectors together:
+
+
R
+
+
+y <- 6:9
+x + y
+
+
+
OUTPUT
+
+
[1] 7 9 11 13
+
+
Each element of x was added to its corresponding element
+of y:
+
+
R
+
+
x:  1  2  3  4
+    +  +  +  +
+y:  6  7  8  9
+---------------
+    7  9 11 13
+
+
Here is how we would add two vectors together using a for loop:
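A minimal sketch of the loop version, reusing the `x` and `y` values from above (the result vector is preallocated here, which is a choice made for this sketch):

```r
x <- 1:4
y <- 6:9

output_vector <- numeric(length(x))   # preallocate the result
for (i in seq_along(x)) {
  output_vector[i] <- x[i] + y[i]     # add one pair of elements at a time
}
output_vector
```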
Compare this to the output using vectorised operations.
+
+
R
+
+
+sum_xy <- x + y
+sum_xy
+
+
+
OUTPUT
+
+
[1] 7 9 11 13
+
+
+
+
+
+
+
Challenge 1
+
+
+
Let’s try this on the pop column of the
+gapminder dataset.
+
Make a new column in the gapminder data frame that
+contains population in units of millions of people. Check the head or
+tail of the data frame to make sure it worked.
+
+
+
+
+
+
+
+
+
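One way to approach this is a single vectorized division. The sketch below uses a small made-up data frame `gm` and the column name `pop_millions`, both of which are illustrative; with the real data you would use `gapminder`.

```r
# Hypothetical stand-in for gapminder
gm <- data.frame(country = c("A", "B"), pop = c(8425333, 31889923))

gm$pop_millions <- gm$pop / 1e6   # one vectorized division, no loop needed
head(gm)
```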
Operations can also be performed on vectors of unequal length,
+through a process known as recycling. This process
+automatically repeats the shorter vector until it matches the length of
+the longer vector. R will provide a warning if the length of the longer vector is not
+a multiple of the length of the shorter vector.
+
+
R
+
+
+x <- c(1, 2, 3)
+y <- c(1, 2, 3, 4, 5, 6, 7)
+x + y
+
+
+
WARNING
+
+
Warning in x + y: longer object length is not a multiple of shorter object
+length
+
+
+
OUTPUT
+
+
[1] 2 4 6 5 7 9 8
+
+
Vector x was recycled to match the length of vector
+y
Check argument conditions with stopifnot() in
+functions.
+
Test a function.
+
Set default values for function arguments.
+
Explain why we should divide programs into small, single-purpose
+functions.
+
+
+
+
+
+
+
If we only had one data set to analyze, it would probably be faster
+to load the file into a spreadsheet and use that to plot simple
+statistics. However, the gapminder data is updated periodically, and we
+may want to pull in that new information later and re-run our analysis
+again. We may also obtain similar data from a different source in the
+future.
+
In this lesson, we’ll learn how to write a function so that we can
+repeat several operations with a single command.
+
+
+
+
+
+
What is a function?
+
+
+
Functions gather a sequence of operations into a whole, preserving it
+for ongoing use. Functions provide:
+
+
a name we can remember and invoke it by
+
relief from the need to remember the individual operations
+
a defined set of inputs and expected outputs
+
rich connections to the larger programming environment
+
+
As the basic building block of most programming languages,
+user-defined functions constitute “programming” as much as any single
+abstraction can. If you have written a function, you are a computer
+programmer.
+
+
+
+
Defining a function
+
+
+
Let’s open a new R script file in the functions/
+directory and call it functions-lesson.R.
+
The general structure of a function is:
+
+
R
+
+
+my_function <- function(parameters) {
+  # perform action
+  # return value
+}
+
+
Let’s define a function fahr_to_kelvin() that converts
+temperatures from Fahrenheit to Kelvin:
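A sketch of the definition, using the same conversion formula that appears in the defensive-programming version of this function later in the lesson:

```r
fahr_to_kelvin <- function(temp) {
  # Fahrenheit -> Celsius, then shift by 273.15 to get Kelvin
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}
```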
We define fahr_to_kelvin() by assigning it to the output
+of function. The list of argument names is contained
+within parentheses. Next, the body of the function (the statements
+that are executed when it runs) is contained
+within curly braces ({}). The statements in the body are
+indented by two spaces. This makes the code easier to read but does not
+affect how the code operates.
+
It is useful to think of creating functions like writing a cookbook.
+First you define the “ingredients” that your function needs. In this
+case, we only need one ingredient to use our function: “temp”. After we
+list our ingredients, we then say what we will do with them. In this
+case, we take our ingredient and apply a set of mathematical
+operators to it.
+
When we call the function, the values we pass to it as arguments are
+assigned to those variables so that we can use them inside the function.
+Inside the function, we use a return statement to send a
+result back to whoever asked for it.
+
+
+
+
+
+
Tip
+
+
+
In R, the return statement is not required.
+R automatically returns the value of the last expression evaluated in the
+body of the function. But for clarity, we will explicitly define the
+return statement.
+
+
+
+
Let’s try running our function. Calling our own function is no
+different from calling any other function:
+
+
R
+
+
+# freezing point of water
+fahr_to_kelvin(32)
+
+
+
OUTPUT
+
+
[1] 273.15
+
+
+
R
+
+
+# boiling point of water
+fahr_to_kelvin(212)
+
+
+
OUTPUT
+
+
[1] 373.15
+
+
+
+
+
+
+
Challenge 1
+
+
+
Write a function called kelvin_to_celsius() that takes a
+temperature in Kelvin and returns that temperature in Celsius.
+
Hint: To convert from Kelvin to Celsius you subtract 273.15
+
+
+
+
+
+
+
+
+
Write a function called kelvin_to_celsius that takes a
+temperature in Kelvin and returns that temperature in Celsius
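One possible solution, following the hint (the intermediate name celsius is a choice made here):

```r
kelvin_to_celsius <- function(temp) {
  # Celsius is Kelvin shifted down by 273.15
  celsius <- temp - 273.15
  return(celsius)
}

kelvin_to_celsius(273.15)  # freezing point of water
```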
Now that we’ve begun to appreciate how writing functions provides an
+efficient way to make R code re-usable and modular, we should note that
+it is important to ensure that functions only work in their intended
+use-cases. Checking function parameters is related to the concept of
+defensive programming. Defensive programming encourages us to
+frequently check conditions and throw an error if something is wrong.
+These checks are referred to as assertion statements because we want to
+assert some condition is TRUE before proceeding. They make
+it easier to debug because they give us a better idea of where the
+errors originate.
+
+
Checking conditions with stopifnot()
+
+
+
Let’s start by re-examining fahr_to_kelvin(), our
+function for converting temperatures from Fahrenheit to Kelvin. It was
+defined like so:
For this function to work as intended, the argument temp
+must be a numeric value; otherwise, the mathematical
+procedure for converting between the two temperature scales will not
+work. To create an error, we can use the function stop().
+For example, since the argument temp must be a
+numeric vector, we could check for this condition with an
+if statement and throw an error if the condition was
+violated. We could augment our function above like so:
+
+
R
+
+
+fahr_to_kelvin <- function(temp) {
+  if (!is.numeric(temp)) {
+    stop("temp must be a numeric vector.")
+  }
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+
If we had multiple conditions or arguments to check, it would take
+many lines of code to check all of them. Luckily R provides the
+convenience function stopifnot(). We can list as many
+requirements that should evaluate to TRUE;
+stopifnot() throws an error if it finds one that is
+FALSE. Listing these conditions also serves a secondary
+purpose as extra documentation for the function.
+
Let’s try out defensive programming with stopifnot() by
+adding assertions to check the input to our function
+fahr_to_kelvin().
+
We want to assert the following: temp is a numeric
+vector. We may do that like so:
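A sketch of that version, in which a single stopifnot() call replaces the if/stop() pair shown earlier. The assertion is.numeric(temp) matches the error message printed further down.

```r
fahr_to_kelvin <- function(temp) {
  # Stop with an error unless temp is numeric
  stopifnot(is.numeric(temp))
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}
```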
It still works when given proper input.
+
+
R
+
+
+# freezing point of water
+fahr_to_kelvin(temp = 32)
+
+
+
OUTPUT
+
+
[1] 273.15
+
+
But fails instantly if given improper input.
+
+
R
+
+
+# temp is a factor instead of numeric
+fahr_to_kelvin(temp = as.factor(32))
+
+
+
ERROR
+
+
Error in fahr_to_kelvin(temp = as.factor(32)): is.numeric(temp) is not TRUE
+
+
+
+
+
+
+
Challenge 3
+
+
+
Use defensive programming to ensure that our
+fahr_to_celsius() function throws an error immediately if
+the argument temp is specified inappropriately.
+
+
+
+
+
+
+
+
+
Extend our previous definition of the function by adding in an
+explicit call to stopifnot(). Since
+fahr_to_celsius() is a composition of two other functions,
+checking inside here makes adding checks to the two component functions
+redundant.
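A sketch of what this could look like. The definition of fahr_to_celsius() is not shown in this section, so the composition below (Fahrenheit to Kelvin, then Kelvin to Celsius) and the intermediate names temp_k and result are assumptions.

```r
fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

kelvin_to_celsius <- function(temp) {
  celsius <- temp - 273.15
  return(celsius)
}

# Assumed composition: Fahrenheit -> Kelvin -> Celsius
fahr_to_celsius <- function(temp) {
  # One check here guards both component functions
  stopifnot(is.numeric(temp))
  temp_k <- fahr_to_kelvin(temp)
  result <- kelvin_to_celsius(temp_k)
  return(result)
}

fahr_to_celsius(212)  # boiling point of water
```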
Now, we’re going to define a function that calculates the Gross
+Domestic Product of a nation from the data available in our dataset:
+
+
R
+
+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat) {
+  gdp <- dat$pop * dat$gdpPercap
+  return(gdp)
+}
+
+
We define calcGDP() by assigning it to the output of
+function. The list of argument names is contained within
+parentheses. Next, the body of the function (the statements executed
+when you call the function) is contained within curly braces
+({}).
+
We’ve indented the statements in the body by two spaces. This makes
+the code easier to read but does not affect how it operates.
+
When we call the function, the values we pass to it are assigned to
+the arguments, which become variables inside the body of the
+function.
+
Inside the function, we use the return() function to
+send back the result. This return() function is optional: R
+will automatically return the results of whatever command is executed on
+the last line of the function.
That’s not very informative. Let’s add some more arguments so we can
+extract that per year and country.
+
+
R
+
+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year = NULL, country = NULL) {
+  if (!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country, ]
+  }
+  gdp <- dat$pop * dat$gdpPercap
+
+  new <- cbind(dat, gdp = gdp)
+  return(new)
+}
+
+
If you’ve been writing these functions down into a separate R script
+(a good idea!), you can load the functions into your R session by
+using the source() function:
+
+
R
+
+
+source("functions/functions-lesson.R")
+
+
Ok, so there’s a lot going on in this function now. In plain English,
+the function now subsets the provided data by year if the year argument
+isn’t empty, then subsets the result by country if the country argument
+isn’t empty. Then it calculates the GDP for whatever subset emerges from
+the previous two steps. The function then adds the GDP as a new column
+to the subsetted data and returns this as the final result. You can see
+that the output is much more informative than a vector of numbers.
+
Let’s take a look at what happens when we specify the year:
+
+
R
+
+
+head(calcGDP(gapminder, year=2007))
+
+
+
OUTPUT
+
+
country year pop continent lifeExp gdpPercap gdp
+12 Afghanistan 2007 31889923 Asia 43.828 974.5803 31079291949
+24 Albania 2007 3600523 Europe 76.423 5937.0295 21376411360
+36 Algeria 2007 33333216 Africa 72.301 6223.3675 207444851958
+48 Angola 2007 12420476 Africa 42.731 4797.2313 59583895818
+60 Argentina 2007 40301927 Americas 75.320 12779.3796 515033625357
+72 Australia 2007 20434176 Oceania 81.235 34435.3674 703658358894
+
+
Or for a specific country:
+
+
R
+
+
+calcGDP(gapminder, country="Australia")
+
+
+
OUTPUT
+
+
country year pop continent lifeExp gdpPercap gdp
+61 Australia 1952 8691212 Oceania 69.120 10039.60 87256254102
+62 Australia 1957 9712569 Oceania 70.330 10949.65 106349227169
+63 Australia 1962 10794968 Oceania 70.930 12217.23 131884573002
+64 Australia 1967 11872264 Oceania 71.100 14526.12 172457986742
+65 Australia 1972 13177000 Oceania 71.930 16788.63 221223770658
+66 Australia 1977 14074100 Oceania 73.490 18334.20 258037329175
+67 Australia 1982 15184200 Oceania 74.740 19477.01 295742804309
+68 Australia 1987 16257249 Oceania 76.320 21888.89 355853119294
+69 Australia 1992 17481977 Oceania 77.560 23424.77 409511234952
+70 Australia 1997 18565243 Oceania 78.830 26997.94 501223252921
+71 Australia 2002 19546792 Oceania 80.370 30687.75 599847158654
+72 Australia 2007 20434176 Oceania 81.235 34435.37 703658358894
Here we’ve added two arguments, year and
+country. We’ve set default arguments for both as
+NULL using the = operator in the function
+definition. This means that those arguments will take on those values
+unless the user specifies otherwise.
Here, we check whether each additional argument is set to
+NULL, and whenever one is not NULL we overwrite
+the dataset stored in dat with the subset selected by that
+argument.
+
Building these conditionals into the function makes it more flexible
+for later. Now, we can use it to calculate the GDP for:
+
+
The whole dataset;
+
A single year;
+
A single country;
+
A single combination of year and country.
+
+
Because the function compares with %in% rather than ==, we can also
+give multiple years or countries to those arguments.
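The difference between the two operators can be seen on a small vector. This is an illustrative sketch; the vector years and the result names are made up for the example.

```r
years <- c(1952, 1957, 1962, 1967)

# == compares every element against a single value
years == 1957

# %in% asks, for each element, whether it appears anywhere in the set
in_set <- years %in% c(1967, 1952)
in_set

# == with a longer right-hand side recycles it position by position,
# so here it compares 1952 with 1967, 1957 with 1952, and so on
eq_recycled <- years == c(1967, 1952)
eq_recycled
```

With %in%, the first and last years are selected regardless of the order of the set. With ==, none are, because no element happens to line up with its recycled partner.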
+
+
+
+
+
+
Tip: Pass by value
+
+
+
Functions in R almost always make copies of the data to operate on
+inside of a function body. When we modify dat inside the
+function we are modifying the copy of the gapminder dataset stored in
+dat, not the original variable we gave as the first
+argument.
+
This is called “pass-by-value” and it makes writing code much safer:
+you can always be sure that whatever changes you make within the body of
+the function, stay inside the body of the function.
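A small sketch of this behaviour, using a hypothetical helper add_one() that changes its argument inside the function body:

```r
# Hypothetical helper that modifies its argument inside the function
add_one <- function(v) {
  v[1] <- v[1] + 1   # changes the local copy only
  return(v)
}

original <- c(10, 20, 30)
modified <- add_one(original)

original   # still 10 20 30
modified   # 11 20 30
```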
+
+
+
+
+
+
+
+
+
Tip: Function scope
+
+
+
Another important concept is scoping: any variables (or functions!)
+you create or modify inside the body of a function only exist for the
+lifetime of the function’s execution. When we call
+calcGDP(), the variables dat, gdp
+and new only exist inside the body of the function. Even if
+we have variables of the same name in our interactive R session, they
+are not modified in any way when executing a function.
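The same idea in a short sketch. The function compute_gdp() and the values passed to it are hypothetical; the point is that the session-level gdp is untouched by the gdp created inside the function.

```r
gdp <- "defined in the interactive session"

# Hypothetical function that creates its own variable called gdp
compute_gdp <- function(pop, gdpPercap) {
  gdp <- pop * gdpPercap   # local to this call
  return(gdp)
}

result <- compute_gdp(pop = 1000, gdpPercap = 2)
gdp      # unchanged: still the character string
result   # 2000
```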
+
+
+
+
+
R
+
+
gdp <- dat$pop * dat$gdpPercap
+new <- cbind(dat, gdp = gdp)
+return(new)
+}
+
+
Finally, we calculated the GDP on our new subset, and created a new
+data frame with that column added. This means when we call the function
+later we can see the context for the returned GDP values, which is much
+better than in our first attempt where we got a vector of numbers.
+
+
+
+
+
+
Challenge 4
+
+
+
Test out your GDP function by calculating the GDP for New Zealand in
+1987. How does this differ from New Zealand’s GDP in 1952?
+
+
+
+
+
+
+
+
+
+
R
+
+
+calcGDP(gapminder, year = c(1952, 1987), country = "New Zealand")
+
+
GDP for New Zealand in 1987: 65050008703
+
GDP for New Zealand in 1952: 21058193787
+
+
+
+
+
+
+
+
+
+
Challenge 5
+
+
+
The paste() function can be used to combine text
+together, e.g:
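A short sketch. The contents of best_practice are an assumption, inferred from the challenge output shown further down.

```r
best_practice <- c("Write", "programs", "for", "people", "not", "computers")

# collapse joins the elements of one vector into a single string
paste(best_practice, collapse = " ")
```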
Write a function called fence() that takes two vectors
+as arguments, called text and wrapper, and
+prints out the text wrapped with the wrapper:
+
+
R
+
+
+fence(text=best_practice, wrapper="***")
+
+
Note: the paste() function has an argument
+called sep, which specifies the separator between text. The
+default is a space: " ". The default for paste0() is no
+space: "".
+
+
+
+
+
+
+
+
+
Write a function called fence() that takes two vectors
+as arguments, called text and wrapper, and
+prints out the text wrapped with the wrapper:
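One possible solution is sketched here. It assumes best_practice holds the six words used in the paste() example, and it returns the wrapped string so that R prints it at the console.

```r
fence <- function(text, wrapper) {
  # Put the wrapper before and after the text, then join with spaces
  text <- c(wrapper, text, wrapper)
  result <- paste(text, collapse = " ")
  return(result)
}

best_practice <- c("Write", "programs", "for", "people", "not", "computers")
fence(text = best_practice, wrapper = "***")
```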
[1] "*** Write programs for people not computers ***"
+
+
+
+
+
+
+
+
+
+
+
Tip
+
+
+
R has some unique aspects that can be exploited when performing more
+complicated operations. We will not be writing anything that requires
+knowledge of these more advanced concepts. In the future when you are
+comfortable writing functions in R, you can learn more by reading the R
+Language Manual or this chapter from Advanced R Programming by Hadley
+Wickham.
+
+
+
+
+
+
+
+
+
Tip: Testing and documenting
+
+
+
It’s important to both test functions and document them:
+documentation helps you, and others, understand what the purpose of your
+function is and how to use it, and testing makes sure that your
+function actually does what you think it does.
+
When you first start out, your workflow will probably look a lot like
+this:
+
+
Write a function
+
Comment parts of the function to document its behaviour
+
Load in the source file
+
Experiment with it in the console to make sure it behaves as you
+expect
+
Make any necessary bug fixes
+
Rinse and repeat.
+
+
Formal documentation for functions, written in separate
+.Rd files, gets turned into the documentation you see in
+help files. The roxygen2
+package allows R coders to write documentation alongside the function
+code and then process it into the appropriate .Rd files.
+You will want to switch to this more formal method of writing
+documentation when you start writing more complicated R projects. In
+fact, packages are, in essence, bundles of functions with this formal
+documentation. Loading your own functions through
+source("functions.R") is similar to loading someone
+else’s functions (or your own one day!) through
+library("package").
+
Formal automated tests can be written using the testthat package.
+
+
+
+
+
+
+
+
+
Keypoints
+
+
+
+
Use function to define a new function in R.
+
Use parameters to pass values into functions.
+
Use stopifnot() to flexibly check function arguments in
+R.
You have already seen how to save the most recent plot you create in
+ggplot2, using the command ggsave. As a
+refresher:
+
+
R
+
+
+ggsave("My_most_recent_plot.pdf")
+
+
You can save a plot from within RStudio using the ‘Export’ button in
+the ‘Plot’ window. This will give you the option of saving as a .pdf or
+as .png, .jpg or other image formats.
+
Sometimes you will want to save plots without creating them in the
+‘Plot’ window first. Perhaps you want to make a pdf document with
+multiple pages: each one a different plot, for example. Or perhaps
+you’re looping through multiple subsets of a file, plotting data from
+each subset, and you want to save each plot, but obviously can’t stop
+the loop to click ‘Export’ for each one.
+
In this case you can use a more flexible approach. The function
+pdf creates a new pdf device. You can control the page size
+using the width and height arguments to this function.
+
+
R
+
+
+pdf("Life_Exp_vs_time.pdf", width = 12, height = 4)
+ggplot(data = gapminder, aes(x = year, y = lifeExp, colour = country)) +
+  geom_line() +
+  theme(legend.position = "none")
+
+# You then have to make sure to turn off the pdf device!
+
+dev.off()
+
+
Open up this document and have a look.
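The looping scenario described above can be sketched with base graphics so that the block runs without extra packages. The data in groups are made up for illustration, and the file is written to a temporary path chosen by tempfile().

```r
# Illustrative data: three groups of made-up measurements
groups <- list(a = c(1, 3, 2), b = c(2, 2, 5), c = c(4, 1, 3))

out_file <- tempfile(fileext = ".pdf")
pdf(out_file, width = 6, height = 4)

# Each call to plot() starts a new page in the open pdf device
for (name in names(groups)) {
  plot(groups[[name]], type = "b", main = paste("Group", name))
}

# Close the device so the file is written to disk
dev.off()

file.exists(out_file)
```

With ggplot2, a plot created inside a loop needs an explicit print() call, because automatic printing only happens at the top level of the console.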
+
+
+
+
+
+
Challenge 1
+
+
+
Rewrite your ‘pdf’ command to print a second page in the pdf, showing
+a facet plot (hint: use facet_grid) of the same data with
+one panel per continent.
How can I do different calculations on different sets of data?
+
+
+
+
+
+
+
+
Objectives
+
+
To be able to use the split-apply-combine strategy for data
+analysis.
+
+
+
+
+
+
+
Previously we looked at how you can use functions to simplify your
+code. We defined the calcGDP function, which takes the
+gapminder dataset, and multiplies the population and GDP per capita
+column. We also defined additional arguments so we could filter by
+year and country:
+
+
R
+
+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year = NULL, country = NULL) {
+  if (!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country, ]
+  }
+  gdp <- dat$pop * dat$gdpPercap
+
+  new <- cbind(dat, gdp = gdp)
+  return(new)
+}
+
+
A common task you’ll encounter when working with data is that you’ll
+want to run calculations on different groups within the data. In the
+above, we were calculating the GDP by multiplying two columns together.
+But what if we wanted to calculate the mean GDP per continent?
+
We could run calcGDP and then take the mean of each
+continent:
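A sketch of that approach. To stay self-contained it uses a four-row stand-in for gapminder, built from the 2007 rows shown in the earlier output, and a condensed calcGDP(). The name withGDP is chosen here, and the means will differ from those of the full dataset.

```r
# Stand-in for gapminder: four 2007 rows taken from the output shown earlier
gapminder <- data.frame(
  country   = c("Afghanistan", "Albania", "Algeria", "Angola"),
  year      = 2007,
  pop       = c(31889923, 3600523, 33333216, 12420476),
  continent = c("Asia", "Europe", "Africa", "Africa"),
  gdpPercap = c(974.5803, 5937.0295, 6223.3675, 4797.2313)
)

# Condensed version of calcGDP() from above
calcGDP <- function(dat, year = NULL, country = NULL) {
  if (!is.null(year)) dat <- dat[dat$year %in% year, ]
  if (!is.null(country)) dat <- dat[dat$country %in% country, ]
  cbind(dat, gdp = dat$pop * dat$gdpPercap)
}

withGDP <- calcGDP(gapminder)

# One line per continent: the same expression repeated with a new name
mean(withGDP[withGDP$continent == "Africa", "gdp"])
mean(withGDP[withGDP$continent == "Asia", "gdp"])
mean(withGDP[withGDP$continent == "Europe", "gdp"])
```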
But this isn’t very nice. Yes, by using a function, you have
+reduced a substantial amount of repetition. That is
+nice. But there is still repetition. Repeating yourself will cost you
+time, both now and later, and potentially introduce some nasty bugs.
+
We could write a new function that is flexible like
+calcGDP, but this also takes a substantial amount of effort
+and testing to get right.
+
The abstract problem we’re encountering here is known as
+“split-apply-combine”:
+
We want to split our data into groups, in this case
+continents, apply some calculations on that group, then
+optionally combine the results together afterwards.
+
The plyr package
+
+
+
For those of you who have used R before, you might be familiar with
+the apply family of functions. While R’s built in functions
+do work, we’re going to introduce you to another method for solving the
+“split-apply-combine” problem. The plyr package provides a set of
+functions that we find more user friendly for solving this problem.
+
We installed this package in an earlier challenge. Let us load it
+now:
+
+
R
+
+
+library("plyr")
+
+
Plyr has functions for operating on lists,
+data.frames and arrays (matrices, or
+n-dimensional vectors). Each function performs three steps:
+
+
Split the input data into groups.
+
Apply a function to each group in turn.
+
Recombine the output into a single data object.
+
+
The functions are named based on the data structure they expect as
+input, and the data structure you want returned as output: [a]rray,
+[l]ist, or [d]ata.frame. The first letter corresponds to the input data
+structure, the second letter to the output data structure, and then the
+rest of the function is named “ply”.
+
This gives us 9 core functions **ply. There are an additional three
+functions which will only perform the split and apply steps, and not any
+combine step. They’re named by their input data type and represent null
+output by a _ (see table)
+
Note here that plyr’s use of “array” is different to R’s: an array in
+plyr can include a vector or matrix.
+
Each of the xxply functions (daply, ddply,
+llply, laply, …) has the same structure,
+with four key features:
+
+
R
+
+
+xxply(.data, .variables, .fun)
+
+
+
The first letter of the function name gives the input type and the
+second gives the output type.
+
.data - gives the data object to be processed
+
.variables - identifies the splitting variables
+
.fun - gives the function to be called on each piece
+
+
Now we can quickly calculate the mean GDP per continent:
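The call described in the walkthrough below can be sketched as follows. The block assumes the plyr package is installed and uses a four-row stand-in for gapminder, so its numbers will not match the output printed in this lesson, which comes from the full dataset. The name mean_gdp is chosen here.

```r
library("plyr")

# Stand-in for gapminder: four 2007 rows taken from the output shown earlier
gapminder <- data.frame(
  country   = c("Afghanistan", "Albania", "Algeria", "Angola"),
  year      = 2007,
  pop       = c(31889923, 3600523, 33333216, 12420476),
  continent = c("Asia", "Europe", "Africa", "Africa"),
  gdpPercap = c(974.5803, 5937.0295, 6223.3675, 4797.2313)
)

# Condensed version of calcGDP() from above
calcGDP <- function(dat, year = NULL, country = NULL) {
  if (!is.null(year)) dat <- dat[dat$year %in% year, ]
  if (!is.null(country)) dat <- dat[dat$country %in% country, ]
  cbind(dat, gdp = dat$pop * dat$gdpPercap)
}

# Split by continent, apply mean() to the gdp column, combine into a data.frame
mean_gdp <- ddply(
  .data = calcGDP(gapminder),
  .variables = "continent",
  .fun = function(x) mean(x$gdp)
)
mean_gdp
```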
continent V1
+1 Africa 20904782844
+2 Americas 379262350210
+3 Asia 227233738153
+4 Europe 269442085301
+5 Oceania 188187105354
+
+
Let us walk through the previous code:
+
+
The ddply function feeds in a data.frame
+(function starts with d) and returns another
+data.frame (2nd letter is a d)
+
The first argument we gave was the data.frame we wanted to operate
+on: in this case the gapminder data. We called calcGDP on
+it first so that it would have the additional gdp column
+added to it.
+
The second argument indicated our split criteria: in this case the
+“continent” column. Note that we gave the name of the column, not the
+values of the column like we had done previously with subsetting. Plyr
+takes care of these implementation details for you.
+
The third argument is the function we want to apply to each grouping
+of the data. We had to define our own short function here: each subset
+of the data gets stored in x, the first argument of our
+function. This is an anonymous function: we haven’t defined it
+elsewhere, and it has no name. It only exists in the scope of our call
+to ddply.
+
+
+
+
+
+
+
Challenge 1
+
+
+
Calculate the average life expectancy per continent. Which has the
+longest? Which has the shortest?
year
+continent 1952 1957 1962 1967 1972
+ Africa 5992294608 7359188796 8784876958 11443994101 15072241974
+ Americas 117738997171 140817061264 169153069442 217867530844 268159178814
+ Asia 34095762661 47267432088 60136869012 84648519224 124385747313
+ Europe 84971341466 109989505140 138984693095 173366641137 218691462733
+ Oceania 54157223944 66826828013 82336453245 105958863585 134112109227
+ year
+continent 1977 1982 1987 1992 1997
+ Africa 18694898732 22040401045 24107264108 26256977719 30023173824
+ Americas 324085389022 363314008350 439447790357 489899820623 582693307146
+ Asia 159802590186 194429049919 241784763369 307100497486 387597655323
+ Europe 255367522034 279484077072 316507473546 342703247405 383606933833
+ Oceania 154707711162 176177151380 209451563998 236319179826 289304255183
+ year
+continent 2002 2007
+ Africa 35303511424 45778570846
+ Americas 661248623419 776723426068
+ Asia 458042336179 627513635079
+ Europe 436448815097 493183311052
+ Oceania 345236880176 403657044512
+
+
You can use these functions in place of for loops (and
+it is often more concise to do so). To replace a for loop, put the code that
+was in the body of the for loop inside an anonymous
+function.
+
+
R
+
+
+d_ply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = function(x) {
+    meanGDPperCap <- mean(x$gdpPercap)
+    print(paste(
+      "The mean GDP per capita for", unique(x$continent),
+      "is", format(meanGDPperCap, big.mark = ",")
+    ))
+  }
+)
+
+
+
OUTPUT
+
+
[1] "The mean GDP per capita for Africa is 2,193.755"
+[1] "The mean GDP per capita for Americas is 7,136.11"
+[1] "The mean GDP per capita for Asia is 7,902.15"
+[1] "The mean GDP per capita for Europe is 14,469.48"
+[1] "The mean GDP per capita for Oceania is 18,621.61"
+
+
+
+
+
+
+
Tip: printing numbers
+
+
+
The format function can be used to make numeric values
+“pretty” for printing out in messages.
+
+
+
+
+
+
+
+
+
Challenge 2
+
+
+
Calculate the average life expectancy per continent and year. Which
+had the longest and shortest in 2007? Which had the greatest change in
+between 1952 and 2007?
How can I manipulate data frames without repeating myself?
+
+
+
+
+
+
+
+
Objectives
+
+
To be able to use the six main data frame manipulation ‘verbs’ with
+pipes in dplyr.
+
To understand how group_by() and
+summarize() can be combined to summarize datasets.
+
Be able to analyze a subset of data using logical filtering.
+
+
+
+
+
+
+
Manipulation of data frames means many things to many researchers: we
+often select certain observations (rows) or variables (columns), we
+often group the data by a certain variable(s), or we even calculate
+summary statistics. We can do these operations using the normal base R
+operations:
But this isn’t very nice because there is a fair bit of
+repetition. Repeating yourself will cost you time, both now and later,
+and potentially introduce some nasty bugs.
+
The dplyr package
+
+
+
Luckily, the dplyr
+package provides a number of very useful functions for manipulating data
+frames in a way that will reduce the above repetition, reduce the
+probability of making errors, and probably even save you some typing. As
+an added bonus, you might even find the dplyr grammar
+easier to read.
+
+
+
+
+
+
Tip: Tidyverse
+
+
+
dplyr package belongs to a broader family of opinionated
+R packages designed for data science called the “Tidyverse”. These
+packages are specifically designed to work harmoniously together. Some
+of these packages will be covered along this course, but you can find
+more complete information here: https://www.tidyverse.org/.
+
+
+
+
Here we’re going to cover 5 of the most commonly used functions as
+well as using pipes (%>%) to combine them.
+
+
select()
+
filter()
+
group_by()
+
summarize()
+
mutate()
+
+
If you have not installed this package earlier, please do
+so:
+
+
R
+
+
+install.packages('dplyr')
+
+
Now let’s load the package:
+
+
R
+
+
+library("dplyr")
+
+
Using select()
+
+
+
If, for example, we wanted to move forward with only a few of the
+variables in our data frame we could use the select()
+function. This will keep only the variables you select.
If we open up year_country_gdp we’ll see that it only
+contains the year, country and gdpPercap. Above we used ‘normal’
+grammar, but the strengths of dplyr lie in combining
+several functions using pipes. Since the pipes grammar is unlike
+anything we’ve seen in R before, let’s repeat what we’ve done above
+using pipes.
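Both forms are sketched here side by side. The block assumes dplyr is installed and uses a small stand-in for gapminder built from rows shown earlier in this lesson.

```r
library("dplyr")

# Stand-in for gapminder: three 2007 rows taken from earlier output
gapminder <- data.frame(
  country   = c("Afghanistan", "Albania", "Algeria"),
  year      = 2007,
  pop       = c(31889923, 3600523, 33333216),
  continent = c("Asia", "Europe", "Africa"),
  lifeExp   = c(43.828, 76.423, 72.301),
  gdpPercap = c(974.5803, 5937.0295, 6223.3675)
)

# Without pipes: the data frame is the first argument
year_country_gdp <- select(gapminder, year, country, gdpPercap)

# With pipes: the data frame is passed in from the left
year_country_gdp <- gapminder %>% select(year, country, gdpPercap)
year_country_gdp
```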
To help you understand why we wrote that in that way, let’s walk
+through it step by step. First we summon the gapminder data frame and
+pass it on, using the pipe symbol %>%, to the next step,
+which is the select() function. In this case we don’t
+specify which data object we use in the select() function
+since it gets that from the previous pipe.
+
Fun fact: there is a good chance you have encountered pipes before in the shell.
+In R, a pipe symbol is %>% while in the shell it is
+| but the concept is the same!
+
+
+
+
+
+
Tip: Renaming data frame columns in dplyr
+
+
+
In Chapter 4 we covered how you can rename columns with base R by
+assigning a value to the output of the names() function.
+Just like select, this is a bit cumbersome, but thankfully dplyr has a
+rename() function.
+
Within a pipeline, the syntax is
+rename(new_name = old_name). For example, we may want to
+rename the gdpPercap column name from our select()
+statement above.
Write a single command (which can span multiple lines and includes
+pipes) that will produce a data frame that has the African values for
+lifeExp, country and year, but
+not for other Continents. How many rows does your data frame have and
+why?
As with last time, first we pass the gapminder data frame to the
+filter() function, then we pass the filtered version of the
+gapminder data frame to the select() function.
+Note: The order of operations is very important in this
+case. If we used ‘select’ first, filter would not be able to find the
+variable continent since we would have removed it in the previous
+step.
+
Using group_by()
+
+
+
Now, we were supposed to be reducing the error prone repetitiveness
+of what can be done with base R, but up to now we haven’t done that
+since we would have to repeat the above for each continent. Instead of
+filter(), which will only pass observations that meet your
+criteria (in the above: continent=="Europe"), we can use
+group_by(), which will essentially use every unique
+criterion that you could have used in filter().
+
+
R
+
+
+str(gapminder)
+
+
+
OUTPUT
+
+
'data.frame': 1704 obs. of 6 variables:
+ $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp : num 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num 779 821 853 836 740 ...
You will notice that the structure of the data frame where we used
+group_by() (grouped_df) is not the same as the
+original gapminder (data.frame). A
+grouped_df can be thought of as a list where
+each item in the list is a data.frame which
+contains only the rows that correspond to a particular value of
+continent (at least in the example above).
+
Using summarize()
+
+
+
The above was a bit on the uneventful side but
+group_by() is much more exciting in conjunction with
+summarize(). This will allow us to create new variable(s)
+by using functions that repeat for each of the continent-specific data
+frames. That is to say, using the group_by() function, we
+split our original data frame into multiple pieces, then we can run
+functions (e.g. mean() or sd()) within
+summarize().
# A tibble: 2 × 2
+ country mean_lifeExp
+ <chr> <dbl>
+1 Iceland 76.5
+2 Sierra Leone 36.8
+
+
Another way to do this is to use the dplyr function
+arrange(), which arranges the rows in a data frame
+according to the order of one or more variables from the data frame. It
+has similar syntax to other functions from the dplyr
+package. You can use desc() inside arrange()
+to sort in descending order.
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
count() and n()
+
+
+
A very common operation is to count the number of observations for
+each group. The dplyr package comes with two related
+functions that help with this.
+
For instance, if we wanted to check the number of countries included
+in the dataset for the year 2002, we can use the count()
+function. It takes the name of one or more columns that contain the
+groups we are interested in, and we can optionally sort the results in
+descending order by adding sort=TRUE:
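A sketch of such a call, assuming dplyr is installed. The stand-in data frame only has several continents in 2007, so the sketch filters on 2007 rather than 2002; the name continent_counts is chosen here.

```r
library("dplyr")

# Stand-in for gapminder, built from rows shown earlier in this lesson
gapminder <- data.frame(
  country   = c("Afghanistan", "Albania", "Algeria", "Angola",
                "Australia", "Australia"),
  year      = c(2007, 2007, 2007, 2007, 2002, 2007),
  continent = c("Asia", "Europe", "Africa", "Africa", "Oceania", "Oceania"),
  lifeExp   = c(43.828, 76.423, 72.301, 42.731, 80.370, 81.235)
)

# Keep one year, then count rows per continent, largest group first
continent_counts <- gapminder %>%
  filter(year == 2007) %>%
  count(continent, sort = TRUE)
continent_counts
```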
continent n
+1 Africa 52
+2 Asia 33
+3 Europe 30
+4 Americas 25
+5 Oceania 2
+
+
If we need to use the number of observations in calculations, the
+n() function is useful. It will return the total number of
+observations in the current group rather than counting the number of
+observations in each group within a specific column. For instance, if we
+wanted to get the standard error of the life expectancy per
+continent:
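A sketch of that calculation, assuming dplyr is installed. The column name se_le matches the output shown below; the four-row stand-in for gapminder means the values differ from those of the full dataset, and se_by_continent is a name chosen here.

```r
library("dplyr")

# Stand-in for gapminder: two rows per continent, from earlier output
gapminder <- data.frame(
  country   = c("Algeria", "Angola", "Australia", "Australia"),
  year      = c(2007, 2007, 2002, 2007),
  continent = c("Africa", "Africa", "Oceania", "Oceania"),
  lifeExp   = c(72.301, 42.731, 80.370, 81.235)
)

# Standard error = standard deviation / sqrt(number of observations);
# n() supplies the number of rows in the current group
se_by_continent <- gapminder %>%
  group_by(continent) %>%
  summarize(se_le = sd(lifeExp) / sqrt(n()))
se_by_continent
```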
# A tibble: 5 × 2
+ continent se_le
+ <chr> <dbl>
+1 Africa 0.366
+2 Americas 0.540
+3 Asia 0.596
+4 Europe 0.286
+5 Oceania 0.775
+
+
You can also chain together several summary operations; in this case
+calculating the minimum, maximum,
+mean and se of each continent’s per-country
+life-expectancy:
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
Connect mutate with logical filtering: ifelse
+
+
+
When creating new variables, we can combine this with a logical
+condition. A simple combination of mutate() and
+ifelse() facilitates filtering right where it is needed: in
+the moment of creating something new. This easy-to-read statement is a
+fast and powerful way of discarding certain data (even though the
+overall dimension of the data frame will not change) or of updating
+values depending on the given condition.
+
+
R
+
+
+## keeping all data but "filtering" on a certain condition
+# calculate GDP only for observations with a life expectancy above 25
+gdp_pop_bycontinents_byyear_above25 <- gapminder %>%
+  mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>%
+  group_by(continent, year) %>%
+  summarize(mean_gdpPercap = mean(gdpPercap),
+            sd_gdpPercap = sd(gdpPercap),
+            mean_pop = mean(pop),
+            sd_pop = sd(pop),
+            mean_gdp_billion = mean(gdp_billion),
+            sd_gdp_billion = sd(gdp_billion))
+
+
+
OUTPUT
+
+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
+
R
+
+
+## updating only if a certain condition is fulfilled
+# for life expectancies above 40 years, the GDP to be expected in the future is scaled
+gdp_future_bycontinents_byyear_high_lifeExp <- gapminder %>%
+  mutate(gdp_futureExpectation = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>%
+  group_by(continent, year) %>%
+  summarize(mean_gdpPercap = mean(gdpPercap),
+            mean_gdpPercap_expected = mean(gdp_futureExpectation))
+
+
+
OUTPUT
+
+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
Combining dplyr and ggplot2
+
+
+
First install and load ggplot2:
+
+
R
+
+
+install.packages('ggplot2')
+
+
+
R
+
+
+library("ggplot2")
+
+
In the plotting lesson we looked at how to make a multi-panel figure
+by adding a layer of facet panels using ggplot2. Here is
+the code we used (with some extra comments):
+
+
R
+
+
+# Filter countries located in the Americas
+americas <- gapminder[gapminder$continent == "Americas", ]
+# Make the plot
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap(~ country) +
+  theme(axis.text.x = element_text(angle = 45))
+
+
This code makes the right plot but it also creates an intermediate
+variable (americas) that we might not have any other uses
+for. Just as we used %>% to pipe data along a chain of
+dplyr functions we can use it to pass data to
+ggplot(). Because %>% passes the data on as the first
+argument of the next function, we don’t need to specify the data =
+argument in the ggplot() function. By combining
+dplyr and ggplot2 functions we can make the
+same figure without creating any new variables or modifying the
+data.
+
+
R
+
+
+gapminder %>%
+  # Filter countries located in the Americas
+  filter(continent == "Americas") %>%
+  # Make the plot
+  ggplot(mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap(~ country) +
+  theme(axis.text.x = element_text(angle = 45))
+
+
More examples of using the function mutate() and the
+ggplot2 package.
+
+
R
+
+
+gapminder %>%
+  # extract first letter of country name into new column
+  mutate(startsWith = substr(country, 1, 1)) %>%
+  # only keep countries starting with A or Z
+  filter(startsWith %in% c("A", "Z")) %>%
+  # plot lifeExp into facets
+  ggplot(aes(x = year, y = lifeExp, colour = continent)) +
+  geom_line() +
+  facet_wrap(vars(country)) +
+  theme_minimal()
+
+
+
+
+
+
+
Advanced Challenge
+
+
+
Calculate the average life expectancy in 2002 of 2 randomly selected
+countries for each continent. Then arrange the continent names in
+reverse order. Hint: Use the dplyr
+functions arrange() and sample_n(); they have
+similar syntax to other dplyr functions.
To understand the concepts of ‘longer’ and ‘wider’ data frame
+formats and be able to convert between them with
+tidyr.
+
+
+
+
+
+
+
Researchers often want to reshape their data frames from ‘wide’ to
+‘longer’ layouts, or vice-versa. The ‘long’ layout or format is
+where:
+
+
each column is a variable
+
each row is an observation
+
+
In the purely ‘long’ (or ‘longest’) format, you usually have 1 column
+for the observed variable and the other columns are ID variables.
+
For the ‘wide’ format each row is often a site/subject/patient and
+you have multiple observation variables containing the same type of
+data. These can be either repeated observations over time, or
+observation of multiple variables (or a mix of both). You may find data
+input may be simpler or some other applications may prefer the ‘wide’
+format. However, many of R’s functions have been designed
+assuming you have ‘longer’ formatted data. This tutorial will help you
+efficiently transform your data shape regardless of original format.
+
Long and wide data frame layouts mainly affect readability. For
+humans, the wide format is often more intuitive since we can often see
+more of the data on the screen due to its shape. However, the long
+format is more machine readable and is closer to the formatting of
+databases. The ID variables in our data frames are similar to the fields
+in a database and observed variables are like the database values.
+
Getting started
+
+
+
First install the packages if you haven’t already done so (you
+probably installed dplyr in the previous lesson):
First, let’s look at the structure of our original gapminder data
+frame:
+
+
R
+
+
+str(gapminder)
+
+
+
OUTPUT
+
+
'data.frame': 1704 obs. of 6 variables:
+ $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp : num 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num 779 821 853 836 740 ...
+
+
+
+
+
+
+
Challenge 1
+
+
+
Is gapminder a purely long, purely wide, or some intermediate
+format?
+
+
+
+
+
+
+
+
+
The original gapminder data.frame is in an intermediate format. It is
+not purely long since it has multiple observation variables
+(pop, lifeExp, gdpPercap).
+
+
+
+
+
Sometimes, as with the gapminder dataset, we have multiple types of
+observed data. It is somewhere in between the purely ‘long’ and ‘wide’
+data formats. We have 3 “ID variables” (continent,
+country, year) and 3 “Observation variables”
+(pop,lifeExp,gdpPercap). This
+intermediate format can be preferred despite not having ALL observations
+in 1 column given that all 3 observation variables have different units.
+There are few operations that would need us to make this data frame any
+longer (i.e. 4 ID variables and 1 Observation variable).
+
While using many of the functions in R, which are often vector based,
+you usually do not want to do mathematical operations on values with
+different units. For example, using the purely long format, a single
+mean for all of the values of population, life expectancy, and GDP would
+not be meaningful since it would return the mean of values with 3
+incompatible units. The solution is that we first manipulate the data
+either by grouping (see the lesson on dplyr), or we change
+the structure of the data frame. Note: Some plotting
+functions in R actually work better in the wide format data.
+
From wide to long format with pivot_longer()
+
+
+
Until now, we’ve been using the nicely formatted original gapminder
+dataset, but ‘real’ data (i.e. our own research data) will never be so
+well organized. Here let’s start with the wide formatted version of the
+gapminder dataset.
+
+
Download the wide version of the gapminder data from here and save it in your data
+folder.
+
+
We’ll load the data file and look at it. Note: we don’t want our
+continent and country columns to be factors, so we use the
+stringsAsFactors argument for read.csv() to disable
+that.
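The loading step itself is not shown here. The sketch below illustrates the idea with a two-row toy CSV written to a temporary file, so the file contents, the object name `wide_toy`, and the values are all invented. With the downloaded file you would pass its path in your data folder instead of `tmp`. Note that from R 4.0.0 onwards `stringsAsFactors = FALSE` is already the default, so the argument mainly matters on older versions.

```r
# Write a small toy CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "continent,country,pop_1952,pop_1957",
  "Continent X,Country A,100,110",
  "Continent Y,Country B,200,210"
), tmp)

# Keep text columns as character vectors rather than factors
wide_toy <- read.csv(tmp, stringsAsFactors = FALSE)

# str() shows continent and country as chr, not Factor
str(wide_toy)
```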
To change this very wide data frame layout back to our nice,
+intermediate (or longer) layout, we will use one of the two available
+pivot functions from the tidyr package. To
+convert from wide to a longer format, we will use the
+pivot_longer() function. pivot_longer() makes
+datasets longer by increasing the number of rows and decreasing the
+number of columns, or ‘lengthening’ your observation variables into a
+single variable.
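As an illustration of the lengthening step, here is a minimal sketch on a small invented wide data frame (`wide_toy`), not the lesson's actual file. The column names follow the `metric_year` pattern described in this lesson, and the new column names `obstype_year` and `obs_value` are the ones used in the surrounding text.

```r
library(dplyr)
library(tidyr)

# Toy wide data: one row per country, one column per metric/year combination
wide_toy <- data.frame(
  continent    = c("Continent X", "Continent Y"),
  country      = c("Country A", "Country B"),
  pop_1952     = c(100, 200),
  pop_1957     = c(110, 210),
  lifeExp_1952 = c(43, 63),
  lifeExp_1957 = c(45, 65)
)

# Stack every pop_* and lifeExp_* column into name/value pairs
long_toy <- wide_toy %>%
  pivot_longer(
    cols      = c(starts_with("pop"), starts_with("lifeExp")),
    names_to  = "obstype_year",
    values_to = "obs_value"
  )

long_toy
```

The four observation columns of each of the two rows become eight rows, while the two ID columns are repeated alongside them.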
Here we have used piping syntax which is similar to what we were
+doing in the previous lesson with dplyr. In fact, these are compatible
+and you can use a mix of tidyr and dplyr functions by piping them
+together.
+
We first provide to pivot_longer() a vector of column
+names that will be pivoted into longer format. We could type out all the
+observation variables, but as in the select() function (see
+dplyr lesson), we can use the starts_with()
+argument to select all variables that start with the desired character
+string. pivot_longer() also allows the alternative syntax
+of using the - symbol to identify which variables are not
+to be pivoted (i.e. ID variables).
+
The next arguments to pivot_longer() are
+names_to for naming the column that will contain the new ID
+variable (obstype_year) and values_to for
+naming the new amalgamated observation variable
+(obs_value). We supply these new column names as
+strings.
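The alternative `-` syntax mentioned above can be sketched in the same way. This example again uses an invented toy data frame (`wide_toy`); instead of listing the columns to pivot, it names the ID columns that should be left alone.

```r
library(dplyr)
library(tidyr)

# Toy wide data (invented values)
wide_toy <- data.frame(
  continent    = c("Continent X", "Continent Y"),
  country      = c("Country A", "Country B"),
  pop_1952     = c(100, 200),
  pop_1957     = c(110, 210),
  lifeExp_1952 = c(43, 63),
  lifeExp_1957 = c(45, 65)
)

# Pivot everything except the two ID variables
long_toy <- wide_toy %>%
  pivot_longer(
    cols      = -c(continent, country),
    names_to  = "obstype_year",
    values_to = "obs_value"
  )

long_toy
```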
That may seem trivial with this particular data frame, but sometimes
+you have 1 ID variable and 40 observation variables with irregular
+variable names. The flexibility is a huge time saver!
+
Now obstype_year actually contains 2 pieces of
+information, the observation type
+(pop, lifeExp, or gdpPercap) and
+the year. We can use the separate() function
+to split the character strings into multiple variables:
+
+
R
+
+
+gap_long <- gap_long %>%
+  separate(obstype_year, into = c('obs_type', 'year'), sep = "_")
+gap_long$year <- as.integer(gap_long$year)
+
+
+
+
+
+
+
Challenge 2
+
+
+
Using gap_long, calculate the mean life expectancy,
+population, and gdpPercap for each continent. Hint: use
+the group_by() and summarize() functions we
+learned in the dplyr lesson.
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
+
OUTPUT
+
+
# A tibble: 15 × 3
+# Groups: continent [5]
+ continent obs_type means
+ <chr> <chr> <dbl>
+ 1 Africa gdpPercap 2194.
+ 2 Africa lifeExp 48.9
+ 3 Africa pop 9916003.
+ 4 Americas gdpPercap 7136.
+ 5 Americas lifeExp 64.7
+ 6 Americas pop 24504795.
+ 7 Asia gdpPercap 7902.
+ 8 Asia lifeExp 60.1
+ 9 Asia pop 77038722.
+10 Europe gdpPercap 14469.
+11 Europe lifeExp 71.9
+12 Europe pop 17169765.
+13 Oceania gdpPercap 18622.
+14 Oceania lifeExp 74.3
+15 Oceania pop 8874672.
+
+
+
+
+
+
From long to intermediate format with pivot_wider()
+
+
+
It is always good to check your work. So, let’s use the second
+pivot function, pivot_wider(), to ‘widen’ our
+observation variables back out. pivot_wider() is the
+opposite of pivot_longer(), making a dataset wider by
+increasing the number of columns and decreasing the number of rows. We
+can use pivot_wider() to pivot or reshape our
+gap_long to the original intermediate format or the widest
+format. Let’s start with the intermediate format.
+
The pivot_wider() function takes names_from
+and values_from arguments.
+
To names_from we supply the column name whose contents
+will be pivoted into new output columns in the widened data frame. The
+corresponding values will be added from the column named in the
+values_from argument.
Now we’ve got an intermediate data frame gap_normal with
+the same dimensions as the original gapminder, but the
+order of the variables is different. Let’s fix that before checking if
+they are all.equal().
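The widening step and the comparison described above can be sketched as a small round trip. This is an illustration on invented toy data (`original`, `long_toy`, and `normal_toy` are hypothetical names), not the lesson's own code: the data are lengthened, widened again, put back into the original column and row order, and then compared.

```r
library(dplyr)
library(tidyr)

# Toy "intermediate" data: ID variables plus two observation variables
original <- data.frame(
  country = c("Country A", "Country A", "Country B", "Country B"),
  year    = c(1952L, 1957L, 1952L, 1957L),
  pop     = c(100, 110, 200, 210),
  lifeExp = c(43, 45, 63, 65)
)

# Lengthen: one row per country/year/observation type
long_toy <- original %>%
  pivot_longer(cols = c(pop, lifeExp),
               names_to = "obs_type", values_to = "obs_value")

# Widen again: obs_type values become column names, obs_value fills them
normal_toy <- long_toy %>%
  pivot_wider(names_from = obs_type, values_from = obs_value)

# Match the original column order and row order before comparing
normal_toy <- normal_toy %>%
  select(all_of(names(original))) %>%
  arrange(country, year)

# Compare as plain data frames
all.equal(as.data.frame(normal_toy), original)
```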
That’s great! We’ve gone from the longest format back to the
+intermediate and we didn’t introduce any errors in our code.
+
Now let’s convert the long all the way back to the wide. In the wide
+format, we will keep country and continent as ID variables and pivot the
+observations across the 3 metrics
+(pop,lifeExp,gdpPercap) and time
+(year). First we need to create appropriate labels for all
+our new variables (time*metric combinations) and we also need to unify
+our ID variables to simplify the process of defining
+gap_wide.
Using unite() we now have a single ID variable which is
+a combination of continent and country, and we have
+defined variable names. We’re now ready to pipe in
+pivot_wider().
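A minimal sketch of this uniting-then-widening pipeline, on invented toy data, might look as follows. The names `ID_var` and `var_names` are illustrative choices for the united columns, and `obs_value` follows the naming used in the text above.

```r
library(dplyr)
library(tidyr)

# Toy long data (invented values): two countries, two metrics, two years
long_toy <- data.frame(
  continent = rep(c("Continent X", "Continent Y"), each = 4),
  country   = rep(c("Country A", "Country B"), each = 4),
  obs_type  = rep(c("pop", "pop", "lifeExp", "lifeExp"), times = 2),
  year      = rep(c(1952L, 1957L), times = 4),
  obs_value = c(100, 110, 43, 45, 200, 210, 63, 65)
)

wide_toy <- long_toy %>%
  # Combine the two ID variables into one
  unite(ID_var, continent, country, sep = "_") %>%
  # Build the future column names from metric and year
  unite(var_names, obs_type, year, sep = "_") %>%
  # Spread the values across one column per metric/year combination
  pivot_wider(names_from = var_names, values_from = obs_value)

wide_toy
```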
Take this one step further and create a
+gap_ludicrously_wide format data frame by pivoting over
+countries, year and the 3 metrics. Hint: this new data
+frame should only have 5 rows.
Understand the value of writing reproducible reports
+
Learn how to recognise and compile the basic components of an R
+Markdown file
+
Become familiar with R code chunks, and understand their purpose,
+structure and options
+
Demonstrate the use of inline chunks for weaving R outputs into text
+blocks, for example when discussing the results of some
+calculations
+
Be aware of alternative output formats to which an R Markdown file
+can be exported
+
+
+
+
+
+
+
Data analysis reports
+
+
+
Data analysts tend to write a lot of reports, describing their
+analyses and results, for their collaborators or to document their work
+for future reference.
+
Many new users begin by first writing a single R script containing
+all of their work, and then share the analysis by emailing the script
+and various graphs as attachments. But this can be cumbersome, requiring
+a lengthy discussion to explain which attachment was which result.
+
Writing formal reports with Word or LaTeX can simplify this
+process by incorporating both the analysis report and output graphs into
+a single document. But tweaking formatting to make figures look correct
+and fixing obnoxious page breaks can be tedious and lead to a lengthy
+“whack-a-mole” game of fixing new mistakes resulting from a single
+formatting change.
+
Creating a report as a web page (which is an html file) using R
+Markdown makes things easier. The report can be one long stream, so tall
+figures that wouldn’t ordinarily fit on one page can be kept at full
+size and easier to read, since the reader can simply keep scrolling.
+Additionally, the formatting of an R Markdown document is simple and
+easy to modify, allowing you to spend more time on your analyses instead
+of writing reports.
+
Literate programming
+
+
+
Ideally, such analysis reports are reproducible documents:
+If an error is discovered, or if some additional subjects are added to
+the data, you can just re-compile the report and get the new or
+corrected results rather than having to reconstruct figures, paste them
+into a Word document, and hand-edit various detailed results.
+
The key R package here is knitr. It allows you
+to create a document that is a mixture of text and chunks of code. When
+the document is processed by knitr, chunks of code will be
+executed, and graphs or other results will be inserted into the final
+document.
+
This sort of idea has been called “literate programming”.
+
knitr allows you to mix basically any type of text with
+code from different programming languages, but we recommend that you use
+R Markdown, which mixes Markdown with R. Markdown is a light-weight
+mark-up language for creating web pages.
+
Creating an R Markdown file
+
+
+
Within RStudio, click File → New File → R Markdown and you’ll get a
+dialog box like this:
+
You can stick with the default (HTML output), but give it a
+title.
+
Basic components of R Markdown
+
+
+
The initial chunk of text (header) contains instructions for R to
+specify what kind of document will be created, and the options chosen.
+You can use the header to give your document a title, author, date, and
+tell it what type of output you want to produce. In this case, we’re
+creating an html document.
You can delete any of those fields if you don’t want them included.
+The double-quotes aren’t strictly necessary in this case.
+They’re mostly needed if you want to include a colon in the title.
+
RStudio creates the document with some example text to get you
+started. Note below that there are chunks like
+
+```{r}
+summary(cars)
+```
+
+
These are chunks of R code that will be executed by
+knitr and replaced by their results. More on this
+later.
+
Markdown
+
+
+
Markdown is a system for writing web pages by marking up the text
+much as you would in an email rather than writing html code. The
+marked-up text gets converted to html, replacing the marks with
+the proper html code.
+
For now, let’s delete all of the stuff that’s there and write a bit
+of markdown.
+
You make things bold using two asterisks, like this:
+**bold**, and you make things italics by using
+underscores, like this: _italics_.
+
You can make a bulleted list by writing a list with hyphens or
+asterisks, with a blank line between the list and other text, like this:
+
A list:
+
+* bold with double-asterisks
+* italics with underscores
+* code-type font with backticks
+
or like this:
+
A second list:
+
+- bold with double-asterisks
+- italics with underscores
+- code-type font with backticks
+
Each will appear as:
+
+
bold with double-asterisks
+
italics with underscores
+
code-type font with backticks
+
+
You can use whatever method you prefer, but be consistent.
+This maintains the readability of your code.
+
You can make a numbered list by just using numbers. You can even use
+the same number over and over if you want:
+
1. bold with double-asterisks
+1. italics with underscores
+1. code-type font with backticks
+
This will appear as:
+
+
bold with double-asterisks
+
italics with underscores
+
code-type font with backticks
+
+
You can make section headers of different sizes by initiating a line
+with some number of # symbols:
+
# Title
+## Main section
+### Sub-section
+#### Sub-sub section
+
You compile the R Markdown document to an html webpage by
+clicking the “Knit” button in the upper-left.
+
+
+
+
+
+
Challenge 1
+
+
+
Create a new R Markdown document. Delete all of the R code chunks and
+write a bit of Markdown (some sections, some italicized text, and an
+itemized list).
+
Convert the document to a webpage.
+
+
+
+
+
+
+
+
+
In RStudio, select File > New file > R Markdown…
+
Delete the placeholder text and add the following:
+
# Introduction
+
+## Background on Data
+
+This report uses the *gapminder* dataset, which has columns that include:
+
+* country
+* continent
+* year
+* lifeExp
+* pop
+* gdpPercap
+
+## Background on Methods
+
+
Then click the ‘Knit’ button on the toolbar to generate an html
+document (webpage).
+
+
+
+
+
A bit more Markdown
+
+
+
You can make a hyperlink like this:
+[Carpentries Home Page](https://carpentries.org/).
+
You can include an image file like this:
+![The Carpentries Logo](https://carpentries.org/assets/img/TheCarpentries.svg)
+
You can do subscripts (e.g., F₂) with F~2~
+and superscripts (e.g., F²) with F^2^.
+
If you know how to write equations in LaTeX, you can use
+$ $ and $$ $$ to insert math equations, like
+$E = mc^2$ and
+
$$y = \mu + \sum_{i=1}^p \beta_i x_i + \epsilon$$
+
You can review Markdown syntax by navigating to the “Markdown Quick
+Reference” under the “Help” field in the toolbar at the top of
+RStudio.
+
R code chunks
+
+
+
The real power of Markdown comes from mixing markdown with chunks of
+code. This is R Markdown. When processed, the R code will be executed;
+if it produces figures, the figures will be inserted in the final
+document.
+
The main code chunks look like this:
+
+```{r load_data}
+gapminder <- read.csv("data/gapminder_data.csv")  # assumed path; point this at your copy of the data
+```
+
That is, you place a chunk of R code between ```{r
+chunk_name} and ```. You should give each chunk a
+unique name: names help you locate errors and, if any graphs are
+produced, their file names are based on the name of the code chunk that
+produced them. You can create code chunks quickly in RStudio using the
+shortcuts Ctrl+Alt+I on Windows and
+Linux, or Cmd+Option+I on Mac.
+
+
+
+
+
+
Challenge 2
+
+
+
Add code chunks to:
+
+
Load the ggplot2 package
+
Read the gapminder data
+
Create a plot
+
+
+
+
+
+
+
+
+
+
+```{r load-ggplot2}
+library("ggplot2")
+```
+
+
+```{r read-gapminder-data}
+gapminder <- read.csv("data/gapminder_data.csv")  # assumed path; point this at your copy of the data
+```
+
+```{r make-plot}
+plot(lifeExp ~ year, data = gapminder)
+```
+
+
+
+
+
+
+
How things get compiled
+
+
+
When you press the “Knit” button, the R Markdown document is
+processed by knitr
+and a plain Markdown document is produced (as well as, potentially, a
+set of figure files): the R code is executed and replaced by both the
+input and the output; if figures are produced, links to those figures
+are included.
+
The Markdown and figure documents are then processed by the tool pandoc, which converts the
+Markdown file into an html file, with the figures embedded.
+
Chunk options
+
+
+
There are a variety of options to affect how the code chunks are
+treated. Here are some examples:
+
+
Use echo=FALSE to avoid having the code itself
+shown.
+
Use results="hide" to avoid having any results
+printed.
+
Use eval=FALSE to have the code shown but not
+evaluated.
+
Use warning=FALSE and message=FALSE to
+hide any warnings or messages produced.
+
Use fig.height and fig.width to control
+the size of the figures produced (in inches).
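Options that should apply to every chunk can also be set once, rather than repeated in each chunk header. The following is a sketch of one way to do that with `knitr::opts_chunk$set()`. The particular values are illustrative assumptions, and in an R Markdown file this code would normally sit in a chunk near the top of the document.

```r
library(knitr)

# Defaults applied to every later chunk in the document
opts_chunk$set(
  fig.path   = "Figs/",  # trailing slash: figures are written into a Figs/ folder
  echo       = FALSE,    # hide the code
  warning    = FALSE,    # hide warnings
  message    = FALSE,    # hide messages
  fig.width  = 8,        # figure width in inches (illustrative value)
  fig.height = 5         # figure height in inches (illustrative value)
)
```

Individual chunks can still override any of these defaults in their own header.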
The fig.path option defines where the figures will be
+saved. The / here is really important; without it, the
+figures would be saved in the standard place but just with names that
+begin with Figs.
+
If you have multiple R Markdown files in a common directory, you
+might want to use fig.path to define separate prefixes for
+the figure file names, like fig.path="Figs/cleaning-" and
+fig.path="Figs/analysis-".
+
+
+
+
+
+
Challenge 3
+
+
+
Use chunk options to control the size of a figure and to hide the
+code.
You can review all of the R chunk options by navigating
+to the “R Markdown Cheat Sheet” under the “Cheatsheets” section of the
+“Help” field in the toolbar at the top of RStudio.
+
Inline R code
+
+
+
You can make every number in your report reproducible. Use
+`r and ` for an in-line code chunk, like so:
+`r round(some_value, 2)`. The code will be executed and
+replaced with the value of the result.
+
Don’t let these in-line chunks get split across lines.
+
Perhaps precede the paragraph with a larger code chunk that does
+calculations and defines variables, with include=FALSE for
+that larger chunk (which is the same as echo=FALSE and
+results="hide").
+
Rounding can produce differences in output in such situations. You
+may want 2.0, but round(2.03, 1) will give
+just 2.
+
The myround
+function in the R/broman
+package handles this.
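If you prefer not to depend on an extra package, base R can also keep the trailing zero. This is a small sketch of two common approaches; the variable names are only for illustration.

```r
x <- 2.03

# round() returns a number, and the trailing zero is dropped when it is printed
plain <- as.character(round(x, 1))              # "2"

# sprintf() and format() return strings with a fixed number of decimals
fixed_sprintf <- sprintf("%.1f", x)             # "2.0"
fixed_format  <- format(round(x, 1), nsmall = 1)  # "2.0"
```

Either string can then be used inside an in-line chunk, for example `r sprintf("%.1f", x)`.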
+
+
+
+
+
+
Challenge 4
+
+
+
Try out a bit of in-line R code.
+
+
+
+
+
+
+
+
+
Here’s some inline code to determine that 2 + 2 = `r 2 + 2`.
+
+
+
+
+
Other output options
+
+
+
You can also convert R Markdown to a PDF or a Word document. Click
+the little triangle next to the “Knit” button to get a drop-down menu.
+Or you could put pdf_document or word_document
+in the initial header of the file.
+
+
+
+
+
+
Tip: Creating PDF documents
+
+
+
Creating .pdf documents may require installation of some extra
+software. The R package tinytex provides some tools to help
+make this process easier for R users. With tinytex
+installed, run tinytex::install_tinytex() to install the
+required software (you’ll only need to do this once) and then when you
+knit to pdf tinytex will automatically detect and install
+any additional LaTeX packages that are needed to produce the pdf
+document. Visit the tinytex
+website for more information.
+
+
+
+
+
+
+
+
+
Tip: Visual markdown editing in RStudio
+
+
+
RStudio versions 1.4 and later include visual markdown editing mode.
+In visual editing mode, markdown expressions (like
+**bold words**) are transformed to the formatted appearance
+(bold words) as you type. This mode also includes a
+toolbar at the top with basic formatting buttons, similar to what you
+might see in common word processing software programs. You can turn
+visual editing on and off by pressing the button in the top right corner of your
+R Markdown document.
How can I write software that other people can use?
+
+
+
+
+
+
+
+
Objectives
+
+
Describe best practices for writing R and explain the justification
+for each.
+
+
+
+
+
+
+
Structure your project folder
+
+
+
Keep your project folder structured, organized and tidy, by creating
+subfolders for your code files, manuals, data, binaries, output plots,
+etc. It can be done completely manually, or with the help of RStudio’s
+New Project functionality, or a designated package, such as
+ProjectTemplate.
+
+
+
+
+
+
Tip: ProjectTemplate - a possible
+solution
+
+
+
One way to automate the management of projects is to install the
+third-party package, ProjectTemplate. This package will set
+up an ideal directory structure for project management. This is very
+useful as it enables you to have your analysis pipeline/workflow
+organised and structured. Together with the default RStudio project
+functionality and Git you will be able to keep track of your work as
+well as be able to share your work with collaborators.
For more information on ProjectTemplate and its functionality, visit
+the ProjectTemplate home page.
+
+
+
+
Make code readable
+
+
+
The most important part of writing code is making it readable and
+understandable. You want someone else to be able to pick up your code
+and be able to understand what it does: more often than not this someone
+will be you 6 months down the line, who will otherwise be cursing
+your past self.
+
Documentation: tell us what and why, not how
+
+
+
When you first start out, your comments will often describe what a
+command does, since you’re still learning yourself and it can help to
+clarify concepts and remind you later. However, these comments aren’t
+particularly useful later on when you don’t remember what problem your
+code is trying to solve. Try to also include comments that tell you
+why you’re solving a problem, and what problem that
+is. The how can come after that: it’s an implementation detail
+you ideally shouldn’t have to worry about.
+
Keep your code modular
+
+
+
Our recommendation is that you should separate your functions from
+your analysis scripts, and store them in a separate file that you
+source when you open the R session in your project. This
+approach is nice because it leaves you with an uncluttered analysis
+script, and a repository of useful functions that can be loaded into any
+analysis script in your project. It also lets you group related
+functions together easily.
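As a small sketch of this layout, the example below writes a helper function to its own file and then loads it with `source()`. The file name `helper_functions.R` and the function `fahr_to_kelvin()` are hypothetical, and a temporary directory is used only so that the example runs anywhere. In a real project the file would live in your project folder.

```r
# Create a separate file holding the function definitions
fun_file <- file.path(tempdir(), "helper_functions.R")
writeLines(c(
  "# Convert a temperature from Fahrenheit to Kelvin",
  "fahr_to_kelvin <- function(temp) {",
  "  ((temp - 32) * (5 / 9)) + 273.15",
  "}"
), fun_file)

# At the top of the analysis script: load all helper functions
source(fun_file)

# The analysis script itself stays short and only calls the helpers
freezing_point <- fahr_to_kelvin(32)
freezing_point
```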
+
Break down problems into bite-size pieces
+
+
+
When you first start out, problem solving and function writing can be
+daunting tasks, and hard to separate from coding inexperience. Try to
+break down your problem into digestible chunks and worry about the
+implementation details later: keep breaking down the problem into
+smaller and smaller functions until you reach a point where you can code
+a solution, and build back up from there.
+
Know that your code is doing the right thing
+
+
+
Make sure to test your functions!
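One lightweight way to do this, sketched here with base R only, is to check a function against values you already know. The function `fahr_to_kelvin()` is a hypothetical example; dedicated packages such as testthat offer more structure, but `stopifnot()` is enough to get started.

```r
# Hypothetical function under test
fahr_to_kelvin <- function(temp) {
  ((temp - 32) * (5 / 9)) + 273.15
}

# Known reference points: water freezes at 32 F (273.15 K)
# and boils at 212 F (373.15 K).
# all.equal() compares with a small numerical tolerance.
stopifnot(
  isTRUE(all.equal(fahr_to_kelvin(32), 273.15)),
  isTRUE(all.equal(fahr_to_kelvin(212), 373.15))
)
```

If a later change breaks the function, `stopifnot()` raises an error the next time the script is run.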
+
Don’t repeat yourself
+
+
+
Functions enable easy reuse within a project. If you see blocks of
+similar lines of code through your project, those are usually candidates
+for being moved into functions.
+
If your calculations are performed through a series of functions,
+then the project becomes more modular and easier to change. This is
+especially true of functions for which a particular input always gives
+the same output.
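To illustrate, suppose the same calculation appears several times in a script. A minimal sketch of moving it into a function follows; the function name `std_error()` and the example vectors are invented for illustration.

```r
# The repeated pattern sd(x) / sqrt(length(x)) is written once, as a function
std_error <- function(x) {
  sd(x) / sqrt(length(x))
}

# Invented example measurements
heights <- c(150, 160, 170)
weights <- c(50, 60, 70)

# Each use is now a single, readable call
se_heights <- std_error(heights)
se_weights <- std_error(weights)
```

If the definition of the calculation ever needs to change, it only has to be changed in one place.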
+
Remember to be stylish
+
+
+
Apply consistent style to your code.
+
+
+
+
+
+
Keypoints
+
+
+
+
Keep your project folder structured, organized and tidy.
+
Document what and why, not how.
+
Break programs into short single-purpose functions.
+
Write re-runnable tests.
+
Don’t repeat yourself.
+
Be consistent in naming, indentation, and other aspects of
+style.
diff --git a/index.html b/index.html
new file mode 100644
index 000000000..e07b6bb42
--- /dev/null
+++ b/index.html
@@ -0,0 +1,464 @@
+
+R for Reproducible Scientific Analysis: Summary and Setup
+ Skip to main content
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ R for Reproducible Scientific Analysis
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Summary and Setup
+
+
+
An introduction to R for non-programmers using gapminder
+data
+
The goal of this lesson is to teach novice programmers to write
+modular code and to follow best practices when using R for data analysis. R is
+commonly used in many scientific disciplines for statistical analysis,
+and is valued for its array of third-party packages. We find that many scientists who
+come to Software Carpentry workshops use R and want to learn more. The
+emphasis of these materials is to give attendees a strong foundation in
+the fundamentals of R, and to teach best practices for scientific
+computing: breaking down analyses into modular units, task automation,
+and encapsulation.
+
Note that this workshop will focus on teaching the fundamentals of
+the programming language R, and will not teach statistical analysis.
+
The lesson contains more material than can be taught in a day. The instructor notes page has some
+suggested lesson plans suitable for a one-day or half-day workshop.
+
A variety of third party packages are used throughout this workshop.
+These are not necessarily the best, nor are they comprehensive, but they
+are packages we find useful, and have been chosen primarily for their
+usability.
+
+
+
+
+
+
Prerequisites
+
+
+
Understand that computers store data and instructions (programs,
+scripts etc.) in files. Files are organised in directories (folders).
+Know how to access files not in the working directory by specifying the
+path.
+
+
+
+
+
+
This lesson assumes you have R and RStudio installed on your
+computer.
+Download
+and install RStudio. RStudio is an application (an integrated
+development environment or IDE) that facilitates the use of R and offers
+a number of nice additional features. You will need the free Desktop
+version for your computer.
+
+
diff --git a/instructor-notes.html b/instructor-notes.html
new file mode 100644
index 000000000..ba32de996
--- /dev/null
+++ b/instructor-notes.html
@@ -0,0 +1,629 @@
+
+
+
+
+
+R for Reproducible Scientific Analysis: Instructor Notes
+
+
+
+
+
+
+
+
+
+
+
+ Skip to main content
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ R for Reproducible Scientific Analysis
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Instructor Notes
+
+
+
Timing
+
+
+
Leave about 30 minutes at the start of each workshop and another 15
+mins at the start of each session for technical difficulties like WiFi
+and installing things (even if you asked students to install in advance,
+longer if not).
+
Lesson Plans
+
+
+
The lesson contains much more material than can be taught in a day.
+Instructors will need to pick an appropriate subset of episodes to use
+in a standard one day course.
08 Creating Publication-Quality Graphics with ggplot2 OR 13
+Dataframe Manipulation with dplyr
+
15 Producing Reports With knitr
+
+
A half day course could consist of (suggested by @karawoo):
+
+
01 Introduction to R and RStudio
+
04 Data Structures (only creating vectors with
+c())
+
05 Exploring Data Frames (“Realistic example” section onwards)
+
06 Subsetting Data (excluding factor, matrix and list
+subsetting)
+
08 Creating Publication-Quality Graphics with ggplot2
+
Setting up git in RStudio
+
+
+
There can be difficulties linking git to RStudio depending on the
+operating system and the version of the operating system. To make sure
+Git is properly installed and configured, the learners should go to the
+Options window in the RStudio application.
+
+
+Mac OS X:
+
+
Go to RStudio -> Preferences… -> Git/SVN
+
Check and see whether there is a path to a file in the “Git
+executable” window. If not, the next challenge is figuring out where Git
+is located.
+
In the terminal enter which git and you will get a path
+to the git executable. In the “Git executable” window you may have
+difficulties finding the directory since OS X hides many of the
+operating system files. While the file selection window is open,
+pressing “Command-Shift-G” will pop up a text entry box where you will
+be able to type or paste in the full path to your git executable:
+e.g. /usr/bin/git or whatever else it might be.
+
+
+
+Windows:
+
+
Go to Tools -> Global Options… -> Git/SVN
+
If you use the Software Carpentry Installer, then ‘git.exe’ should
+be installed at C:/Program Files/Git/bin/git.exe.
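Learners can also check from the R console whether Git is on the search
+path. This is a minimal sketch that assumes only base R’s Sys.which(),
+which returns an empty string when the executable cannot be found:
+
```r
# Look up the git executable on the system search path.
git_path <- Sys.which("git")

if (nzchar(git_path)) {
  message("Git found at: ", git_path)
} else {
  message("Git was not found on the PATH; set the path manually in RStudio.")
}
```
+
If a path is printed, it can be pasted into the “Git executable” box.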
+
+
+
+
To prevent the learners from having to re-enter their password each
+time they push a commit to GitHub, this command (which can be run from a
+bash prompt) will make it so they only have to enter their password
+once:
The easiest way to get the data used in this lesson during a workshop
+is to have attendees download the raw data from gapminder-data and gapminder-data-wide.
+
Attendees can use the File - Save As dialog in their
+browser to save the file.
+
Overall
+
+
+
Make sure to emphasize good practices: put code in scripts, and make
+sure they’re version controlled. Encourage students to create script
+files for challenges.
+
If you’re working in a cloud environment, get them to upload the
+gapminder data after the second lesson.
+
Make sure to emphasize that matrices are vectors under the hood
+and data frames are lists under the hood: this will explain a lot
+of the esoteric behaviour encountered in basic operations.
+
Vector recycling and function stacks are probably best explained with
+diagrams on a whiteboard.
+
Be sure to actually go through examples of an R help page: help files
+can be intimidating at first, but knowing how to read them is
+tremendously useful.
+
Be sure to show the CRAN task views and look at one of the topics.
+
There’s a lot of content: move quickly through the earlier lessons.
+They are extensive mostly so that learners absorb material by osmosis:
+their memory will be triggered later when they encounter a problem or
+some esoteric behaviour.
+
Key lessons to take time on:
+
+
Data subsetting - conceptually difficult for novices
+
Functions - learners especially struggle with this
+
Data structures - worth being thorough, but you can go through it
+quickly.
+
+
Don’t worry about being correct or knowing the material
+back-to-front. Use mistakes as teaching moments: the most vital skill
+you can impart is how to debug and recover from unexpected errors.
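Two of the points above, that matrices are vectors and data frames are
+lists, and that shorter vectors are recycled, can be shown live in a few
+lines. This is only a sketch for demonstration; the object names are
+made up for illustration:
+
```r
# A matrix is an atomic vector with a "dim" attribute.
m <- matrix(1:6, nrow = 2)
attributes(m)   # only $dim is stored on top of the underlying vector
m[5]            # 5: plain vector indexing works, counting down the columns

# A data frame is a list of equal-length columns.
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
is.list(df)     # TRUE
length(df)      # 2: one list element per column

# Recycling: the shorter vector is reused until it matches the longer one.
recycled <- 1:6 + 1:2
recycled        # 2 4 4 6 6 8
```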
+
+
+
+
diff --git a/instructor/01-rstudio-intro.html b/instructor/01-rstudio-intro.html
new file mode 100644
index 000000000..b643c6f02
--- /dev/null
+++ b/instructor/01-rstudio-intro.html
@@ -0,0 +1,1470 @@
+
+R for Reproducible Scientific Analysis: Introduction to R and RStudio
+
Describe the purpose and use of each pane in the RStudio IDE
+
Locate buttons and options in the RStudio IDE
+
Define a variable
+
Assign data to a variable
+
Manage a workspace in an interactive R session
+
Use mathematical and comparison operators
+
Call functions
+
Manage packages
+
+
+
+
+
+
Motivation
+
+
Science is a multi-step process: once you’ve designed an experiment
+and collected data, the real fun begins! This lesson will teach you how
+to start this process using R and RStudio. We will begin with raw data,
+perform exploratory analyses, and learn how to plot results graphically.
+This example starts with a dataset from gapminder.org containing population
+information for many countries through time. Can you read the data into
+R? Can you plot the population for Senegal? Can you calculate the
+average income for countries on the continent of Asia? By the end of
+these lessons you will be able to do things like plot the populations
+for all of these countries in under a minute!
+
Before Starting The Workshop
+
+
Please ensure you have the latest version of R and RStudio installed
+on your machine. This is important, as some packages used in the
+workshop may not install correctly (or at all) if R is not up to
+date.
Welcome to the R portion of the Software Carpentry workshop.
+
Throughout this lesson, we’re going to teach you some of the
+fundamentals of the R language as well as some best practices for
+organizing code for scientific projects that will make your life
+easier.
+
We’ll be using RStudio: a free, open-source R Integrated Development
+Environment (IDE). It provides a built-in editor, works on all platforms
+(including on servers) and provides many advantages such as integration
+with version control and project management.
+
Basic layout
+
When you first open RStudio, you will be greeted by three panels:
+
The interactive R console/Terminal (entire left)
+
Environment/History/Connections (tabbed in upper right)
+
Files/Plots/Packages/Help/Viewer (tabbed in lower right)
+
Once you open files, such as R scripts, an editor panel will also
+open in the top left.
+
+
+
+
+
+
R scripts
+
+
+
Any commands that you write in the R console can be saved to a file
+to be re-run later. Files containing R code to be run in this way are
+called R scripts. R scripts have .R at the end of their
+names to let you know what they are.
+
+
+
+
Workflow within RStudio
+
+
There are two main ways one can work within RStudio:
+
Test and play within the interactive R console then copy code into a
+.R file to run later.
+
This works well when doing small tests and initially starting
+off.
+
It quickly becomes laborious.
+
Start writing in a .R file and use RStudio’s shortcut keys for the
+Run command to push the current line, selected lines or modified lines
+to the interactive R console.
+
This is a great way to start; all your code is saved for later.
+
You will be able to run the file you create from within RStudio or
+using R’s source() function.
+
+
+
+
+
+
Tip: Running segments of your code
+
+
+
RStudio offers you great flexibility in running code from within the
+editor window. There are buttons, menu choices, and keyboard shortcuts.
+To run the current line, you can
+
click on the Run button above the editor panel, or
+
select “Run Lines” from the “Code” menu, or
+
hit Ctrl+Return in Windows or Linux or
+⌘+Return on OS X. (This shortcut can also be seen
+by hovering the mouse over the button.) To run a block of code, select
+it and then Run. If you have modified a line of code within
+a block of code you have just run, there is no need to reselect the
+section and Run; you can use the next button along,
+Re-run the previous region. This will run the previous code
+block including the modifications you have made.
+
+
+
+
Introduction to R
+
+
Much of your time in R will be spent in the R interactive console.
+This is where you will run all of your code, and can be a useful
+environment to try out ideas before adding them to an R script file.
+This console in RStudio is the same as the one you would get if you
+typed in R in your command-line environment.
+
The first thing you will see in the R interactive session is a bunch
+of information, followed by a “>” and a blinking cursor. In many ways
+this is similar to the shell environment you learned about during the
+shell lessons: it operates on the same idea of a “Read, evaluate, print
+loop”: you type in commands, R tries to execute them, and then returns a
+result.
+
Using R as a calculator
+
+
The simplest thing you could do with R is to do arithmetic:
+
+
R
+
+
+1+100
+
+
+
OUTPUT
+
+
[1] 101
+
+
And R will print out the answer, with a preceding “[1]”. [1] is the
+index of the first element of the line being printed in the console. For
+more information on indexing vectors, see Episode
+6: Subsetting Data.
+
If you type in an incomplete command, R will wait for you to complete
+it. If you are familiar with Unix Shell’s bash, you may recognize this
+behavior from bash.
+
+
R
+
+
>1+
+
+
+
OUTPUT
+
+
+
+
+
Any time you hit return and the R session shows a “+” instead of a
+“>”, it means it’s waiting for you to complete the command. If you
+want to cancel a command you can hit Esc and RStudio will
+give you back the “>” prompt.
+
+
+
+
+
+
Tip: Canceling commands
+
+
+
If you’re using R from the command line instead of from within
+RStudio, you need to use Ctrl+C instead of
+Esc to cancel the command. This applies to Mac users as
+well!
+
Canceling a command isn’t only useful for killing incomplete
+commands: you can also use it to tell R to stop running code (for
+example if it’s taking much longer than you expect), or to get rid of
+the code you’re currently writing.
+
+
+
+
When using R as a calculator, the order of operations is the same as
+you would have learned back in school.
+
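From highest to lowest precedence (multiplication and division share
+one level, as do addition and subtraction, and each pair is evaluated
+left to right):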
+
Parentheses: (, )
+
+
Exponents: ^ or **
+
+
Multiply: *
+
+
Divide: /
+
+
Add: +
+
+
Subtract: -
+
+
+
R
+
+
+3+5*2
+
+
+
OUTPUT
+
+
[1] 13
+
+
Use parentheses to group operations in order to force the order of
+evaluation if it differs from the default, or to make clear what you
+intend.
+
+
R
+
+
+(3+5)*2
+
+
+
OUTPUT
+
+
[1] 16
+
+
This can get unwieldy when not needed, but clarifies your intentions.
+Remember that others may later read your code.
+
+
R
+
+
+(3 + (5 * (2 ^ 2))) # hard to read
+3 + 5 * 2 ^ 2       # clear, if you remember the rules
+3 + 5 * (2 ^ 2)     # if you forget some rules, this might help
+
+
The text after each line of code is called a “comment”. Anything that
+follows after the hash (or octothorpe) symbol # is ignored
+by R when it executes code.
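A few edge cases of the precedence rules are worth trying in the console.
+The sketch below uses only base R arithmetic; the comments give the value
+each line evaluates to:
+
```r
8 / 4 * 2    # 4: same precedence, so evaluated left to right as (8 / 4) * 2
2 ^ 3 ^ 2    # 512: exponentiation groups right to left, i.e. 2 ^ (3 ^ 2)
-2 ^ 2       # -4: the exponent is applied before the unary minus
(-2) ^ 2     # 4: parentheses force the minus to be applied first
```
+
When in doubt, adding parentheses costs nothing and makes the intent explicit.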
+
Really small or large numbers get a scientific notation:
+
+
R
+
+
+2/10000
+
+
+
OUTPUT
+
+
[1] 2e-04
+
+
Here the e is shorthand for “multiplied by 10^XX”. So
+2e-4 is shorthand for 2 * 10^(-4).
+
You can write numbers in scientific notation too:
+
+
R
+
+
+5e3  # Note the lack of minus here
+
+
+
OUTPUT
+
+
[1] 5000
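Scientific notation only changes how a number is written, not its value.
+As a small illustration, assuming base R’s format() function (not used
+elsewhere in this episode), you can switch between the two spellings:
+
```r
same_value <- 5e3 == 5000                          # TRUE: two spellings of one number
plain <- format(2e-04, scientific = FALSE)         # "0.0002"
compact <- format(123456789, scientific = TRUE)    # "1.234568e+08"

same_value
plain
compact
```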
+
+
Mathematical functions
+
+
R has many built-in mathematical functions. To call a function, we
+can type its name, followed by open and closing parentheses. Functions
+take arguments as inputs; anything we type inside the parentheses of a
+function is considered an argument. Depending on the function, the
+number of arguments can vary from none to multiple. For example:
+
+
R
+
+
+getwd()  # returns an absolute filepath
+
+
doesn’t require an argument, whereas for the next set of mathematical
+functions we will need to supply the function a value in order to
+compute the result.
+
+
R
+
+
+sin(1)  # trigonometry functions
+
+
+
OUTPUT
+
+
[1] 0.841471
+
+
+
R
+
+
+log(1)  # natural logarithm
+
+
+
OUTPUT
+
+
[1] 0
+
+
+
R
+
+
+log10(10)  # base-10 logarithm
+
+
+
OUTPUT
+
+
[1] 1
+
+
+
R
+
+
+exp(0.5)  # e^(1/2)
+
+
+
OUTPUT
+
+
[1] 1.648721
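Some functions accept more than one argument, and extra arguments often
+have default values. The sketch below uses base R’s round() and log() to
+show a call with one argument and a call with a second, named argument:
+
```r
rounded_whole <- round(3.14159)             # digits defaults to 0, giving 3
rounded_two <- round(3.14159, digits = 2)   # a second, named argument: 3.14
log_base_two <- log(8, base = 2)            # base defaults to e; here it is set to 2

rounded_whole
rounded_two
log_base_two
```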
+
+
Don’t worry about trying to remember every function in R. You can
+look them up on Google, or if you can remember the start of the
+function’s name, use the tab completion in RStudio.
+
This is one advantage that RStudio has over R on its own: it has
+auto-completion abilities that allow you to more easily look up
+functions, their arguments, and the values that they take.
+
Typing a ? before the name of a command will open the
+help page for that command. When using RStudio, this will open the
+‘Help’ pane; if using R in the terminal, the help page will be shown as
+plain text or in your browser, depending on your platform and settings.
+The help page will include a detailed description of the
+command and how it works. Scrolling to the bottom of the help page will
+usually show a collection of code examples which illustrate command
+usage. We’ll go through an example later.
+
Comparing things
+
+
We can also do comparisons in R:
+
+
R
+
+
+1 == 1  # equality (note two equals signs, read as "is equal to")
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
R
+
+
+1 != 2  # inequality (read as "is not equal to")
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
R
+
+
+1 < 2  # less than
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
R
+
+
+1 <= 1  # less than or equal to
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
R
+
+
+1 > 0  # greater than
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
R
+
+
+1 >= -9  # greater than or equal to
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
+
+
+
+
Tip: Comparing Numbers
+
+
+
A word of warning about comparing numbers: you should never use
+== to compare two numbers unless they are integers (a data
+type which can specifically represent only whole numbers).
+
Computers may only represent decimal numbers with a certain degree of
+precision, so two numbers which look the same when printed out by R, may
+actually have different underlying representations and therefore be
+different by a small margin of error (called Machine numeric
+tolerance).
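To see the problem in practice, here is a short sketch. It assumes base
+R’s all.equal(), which compares numbers within a small tolerance; the
+explicit tolerance of 1e-9 in the last line is an arbitrary choice for
+illustration:
+
```r
exact_match <- 0.1 + 0.2 == 0.3                      # FALSE: the sides differ in the last bits
close_match <- isTRUE(all.equal(0.1 + 0.2, 0.3))     # TRUE: equal within a small tolerance
manual_match <- abs((0.1 + 0.2) - 0.3) < 1e-9        # TRUE: an explicit tolerance also works

exact_match
close_match
manual_match
```
+
Wrapping all.equal() in isTRUE() matters because all.equal() returns a
+description of the difference, not FALSE, when the values differ.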
We can store values in variables using the assignment operator
+<-, like this:
+
+
R
+
+
+x <- 1/40
+
+
Notice that assignment does not print a value. Instead, we stored it
+for later in something called a variable. x now contains the value
+0.025:
+
+
R
+
+
+x
+
+
+
OUTPUT
+
+
[1] 0.025
+
+
More precisely, the stored value is a decimal approximation
+of this fraction called a floating point number.
+
Look for the Environment tab in the top right panel of
+RStudio, and you will see that x and its value have
+appeared. Our variable x can be used in place of a number
+in any calculation that expects a number:
+
+
R
+
+
+log(x)
+
+
+
OUTPUT
+
+
[1] -3.688879
+
+
Notice also that variables can be reassigned:
+
+
R
+
+
+x <- 100
+
+
x used to contain the value 0.025 and now it has the
+value 100.
+
Assignment values can contain the variable being assigned to:
+
+
R
+
+
+x <- x + 1  # notice how RStudio updates its description of x on the top right tab
+y <- x * 2
+
+
The right hand side of the assignment can be any valid R expression.
+The right hand side is fully evaluated before the assignment
+occurs.
+
Variable names can contain letters, numbers, underscores and periods
+but no spaces. They must start with a letter or a period followed by a
+letter (they cannot start with a number nor an underscore). Variables
+beginning with a period are hidden variables. Different people use
+different conventions for long variable names. These include:
+
periods.between.words
+
underscores_between_words
+
camelCaseToSeparateWords
+
What you use is up to you, but be consistent.
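If you are unsure whether a name follows the rules above, you can ask R.
+This is a sketch, not part of the original lesson: it relies on base R’s
+make.names(), which rewrites any string into a syntactically valid name,
+and the candidate names are made up for illustration:
+
```r
candidates <- c("my_var", "my.var2", ".hidden", "2nd_try", "my var", ".2way")

# make.names() rewrites a string into a syntactically valid name, so a
# candidate that comes back unchanged was already valid.
is_valid <- candidates == make.names(candidates)
data.frame(candidates, is_valid)
```
+
The first three candidates come back unchanged; the last three are altered
+because they start with a digit, contain a space, or start with a period
+followed by a digit.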
+
It is also possible to use the = operator for
+assignment:
+
+
R
+
+
+x = 1/40
+
+
But this is much less common among R users. The most important thing
+is to be consistent with the operator you use. There
+are occasionally places where it is less confusing to use
+<- than =, and it is the most common symbol
+used in the community. So the recommendation is to use
+<-.
+
+
+
+
+
+
Challenge 1
+
+
+
Which of the following are valid R variable names?
The following cannot be used to create a variable:
+
+
R
+
+
_age
+min-length
+2widths
+
+
+
+
+
+
Vectorization
+
+
One final thing to be aware of is that R is vectorized,
+meaning that variables and functions can have vectors as values. In
+contrast to physics and mathematics, a vector in R describes a set of
+values of the same data type in a certain order. For example:
+
+
R
+
+
+1:5
+
+
+
OUTPUT
+
+
[1] 1 2 3 4 5
+
+
+
R
+
+
+2^(1:5)
+
+
+
OUTPUT
+
+
[1] 2 4 8 16 32
+
+
+
R
+
+
+x <- 1:5
+2^x
+
+
+
OUTPUT
+
+
[1] 2 4 8 16 32
+
+
This is incredibly powerful; we will discuss this further in an
+upcoming lesson.
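Vectorization applies to comparisons and most mathematical functions as
+well as to arithmetic. A brief sketch using only base R, with x defined
+as above:
+
```r
x <- 1:5
doubled <- x * 2        # 2 4 6 8 10: each element is multiplied
above_three <- x > 3    # FALSE FALSE FALSE TRUE TRUE: one comparison per element
how_many <- sum(x > 3)  # 2: TRUE counts as 1 when summed

doubled
above_three
how_many
```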
+
Managing your environment
+
+
There are a few useful commands you can use to interact with the R
+session.
+
ls will list all of the variables and functions stored
+in the global environment (your working R session):
+
+
R
+
+
+ls()
+
+
+
OUTPUT
+
+
[1] "x" "y"
+
+
+
+
+
+
+
Tip: hidden objects
+
+
+
Like in the shell, ls will hide any variables or
+functions starting with a “.” by default. To list all objects, type
+ls(all.names = TRUE) instead.
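The difference is easy to see with a small experiment. This sketch uses a
+separate scratch environment (an assumption made here so that your own
+workspace is left untouched); the object names are made up:
+
```r
# Use a scratch environment so the example leaves your workspace untouched.
scratch <- new.env()
assign("visible", 1, envir = scratch)
assign(".hidden", 2, envir = scratch)

shown_by_default <- ls(scratch)               # "visible"
shown_all <- ls(scratch, all.names = TRUE)    # includes ".hidden" as well

shown_by_default
shown_all
```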
+
+
+
+
Note here that we didn’t give any arguments to ls, but
+we still needed to give the parentheses to tell R to call the
+function.
+
If we type ls by itself, R prints a bunch of code
+instead of a listing of objects.
+
+
R
+
+
+ls
+
+
+
OUTPUT
+
+
function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE,
+ pattern, sorted = TRUE)
+{
+ if (!missing(name)) {
+ pos <- tryCatch(name, error = function(e) e)
+ if (inherits(pos, "error")) {
+ name <- substitute(name)
+ if (!is.character(name))
+ name <- deparse(name)
+ warning(gettextf("%s converted to character string",
+ sQuote(name)), domain = NA)
+ pos <- name
+ }
+ }
+ all.names <- .Internal(ls(envir, all.names, sorted))
+ if (!missing(pattern)) {
+ if ((ll <- length(grep("[", pattern, fixed = TRUE))) &&
+ ll != length(grep("]", pattern, fixed = TRUE))) {
+ if (pattern == "[") {
+ pattern <- "\\["
+ warning("replaced regular expression pattern '[' by '\\\\['")
+ }
+ else if (length(grep("[^\\\\]\\[<-", pattern))) {
+ pattern <- sub("\\[<-", "\\\\\\[<-", pattern)
+ warning("replaced '[<-' by '\\\\[<-' in regular expression pattern")
+ }
+ }
+ grep(pattern, all.names, value = TRUE)
+ }
+ else all.names
+}
+<bytecode: 0x557b0600c360>
+<environment: namespace:base>
+
+
What’s going on here?
+
Like everything in R, ls is the name of an object, and
+entering the name of an object by itself prints the contents of the
+object. The object x that we created earlier contains 1, 2,
+3, 4, 5:
+
+
R
+
+
+x
+
+
+
OUTPUT
+
+
[1] 1 2 3 4 5
+
+
The object ls contains the R code that makes the
+ls function work! We’ll talk more about how functions work
+and start writing our own later.
+
You can use rm to delete objects you no longer need:
+
+
R
+
+
+rm(x)
+
+
If you have lots of things in your environment and want to delete all
+of them, you can pass the results of ls to the
+rm function:
+
+
R
+
+
+rm(list = ls())
+
+
In this case we’ve combined the two. Like the order of operations,
+anything inside the innermost parentheses is evaluated first, and so
+on.
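The same nesting lets you remove only some objects. In this sketch the
+pattern argument of ls() selects names by regular expression; the scratch
+environment and the tmp_ naming scheme are assumptions made for the
+example:
+
```r
scratch <- new.env()
assign("tmp_a", 1, envir = scratch)
assign("tmp_b", 2, envir = scratch)
assign("keep_me", 3, envir = scratch)

# The inner ls() call is evaluated first; its result becomes rm()'s list argument.
rm(list = ls(scratch, pattern = "^tmp_"), envir = scratch)
remaining <- ls(scratch)
remaining   # "keep_me"
```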
+
In this case we’ve specified that the results of ls
+should be used for the list argument in rm.
+When assigning values to arguments by name, you must use the
+= operator!
+
If instead we use <-, there will be unintended side
+effects, or you may get an error message:
+
+
R
+
+
+rm(list <- ls())
+
+
+
ERROR
+
+
Error in rm(list <- ls()): ... must contain names or character strings
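The “unintended side effects” are easiest to see with a function that does
+not raise an error. In this sketch, median() is just a convenient base R
+function and vals_demo is a made-up name assumed not to exist yet:
+
```r
# Named argument: the values are handed to median() and nothing else changes.
median(x = c(1, 5, 9))
exists("vals_demo")          # FALSE, provided no object of that name exists yet

# Assignment inside the call: vals_demo is created in the workspace as a side
# effect, and its value is then passed to median() as the first argument.
median(vals_demo <- c(1, 5, 9))
exists("vals_demo")          # TRUE
```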
+
+
+
+
+
+
+
Tip: Warnings vs. Errors
+
+
+
Pay attention when R does something unexpected! Errors, like above,
+are thrown when R cannot proceed with a calculation. Warnings on the
+other hand usually mean that the function has run, but it probably
+hasn’t worked as expected.
+
In both cases, the message that R prints out usually gives you clues
+about how to fix a problem.
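To make the distinction concrete, here is a sketch using base R’s condition
+handling functions. The inputs are chosen only to provoke one warning and
+one error:
+
```r
# A warning: the call still finishes and returns a value (here NA).
result <- withCallingHandlers(
  as.numeric("ten"),
  warning = function(w) {
    message("Caught a warning: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
result

# An error: the call cannot finish. tryCatch() captures the message so that
# a script can carry on instead of stopping.
error_text <- tryCatch(
  log("ten"),
  error = function(e) conditionMessage(e)
)
error_text
```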
+
+
+
+
R Packages
+
+
It is possible to add functions to R by writing a package, or by
+obtaining a package written by someone else. As of this writing, there
+are over 10,000 packages available on CRAN (the Comprehensive R Archive
+Network). R and RStudio have functionality for managing packages:
+
You can see what packages are installed by typing
+installed.packages()
+
+
You can install packages by typing
+install.packages("packagename"), where
+packagename is the package name, in quotes.
+
You can update installed packages by typing
+update.packages()
+
+
You can remove a package with
+remove.packages("packagename")
+
+
You can make a package available for use with
+library(packagename)
+
+
Packages can also be viewed, loaded, and detached in the Packages tab
+of the lower right panel in RStudio. Clicking on this tab will display
+all of the installed packages with a checkbox next to them. If the box
+next to a package name is checked, the package is loaded and if it is
+empty, the package is not loaded. Click an empty box to load that
+package and click a checked box to detach that package.
+
Packages can be installed and updated from the Packages tab with the
+Install and Update buttons at the top of the tab.
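In a script, a common pattern is to install a package only when it is
+missing and then load it. The sketch below uses the built-in stats
+package as a stand-in so that nothing is actually downloaded; replace it
+with the package you need:
+
```r
pkg <- "stats"   # stand-in; replace with the package you need

# requireNamespace() returns TRUE if the package can be loaded, without attaching it.
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

library(pkg, character.only = TRUE)  # character.only is needed when the name is in a variable
packageVersion(pkg)                  # version of the installed copy
```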
+
+
+
+
+
+
Challenge 2
+
+
+
What will be the value of each variable after each statement in the
+following program?
The scientific process is naturally incremental, and many projects
+start life as random notes, some code, then a manuscript, and eventually
+everything is a bit mixed together.
+
+
+Managing your projects in a reproducible fashion doesn’t just make your
+science reproducible, it makes your life easier.
+
Most people tend to organize their projects like this:
+
There are many reasons why we should ALWAYS avoid this:
+
It is really hard to tell which version of your data is the original
+and which is the modified;
+
It gets really messy because it mixes files with various extensions
+together;
+
It probably takes you a lot of time to actually find things, and
+relate the correct figures to the exact code that has been used to
+generate them.
+
A good project layout will ultimately make your life easier:
+
It will help ensure the integrity of your data;
+
It makes it simpler to share your code with someone else (a
+lab-mate, collaborator, or supervisor);
+
It allows you to easily upload your code with your manuscript
+submission;
+
It makes it easier to pick the project back up after a break.
+
A possible solution
+
+
Fortunately, there are tools and packages which can help you manage
+your work effectively.
+
One of the most powerful and useful aspects of RStudio is its project
+management functionality. We’ll be using this today to create a
+self-contained, reproducible project.
+
+
+
+
+
+
Challenge 1: Creating a self-contained
+project
+
+
+
We’re going to create a new project in RStudio:
+
Click the “File” menu button, then “New Project”.
+
Click “New Directory”.
+
Click “New Project”.
+
Type in the name of the directory to store your project,
+e.g. “my_project”.
+
If available, select the checkbox for “Create a git
+repository.”
+
Click the “Create Project” button.
+
+
+
+
The simplest way to open an RStudio project once it has been created
+is to click through your file system to get to the directory where it
+was saved and double click on the .Rproj file. This will
+open RStudio and start your R session in the same directory as the
+.Rproj file. All your data, plots and scripts will now be
+relative to the project directory. RStudio projects have the added
+benefit of allowing you to open multiple projects at the same time each
+open to its own project directory. This allows you to keep multiple
+projects open without them interfering with each other.
+
+
+
+
+
+
Challenge 2: Opening an RStudio project
+through the file system
+
+
+
Exit RStudio.
+
Navigate to the directory where you created a project in Challenge
+1.
+
Double click on the .Rproj file in that directory.
+
+
+
+
Best practices for project organization
+
+
Although there is no “best” way to lay out a project, there are some
+general principles to adhere to that will make project management
+easier:
+
+
Treat data as read only
+
This is probably the most important goal of setting up a project.
+Data is typically time consuming and/or expensive to collect. Working
+with it interactively (e.g., in Excel) where it can be modified
+means you are never sure of where the data came from, or how it has been
+modified since collection. It is therefore a good idea to treat your
+data as “read-only”.
+
+
+
Data Cleaning
+
In many cases your data will be “dirty”: it will need significant
+preprocessing to get into a format R (or any other programming language)
+will find useful. This task is sometimes called “data munging”. Storing
+the cleaning scripts in a separate folder, and creating a second “read-only”
+data folder to hold the “cleaned” data sets can prevent confusion
+between the two sets.
+
+
+
Treat generated output as disposable
+
Anything generated by your scripts should be treated as disposable:
+it should all be able to be regenerated from your scripts.
+
There are lots of different ways to manage this output. Having an
+output folder with different sub-directories for each separate analysis
+makes it easier later, since many analyses are exploratory and don’t end
+up being used in the final project, and some of the analyses get shared
+between projects.
+
+
+
+
+
+
Tip: Good Enough Practices for Scientific
+Computing
+
Put each project in its own directory, which is named after the
+project.
+
Put text documents associated with the project in the
+doc directory.
+
Put raw data and metadata in the data directory, and
+files generated during cleanup and analysis in a results
+directory.
+
Put source for the project’s scripts and programs in the
+src directory, and programs brought in from elsewhere or
+compiled locally in the bin directory.
+
Name all files to reflect their content or function.
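The directory layout described in this tip can be created from R. This is
+a minimal sketch: it builds the folders inside a temporary directory, and
+the project name my_project is only an example; in practice the root
+would be your RStudio project folder:
+
```r
# Build the layout inside a temporary directory for demonstration purposes.
project_root <- file.path(tempdir(), "my_project")
folders <- c("data", "doc", "results", "src", "bin")

for (folder in folders) {
  dir.create(file.path(project_root, folder), recursive = TRUE, showWarnings = FALSE)
}

list.files(project_root)   # the five folders created above
```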
+
+
+
+
+
+
Separate function definition and application
+
One of the more effective ways to work with R is to start by writing
+the code you want to run directly in a .R script, and then running the
+selected lines (either using the keyboard shortcuts in RStudio or
+clicking the “Run” button) in the interactive R console.
+
When your project is in its early stages, the initial .R script file
+usually contains many lines of directly executed code. As it matures,
+reusable chunks get pulled into their own functions. It’s a good idea to
+separate these into two folders: one to store useful
+functions that you’ll reuse across analyses and projects, and one to
+store the analysis scripts.
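As a sketch of this separation, the example below keeps a function
+definition in one file and loads it from the analysis code with source().
+The file is written to a temporary location and the function
+fahr_to_celsius is a hypothetical example, not something defined earlier
+in this lesson:
+
```r
# Functions file: only definitions, no analysis code.
functions_file <- tempfile(fileext = ".R")
writeLines(c(
  "fahr_to_celsius <- function(temp_f) {",
  "  (temp_f - 32) * 5 / 9",
  "}"
), functions_file)

# Analysis script: load the definitions, then apply them.
source(functions_file)
converted <- fahr_to_celsius(c(32, 212))
converted   # 0 100
```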
+
+
+
Save the data in the data directory
+
Now that we have a good directory structure, we will save the
+data file in the data/ directory.
Download the file (right mouse click on the link above -> “Save
+link as” / “Save file as”, or click on the link and after the page
+loads, press Ctrl+S or choose File -> “Save
+page as”)
+
Make sure it’s saved under the name
+gapminder_data.csv
+
+
Save the file in the data/ folder within your
+project.
+
We will load and inspect these data later.
+
+
+
+
+
+
+
+
+
Challenge 4
+
+
+
It is useful to get some general idea about the dataset, directly
+from the command line, before loading it into R. Understanding the
+dataset better will come in handy when making decisions on how to load
+it in R. Use the command-line shell to answer the following
+questions:
+
What is the size of the file?
+
How many rows of data does it contain?
+
What kinds of values are stored in this file?
+
+
+
+
+
+
+
+
+
By running these commands in the shell:
+
+
SH
+
+
ls -lh data/gapminder_data.csv
+
+
+
OUTPUT
+
+
-rw-r--r-- 1 runner docker 80K Oct 26 09:54 data/gapminder_data.csv
The Terminal tab in the console pane provides a convenient place
+within RStudio to interact directly with the command line.
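The questions from Challenge 4 can also be answered from inside R. The
+sketch below writes a tiny stand-in file with made-up values so that it
+runs anywhere; in your project you would point csv_path at
+data/gapminder_data.csv instead:
+
```r
# A small stand-in file with made-up values.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("country,year,pop", "Atlantis,2000,100", "Atlantis,2005,120"), csv_path)

size_bytes <- file.size(csv_path)               # size of the file in bytes
n_rows <- length(readLines(csv_path)) - 1       # rows of data, excluding the header line
first_lines <- readLines(csv_path, n = 2)       # peek at the header and the first record

size_bytes
n_rows
first_lines
```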
+
+
+
+
+
+
Working directory
+
Knowing R’s current working directory is important because when you
+need to access other files (for example, to import a data file), R will
+look for them relative to the current working directory.
+
Each time you create a new RStudio Project, it will create a new
+directory for that project. When you open an existing
+.Rproj file, it will open that project and set R’s working
+directory to the folder that file is in.
+
+
+
+
+
+
Challenge 5
+
+
+
You can check the current working directory with the
+getwd() command, or by using the menus in RStudio.
+
In the console, type getwd() (“wd” is short for
+“working directory”) and hit Enter.
+
In the Files pane, double click on the data folder to
+open it (or navigate to any other folder you wish). To get the Files
+pane back to the current working directory, click “More” and then select
+“Go To Working Directory”.
+
You can change the working directory with setwd(), or by
+using RStudio menus.
+
In the console, type setwd("data") and hit Enter. Type
+getwd() and hit Enter to see the new working
+directory.
+
In the menus at the top of the RStudio window, click the “Session”
+menu button, and then select “Set Working Directory” and then “Choose
+Directory”. Next, in the file browser that opens, navigate back to
+the project directory, and click “Open”. Note that a setwd
+command will automatically appear in the console.
+
+
+
+
+
+
+
+
+
Tip: File does not exist errors
+
+
+
When you’re attempting to reference a file in your R code and you’re
+getting errors saying the file doesn’t exist, it’s a good idea to check
+your working directory. You need to either provide an absolute path to
+the file, or you need to make sure the file is saved in the working
+directory (or a subfolder of the working directory) and provide a
+relative path.
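A quick way to diagnose such errors is to ask R whether it can see the file
+from where it currently is. This sketch sets up a throwaway folder in a
+temporary directory (the folder and file names are made up) and restores
+the original working directory at the end:
+
```r
old_wd <- getwd()                 # remember where we started

demo_dir <- file.path(tempdir(), "wd_demo")
dir.create(file.path(demo_dir, "data"), recursive = TRUE, showWarnings = FALSE)
writeLines("x", file.path(demo_dir, "data", "example.csv"))

setwd(demo_dir)
found_relative <- file.exists("data/example.csv")  # TRUE: resolved from demo_dir
found_direct <- file.exists("example.csv")         # FALSE: not directly in the working directory

setwd(old_wd)                     # restore the original working directory
found_relative
found_direct
```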
To be able to read R help files for functions and special
+operators.
+
To be able to use CRAN task views to identify packages to solve a
+problem.
+
To be able to seek help from your peers.
+
+
+
+
+
+
Reading Help Files
+
+
R, and every package, provide help files for functions. The general
+syntax to get help on any function, “function_name”, that is in a
+package loaded into your namespace (your interactive R session) is:
+
+
R
+
+
+?function_name
+help(function_name)
+
+
For example, take a look at the help file for
+write.table(); we will be using a similar function in an
+upcoming episode.
+
+
R
+
+
+?write.table()
+
+
This will load up a help page in RStudio (or as plain text in R
+itself).
+
Each help page is broken down into sections:
+
Description: An extended description of what the function does.
+
Usage: The arguments of the function and their default values (which
+can be changed).
+
Arguments: An explanation of the data each argument is
+expecting.
+
Details: Any important details to be aware of.
+
Value: The data the function returns.
+
See Also: Any related functions you might find useful.
+
Examples: Some examples for how to use the function.
+
Different functions might have different sections, but these are the
+main ones you should be aware of.
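The information in the Usage section can also be inspected from the
+console. This sketch assumes base R’s args() and formals() functions,
+which are not covered elsewhere in this episode:
+
```r
args(write.table)                              # prints the signature shown under "Usage"
arg_names <- names(formals(write.table))       # argument names, in order
default_sep <- formals(write.table)$sep        # default value of the sep argument

arg_names
default_sep
```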
+
Notice how related functions might call for the same help file:
+
+
R
+
+
+?write.table()
+?write.csv()
+
+
This is because these functions have very similar applicability and
+often share the same arguments as inputs to the function, so package
+authors often choose to document them together in a single help
+file.
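You can check that two functions share a help file without opening it.
+This sketch relies on the object returned by help(), which holds the
+location of the matching documentation file; treat the exact form of that
+path as an implementation detail that may differ between installations:
+
```r
# Assigning the result avoids opening the help viewer.
topic_table <- help("write.table")
topic_csv <- help("write.csv")

# Compare the file names that the two topics resolve to.
same_file <- identical(
  basename(as.character(topic_table)),
  basename(as.character(topic_csv))
)
same_file
```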
+
+
+
+
+
+
Tip: Running Examples
+
+
+
From within the function help page, you can highlight code in the
+Examples and hit Ctrl+Return to run it in the RStudio
+console. This gives you a quick way to get a feel for how a function
+works.
+
+
+
+
+
+
+
+
+
Tip: Reading Help Files
+
+
+
One of the most daunting aspects of R is the large number of
+functions available. It would be prohibitive, if not impossible, to
+remember the correct usage for every function you use. Luckily, using
+the help files means you don’t have to remember that!
+
+
+
+
Special Operators
+
+
To seek help on special operators, use quotes or backticks:
+
+
R
+
+
+?"<-"
+?`<-`
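The quotes or backticks are needed because operators are ordinary
+functions with unusual names. As a small illustration (not part of the
+original lesson), backticks let you call an operator like any other
+function; the name answer is made up:
+
```r
three <- `+`(1, 2)              # the same as 1 + 2
negated <- sapply(1:3, `-`)     # -1 -2 -3: the operator passed as a function
`<-`(answer, 42)                # the same as answer <- 42

three
negated
answer
```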
+
+
Getting Help with Packages
+
+
Many packages come with “vignettes”: tutorials and extended example
+documentation. Without any arguments, vignette() will list
+all vignettes for all installed packages;
+vignette(package="package-name") will list all available
+vignettes for package-name, and
+vignette("vignette-name") will open the specified
+vignette.
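The listing returned by vignette() can also be stored and inspected as
+data. This sketch uses the grid package, which ships with R; whether any
+vignettes are listed depends on how R was installed:
+
```r
# List the vignettes that ship with one package.
grid_vignettes <- vignette(package = "grid")

# The results element is a matrix with one row per vignette.
grid_vignettes$results[, c("Item", "Title"), drop = FALSE]
```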
+
If a package doesn’t have any vignettes, you can usually find help by
+typing help("package-name").
+
RStudio also has a set of excellent cheatsheets for
+many packages.
+
When You Remember Part of the Function Name
+
+
If you’re not sure what package a function is in or how it’s
+specifically spelled, you can do a fuzzy search:
+
+
R
+
+
+??function_name
+
+
A fuzzy search is when you search for an approximate string match.
+For example, you may remember that the function to set your working
+directory includes “set” in its name. You can do a fuzzy search to help
+you identify the function:
+
+
R
+
+
+??set
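A related tool, not mentioned in the original text, is base R’s apropos().
+Unlike ??, which searches help pages, it searches the names of objects
+currently available in your session using a regular expression:
+
```r
starts_with_set <- apropos("^set")   # names beginning with "set" (case-insensitive)
ends_with_wd <- apropos("wd$")       # names ending in "wd"

"setwd" %in% starts_with_set         # TRUE
ends_with_wd
```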
+
+
When You Have No Idea Where to Begin
+
+
If you don’t know what function or package you need to use, CRAN Task Views is a
+specially maintained list of packages grouped into fields. This can be a
+good starting point.
+
When Your Code Doesn’t Work: Seeking Help from Your Peers
+
+
If you’re having trouble using a function, 9 times out of 10, the
+question you have has already been answered on Stack Overflow. You can search
+using the [r] tag. Please make sure to see their page on how to ask a good
+question.
+
If you can’t find the answer, there are a few useful functions to
+help you ask your peers:
+
+
R
+
+
+?dput
+
+
dput() will dump the data you’re working with into a format that can be
+copied and pasted by others into their own R session.
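As a minimal sketch of that round trip, the example below uses a small made-up data frame (`df` and its columns are assumptions for illustration, not data from this lesson). It prints the `dput()` representation, then rebuilds the object from that text the way a peer would after pasting it.

```r
# Hypothetical example data; any small object works the same way
df <- data.frame(id = c("a", "b", "c"), x = 1:3)

# dput() prints R code that recreates the object
dput(df)

# Capture that code as text, as if it had been pasted into a forum post
code <- paste(capture.output(dput(df)), collapse = "\n")

# Someone else can evaluate the pasted text to rebuild the object
df_copy <- eval(parse(text = code))
all.equal(df, df_copy)
```

For large data sets it is usually kinder to share only a small piece, for example `dput(head(df))`.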
+
+
R
+
+
+sessionInfo()
+
+
+
OUTPUT
+
+
R version 4.3.1 (2023-06-16)
+Platform: x86_64-pc-linux-gnu (64-bit)
+Running under: Ubuntu 22.04.3 LTS
+
+Matrix products: default
+BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
+LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
+
+locale:
+ [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
+ [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
+ [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
+[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
+
+time zone: UTC
+tzcode source: system (glibc)
+
+attached base packages:
+[1] stats graphics grDevices utils datasets methods base
+
+loaded via a namespace (and not attached):
+[1] compiler_4.3.1 tools_4.3.1 rstudioapi_0.15.0 yaml_2.3.7
+[5] knitr_1.43 xfun_0.40 renv_1.0.3 evaluate_0.21
+
+
sessionInfo() will print out your current version of R, as well as any packages you
+have loaded. This can be useful for others to help reproduce and debug
+your issue.
+
+
+
+
+
+
Challenge 1
+
+
+
Look at the help page for the c function. What kind of
+vector do you expect will be created if you evaluate the following:
+
+
R
+
+
+c(1, 2, 3)
+c('d', 'e', 'f')
+c(1, 2, 'f')
+
+
+
+
+
+
+
+
+
+
The c() function creates a vector, in which all elements
+are of the same type. In the first case, the elements are numeric, in
+the second, they are characters, and in the third they are also
+characters: the numeric values are “coerced” to be characters.
+
+
+
+
+
+
+
+
+
+
Challenge 2
+
+
+
Look at the help for the paste function. You will need
+to use it later. What’s the difference between the sep and
+collapse arguments?
+
+
+
+
+
+
+
+
+
To look at the help for the paste() function, use:
+
+
R
+
+
+help("paste")
+?paste
+
+
The difference between sep and collapse is
+a little tricky. The paste function accepts any number of
+arguments, each of which can be a vector of any length. The
+sep argument specifies the string used between concatenated
+terms — by default, a space. The result is a vector as long as the
+longest argument supplied to paste. In contrast,
+collapse specifies that after concatenation the elements
+are collapsed together using the given separator, the result
+being a single string.
+
It is important to call the arguments explicitly by typing out the
+argument name, e.g. sep = ",", so the function understands to
+use the “,” as a separator and not as a term to concatenate. For example:
+
+
R
+
+
+paste(c("a","b"), "c")
+
+
+
OUTPUT
+
+
[1] "a c" "b c"
+
+
+
R
+
+
+paste(c("a","b"), "c", ",")
+
+
+
OUTPUT
+
+
[1] "a c ," "b c ,"
+
+
+
R
+
+
+paste(c("a", "b"), "c", sep = ",")
+
+
+
OUTPUT
+
+
[1] "a,c" "b,c"
+
+
+
R
+
+
+paste(c("a", "b"), "c", collapse = "|")
+
+
+
OUTPUT
+
+
[1] "a c|b c"
+
+
+
R
+
+
+paste(c("a", "b"), "c", sep = ",", collapse = "|")
+
+
+
OUTPUT
+
+
[1] "a,c|b,c"
+
+
(For more information, scroll to the bottom of the
+?paste help page and look at the examples, or try
+example('paste').)
+
+
+
+
+
+
+
+
+
+
Challenge 3
+
+
+
Use help to find a function (and its associated parameters) that you
+could use to load data from a tabular file in which columns are
+delimited with “\t” (tab) and the decimal point is a “.” (period). This
+check for decimal separator is important, especially if you are working
+with international colleagues, because different countries have
+different conventions for the decimal point (i.e. comma vs period).
+Hint: use ??"read table" to look up functions related to
+reading in tabular data.
+
+
+
+
+
+
+
+
+
The standard R function for reading tab-delimited files with a period
+decimal separator is read.delim(). You can also do this with
+read.table(file, sep="\t") (the period is the
+default decimal separator for read.table()),
+although you may have to change the comment.char argument
+as well if your data file contains hash (#) characters.
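The two approaches in this solution can be compared side by side. The sketch below writes a tiny tab-delimited file to a temporary location first, so the file name and its contents are made up for illustration.

```r
# Write a small tab-delimited file to a temporary location (hypothetical data)
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("sample\tvalue", "a\t1.5", "b\t2.25"), tsv_path)

# read.delim() assumes sep = "\t" and dec = "." by default
d1 <- read.delim(tsv_path)

# The same with read.table(); here header = TRUE must be given explicitly
d2 <- read.table(tsv_path, sep = "\t", dec = ".", header = TRUE)

str(d1)
```

If your colleagues use a comma as the decimal mark, the `dec` argument is the one to change.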
To begin exploring data frames, and understand how they are related
+to vectors and lists.
+
To be able to ask questions from R about the type, class, and
+structure of an object.
+
To understand the information of the attributes “names”, “class”,
+and “dim”.
+
+
+
+
+
+
One of R’s most powerful features is its ability to deal with tabular
+data - such as you may already have in a spreadsheet or a CSV file.
+Let’s start by making a toy dataset in your data/
+directory, called feline-data.csv:
We can now save cats as a CSV file. It is good practice
+to call the argument names explicitly so the function knows what default
+values you are changing. Here we are setting
+row.names = FALSE. Recall you can use
+?write.csv to pull up the help file to check out the
+argument names and their default values.
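A minimal sketch of these two steps follows. The values are reconstructed from the output shown later in this lesson, and a temporary file stands in for data/feline-data.csv so that nothing in your project is overwritten.

```r
# Toy data set, reconstructed from the values printed later in this lesson
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

# A temporary file stands in for data/feline-data.csv here
csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE keeps the row numbers 1, 2, 3 out of the file
write.csv(x = cats, file = csv_path, row.names = FALSE)

# Look at the raw text that was written
readLines(csv_path)
```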
The read.table function is used for reading in tabular
+data stored in a text file where the columns of data are separated by
+punctuation characters, as in CSV files (csv = comma-separated values).
+Tabs and commas are the most common punctuation characters used to
+separate or delimit data points in such files. For convenience R provides
+two other versions of read.table:
+read.csv for files where the data are separated with commas
+and read.delim for files where the data are separated with
+tabs. Of these three functions read.csv is the most
+commonly used. If needed, it is possible to override the default
+delimiting punctuation marks for both read.csv and
+read.delim.
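As an illustration of overriding the defaults, the sketch below reads a file that separates fields with semicolons and uses a comma as the decimal mark. The file and its contents are made up for this example.

```r
# Hypothetical file that uses ";" between fields and "," as the decimal mark
path <- tempfile(fileext = ".csv")
writeLines(c("coat;weight", "calico;2,1", "black;5,0"), path)

# Override the defaults of read.csv() for this file
cats_semicolon <- read.csv(path, sep = ";", dec = ",")
cats_semicolon$weight
```

R also ships read.csv2(), which uses these two settings as its defaults.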
+
+
+
+
+
+
Check your data for factors
+
+
+
In recent versions of R, the default way R handles textual data has
+changed. Text data used to be converted automatically into a format
+called “factors”, but there is a simpler format called
+“character”. We will hear about factors later, and what to use them for.
+For now, remember that in most cases, they are not needed and only
+complicate your life, which is why newer R versions read in text as
+“character”. Check now if your version of R has automatically created
+factors and convert them to “character” format:
+
1. Check the data types of your input by typing
+str(cats)
+
+
2. In the output, look at the type codes after the colons: If
+you see only “num” and “chr”, you can continue with the lesson and skip
+this box. If you find “Factor”, continue to step 3.
+
3. Prevent R from automatically creating “factor” data. That can be
+done by the following code:
+options(stringsAsFactors = FALSE). Then, re-read the cats
+table for the change to take effect.
+
4. You must set this option every time you restart R. To not forget
+this, include it in your analysis script before you read in any data,
+for example in one of the first lines.
+
5. From R version 4.0.0 onward, text data is no longer converted
+to factors by default. So you can install this or a newer version to avoid
+this problem. If you are working on an institute or company computer,
+ask your administrator to do it.
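If re-reading the file is not convenient, factor columns that already exist can also be converted in place. This is one possible sketch, not the only way to do it. The data frame `cats_old` is a made-up stand-in that asks for factors explicitly so the example behaves the same in every R version.

```r
# Simulate the older behaviour by asking for factors explicitly
cats_old <- data.frame(coat = c("calico", "black", "tabby"),
                       weight = c(2.1, 5.0, 3.2),
                       stringsAsFactors = TRUE)
class(cats_old$coat)

# Convert every factor column to character and leave other columns alone
is_fct <- vapply(cats_old, is.factor, logical(1))
cats_old[is_fct] <- lapply(cats_old[is_fct], as.character)
class(cats_old$coat)
```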
+
+
+
+
We can begin exploring our dataset right away, pulling out columns by
+specifying them using the $ operator:
+
+
R
+
+
+cats$weight
+
+
+
OUTPUT
+
+
[1] 2.1 5.0 3.2
+
+
+
R
+
+
+cats$coat
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
We can do other operations on the columns:
+
+
R
+
+
+## Say we discovered that the scale weighs two kg light:
+cats$weight + 2
+
+
+
OUTPUT
+
+
[1] 4.1 7.0 5.2
+
+
+
R
+
+
+paste("My cat is", cats$coat)
+
+
+
OUTPUT
+
+
[1] "My cat is calico" "My cat is black" "My cat is tabby"
+
+
But what about
+
+
R
+
+
+cats$weight + cats$coat
+
+
+
ERROR
+
+
Error in cats$weight + cats$coat: non-numeric argument to binary operator
+
+
Understanding what happened here is key to successfully analyzing
+data in R.
+
+
Data Types
+
If you guessed that the last command will return an error because
+2.1 plus "calico" is nonsense, you’re right -
+and you already have some intuition for an important concept in
+programming called data types. We can ask what type of data
+something is:
+
+
R
+
+
+typeof(cats$weight)
+
+
+
OUTPUT
+
+
[1] "double"
+
+
There are 5 main types: double, integer,
+complex, logical and character.
+For historic reasons, double is also called
+numeric.
+
+
R
+
+
+typeof(3.14)
+
+
+
OUTPUT
+
+
[1] "double"
+
+
+
R
+
+
+typeof(1L) # The L suffix forces the number to be an integer, since by default R uses floating-point numbers
+
+
+
OUTPUT
+
+
[1] "integer"
+
+
+
R
+
+
+typeof(1+1i)
+
+
+
OUTPUT
+
+
[1] "complex"
+
+
+
R
+
+
+typeof(TRUE)
+
+
+
OUTPUT
+
+
[1] "logical"
+
+
+
R
+
+
+typeof('banana')
+
+
+
OUTPUT
+
+
[1] "character"
+
+
No matter how complicated our analyses become, all data in R is
+interpreted as one of these basic data types. This strictness has some
+really important consequences.
+
A user has added details of another cat. This information is in the
+file data/feline-data_v2.csv.
+
+
R
+
+
+file.show("data/feline-data_v2.csv")
+
+
+
R
+
+
coat,weight,likes_string
+calico,2.1,1
+black,5.0,0
+tabby,3.2,1
+tabby,2.3 or 2.4,1
+
+
Load the new cats data like before, and check what type of data we
+find in the weight column:
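A sketch of this step is shown below. To keep it self-contained, the file contents shown above are passed to read.csv() as text instead of being read from data/feline-data_v2.csv, and stringsAsFactors = FALSE is set so the result matches the default of R 4.0.0 and later.

```r
# Contents of data/feline-data_v2.csv, supplied inline for this sketch
csv_text <- "coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
tabby,2.3 or 2.4,1"

# stringsAsFactors = FALSE matches the default of R 4.0.0 and later
cats <- read.csv(text = csv_text, stringsAsFactors = FALSE)
typeof(cats$weight)
```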
Oh no, our weights aren’t the double type anymore! If we try to do
+the same math we did on them before, we run into trouble:
+
+
R
+
+
+cats$weight + 2
+
+
+
ERROR
+
+
Error in cats$weight + 2: non-numeric argument to binary operator
+
+
What happened? The cats data we are working with is
+something called a data frame. Data frames are one of the most
+common and versatile types of data structures we will work with
+in R. A given column in a data frame cannot be composed of different
+data types. In this case, R does not read everything in the data frame
+column weight as a double, therefore the entire
+column data type changes to something that is suitable for everything in
+the column.
+
When R reads a csv file, it reads it in as a data frame.
+Thus, when we loaded the cats csv file, it is stored as a
+data frame. We can recognize data frames by the first row that is
+written by the str() function:
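As a small sketch, with the three-row cats data rebuilt by hand so the example runs on its own, the first line printed by str() names the class and gives the number of observations and variables:

```r
# Three-row cats data, rebuilt by hand for this sketch
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

# Full structure: one line for the data frame, then one per column
str(cats)

# Only the first line, which identifies the object as a data frame
first_line <- capture.output(str(cats))[1]
first_line
```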
Data frames are composed of rows and columns, where each
+column has the same number of rows. Different columns in a data frame
+can be made up of different data types (this is what makes them so
+versatile), but everything in a given column needs to be the same type
+(e.g., vector, factor, or list).
+
Let’s explore more about different data structures and how they
+behave. For now, let’s remove that extra line from our cats data and
+reload it, while we investigate this behavior further:
To better understand this behavior, let’s meet another of the data
+structures: the vector.
+
+
R
+
+
+my_vector <- vector(length = 3)
+my_vector
+
+
+
OUTPUT
+
+
[1] FALSE FALSE FALSE
+
+
A vector in R is essentially an ordered list of things, with the
+special condition that everything in the vector must be the same
+basic data type. If you don’t choose the datatype, it’ll default to
+logical; or, you can declare an empty vector of whatever
+type you like.
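For instance, an empty character vector can be declared with the `mode` argument and then inspected with str(). The variable name `another_vector` is chosen here for illustration.

```r
# Declare an empty character vector of length 3
another_vector <- vector(mode = "character", length = 3)
another_vector

# Inspect its structure
str(another_vector)
```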
The somewhat cryptic output from this command indicates the basic
+data type found in this vector - in this case chr,
+character; an indication of the number of things in the vector -
+actually, the indexes of the vector, in this case [1:3];
+and a few examples of what’s actually in the vector - in this case empty
+character strings. If we similarly do
+
+
R
+
+
+str(cats$weight)
+
+
+
OUTPUT
+
+
num [1:3] 2.1 5 3.2
+
+
we see that cats$weight is a vector, too - the
+columns of data we load into R data.frames are all vectors, and
+that’s the root of why R forces everything in a column to be the same
+basic data type.
+
+
+
+
+
+
Discussion 1
+
+
+
Why is R so opinionated about what we put in our columns of data? How
+does this help us?
+
+
+
+
+
+
By keeping everything in a column the same, we allow ourselves to
+make simple assumptions about our data; if you can interpret one entry
+in the column as a number, then you can interpret all of them
+as numbers, so we don’t have to check every time. This consistency is
+what people mean when they talk about clean data; in the long
+run, strict consistency goes a long way to making our lives easier in
+R.
+
+
+
+
+
+
+
+
+
Coercion by combining vectors
+
You can also make vectors with explicit contents with the combine
+function:
+
+
R
+
+
+combine_vector <- c(2, 6, 3)
+combine_vector
+
+
+
OUTPUT
+
+
[1] 2 6 3
+
+
Given what we’ve learned so far, what do you think the following will
+produce?
+
+
R
+
+
+quiz_vector <- c(2, 6, '3')
+
+
This is something called type coercion, and it is the source
+of many surprises and the reason why we need to be aware of the basic
+data types and how R will interpret them. When R encounters a mix of
+types (here double and character) to be combined into a single vector,
+it will force them all to be the same type. Consider:
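A short sketch of what coercion does in practice. The first vector is the quiz vector from above; the other two variable names are chosen here for illustration.

```r
# Mixing double and character: everything becomes character
quiz_vector <- c(2, 6, '3')
quiz_vector
typeof(quiz_vector)

# Mixing character and logical: the logical value becomes the string "TRUE"
coercion_vector <- c('a', TRUE)
typeof(coercion_vector)

# Mixing double and logical: TRUE becomes 1
another_coercion_vector <- c(0, TRUE)
another_coercion_vector
```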
The coercion rules go: logical ->
+integer -> double (“numeric”)
+-> complex -> character, where -> can
+be read as are transformed into. For example, combining
+logical and character transforms the result to
+character:
+
+
R
+
+
+c('a', TRUE)
+
+
+
OUTPUT
+
+
[1] "a" "TRUE"
+
+
A quick way to recognize character vectors is by the
+quotes that enclose them when they are printed.
+
You can try to force coercion against this flow using the
+as. functions:
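The sketch below walks a character vector back down the chain with as.double() and as.logical(). The variable names are made up for this example.

```r
# Start from numbers stored as text
character_vector_example <- c('0', '2', '4')

# Text that looks like a number converts cleanly to double
character_coerced_to_double <- as.double(character_vector_example)
character_coerced_to_double

# For numbers, 0 becomes FALSE and every other value becomes TRUE
double_coerced_to_logical <- as.logical(character_coerced_to_double)
double_coerced_to_logical

# Going straight from these strings to logical gives NA instead
as.logical(character_vector_example)
```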
As you can see, some surprising things can happen when R forces one
+basic data type into another! Nitty-gritty of type coercion aside, the
+point is: if your data doesn’t look like what you thought it was going
+to look like, type coercion may well be to blame; make sure everything
+is the same type in your vectors and your columns of data.frames, or you
+will get nasty surprises!
+
But coercion can also be very useful! For example, in our
+cats data likes_string is numeric, but we know
+that the 1s and 0s actually represent TRUE and
+FALSE (a common way of representing them). We should use
+the logical datatype here, which has two states:
+TRUE or FALSE, which is exactly what our data
+represents. We can ‘coerce’ this column to be logical by
+using the as.logical function:
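A minimal sketch of that conversion, with the three-row cats data rebuilt by hand so the example runs on its own:

```r
# Three-row cats data, rebuilt by hand for this sketch
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

# Before: the column holds the numbers 1 and 0
cats$likes_string

# Overwrite the column with its logical equivalent
cats$likes_string <- as.logical(cats$likes_string)
cats$likes_string
```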
An important part of every data analysis is cleaning the input data.
+If you know that the input data is all of the same format
+(e.g. numbers), your analysis is much easier! Clean the cat data set
+from the chapter about type coercion.
+
+
Copy the code template
+
Create a new script in RStudio and copy and paste the following code.
+Then move on to the tasks below, which help you to fill in the gaps
+(______).
+
# Read data
+cats <- read.csv("data/feline-data_v2.csv")
+
+# 1. Print the data
+_____
+
+# 2. Show an overview of the table with all data types
+_____(cats)
+
+# 3. The "weight" column has the incorrect data type __________.
+# The correct data type is: ____________.
+
+# 4. Correct the 4th weight data point with the mean of the two given values
+cats$weight[4] <- 2.35
+# print the data again to see the effect
+cats
+
+# 5. Convert the weight to the right data type
+cats$weight <- ______________(cats$weight)
+
+# Calculate the mean to test yourself
+mean(cats$weight)
+
+# If you see the correct mean value (and not NA), you did the exercise
+# correctly!
+
+
+
Instructions for the tasks
+
+
1. Print the data
+
Execute the first statement (read.csv(...)). Then print
+the data to the console.
+
+
+
+
+
+
+
+
+
+
+
Show the content of any variable by typing its name.
+
+
Solution to Challenge 1.1
+
Two correct solutions:
+
cats
+print(cats)
+
+
+
+
+
+
+
+
+
+
+
2. Overview of the data types
+
+
+
The data type of your data is as important as the data itself. Use a
+function we saw earlier to print out the data types of all columns of
+the cats table.
+
+
+
+
+
+
+
+
+
In the chapter “Data types” we saw two functions that can show data
+types. One printed just a single word, the data type name. The other
+printed a short form of the data type, and the first few values. We need
+the second here.
+
+
+
+
+
+
+
+
+
+
Challenge 1 (continued)
+
+
+
+
Solution to Challenge 1.2
+
str(cats)
+
+
+
3. Which data type do we need?
+
The shown data type is not the right one for this data (weight of a
+cat). Which data type do we need?
+
Why did the read.csv() function not choose the correct
+data type?
+
Fill in the gap in the comment with the correct data type for cat
+weight!
+
+
+
+
+
+
+
+
+
+
Scroll up to the section about the type
+hierarchy to review the available data types
+
+
+
+
+
+
+
+
+
+
Weight is expressed on a continuous scale (real numbers). The R data
+type for this is “double” (also known as “numeric”).
+
The fourth row has the value “2.3 or 2.4”. That is not one number but
+two, plus an English word. Therefore, the “character” data type is
+chosen. The whole column is now text, because all values in the same
+column have to be the same data type.
+
+
+
+
+
+
+
+
+
+
4. Correct the problematic value
+
+
+
The code to assign a new weight value to the problematic fourth row
+is given. Think first and then execute it: What will be the data type
+after assigning a number like in this example? You can check the data
+type after executing to see if you were right.
+
+
+
+
+
+
+
+
+
Revisit the hierarchy of data types when two different data types are
+combined.
+
+
+
+
+
+
+
+
+
+
Challenge 1 (continued)
+
+
+
+
Solution to Challenge 1.4
+
The data type of the column “weight” is “character”. The assigned
+data type is “double”. Combining two data types yields the data type
+that is higher in the following hierarchy:
+
logical < integer < double < complex < character
+
Therefore, the column is still of type character! We need to manually
+convert it to “double”.
+
+
+
5. Convert the column “weight” to the correct data type
+
Cat weights are numbers. But the column does not have this data type
+yet. Coerce the column to floating point numbers.
+
+
+
+
+
+
+
+
+
+
The functions to convert data types start with as.. You
+can look for the function further up in the manuscript or use the
+RStudio auto-complete function: Type “as.” and then press
+the TAB key.
+
+
+
+
+
+
+
+
+
+
Challenge 1 (continued)
+
+
+
+
Solution to Challenge 1.5
+
There are two functions that are synonymous for historic reasons:
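The two functions are as.double() and as.numeric(). The sketch below applies both to a stand-in for the weight column after step 4 (the vector `weight` is typed in by hand here) and checks that they give the same result.

```r
# Stand-in for cats$weight after step 4: still stored as text
weight <- c("2.1", "5.0", "3.2", "2.35")

# Both functions perform the same conversion
as.double(weight)
as.numeric(weight)
identical(as.double(weight), as.numeric(weight))

# Once the values are numbers, the mean can be calculated
mean(as.double(weight))
```

In the exercise template, either function name fills the gap in step 5.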
To change a single element, use the bracket on the other side of the
+arrow:
+
+
R
+
+
+sequence_example[1] <- 30
+sequence_example
+
+
+
OUTPUT
+
+
[1] 30 21 22 23 24 25
+
+
+
+
+
+
+
Challenge 2
+
+
+
Start by making a vector with the numbers 1 through 26. Then,
+multiply the vector by 2.
+
+
+
+
+
+
+
+
+
+
R
+
+
+x <- 1:26
+x <- x * 2
+
+
+
+
+
+
+
+
Lists
+
Another data structure you’ll want in your bag of tricks is the
+list. A list is simpler in some ways than the other types,
+because you can put anything you want in it. Remember, everything in
+a vector must be of the same basic data type, but a list can have
+different data types:
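A sketch of such a list, with four elements of four different types. The values are chosen to match the str() output shown just below.

```r
# A list holding a double, a character, a logical and a complex value
list_example <- list(1, "a", TRUE, 1+4i)
list_example

# Each element keeps its own type
vapply(list_example, typeof, character(1))
```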
When printing the object structure with str(), we see
+the data types of all elements:
+
+
R
+
+
+str(list_example)
+
+
+
OUTPUT
+
+
List of 4
+ $ : num 1
+ $ : chr "a"
+ $ : logi TRUE
+ $ : cplx 1+4i
+
+
What is the use of lists? They can organize data of different
+types. For example, you can organize different tables that
+belong together, similar to spreadsheets in Excel. But there are many
+other uses, too.
+
We will see another example that will maybe surprise you in the next
+chapter.
+
To retrieve one of the elements of a list, use the double
+bracket:
+
+
R
+
+
+list_example[[2]]
+
+
+
OUTPUT
+
+
[1] "a"
+
+
The elements of lists can also have names. They can
+be given by prepending them to the values, separated by an equals
+sign:
+
+
R
+
+
+another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE)
+another_list
This results in a named list. The names give our object a new
+capability: we can now access single elements in an additional
+way!
+
+
R
+
+
+another_list$title
+
+
+
OUTPUT
+
+
[1] "Numbers"
+
+
+
Names
+
+
With names, we can give meaning to elements. It is the first time
+that we have not only the data, but also information that explains
+it. It is metadata that can be stuck to the object
+like a label. In R, this is called an attribute. Some
+attributes enable us to do more with our object, for example, like here,
+accessing an element by a self-defined name.
+
+
Accessing vectors and lists by name
+
We have already seen how to generate a named list. The way to
+generate a named vector is very similar. You have seen this function
+before:
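A sketch using c() with names in front of the values. The three names and the price of pizzasubito come from the output in this section; the other two prices are made-up values for illustration.

```r
# Named vector; the prices for pizzafresh and callapizza are assumed values
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)
pizza_price
```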
The way to retrieve elements is different, though:
+
+
R
+
+
+pizza_price["pizzasubito"]
+
+
+
OUTPUT
+
+
pizzasubito
+5.64
+
+
The approach used for the list does not work:
+
+
R
+
+
+pizza_price$pizzafresh
+
+
+
ERROR
+
+
Error in pizza_price$pizzafresh: $ operator is invalid for atomic vectors
+
+
It will pay off if you remember this error message, because you will meet it
+in your own analyses. It means that you have just tried accessing an
+element like it was in a list, but it is actually in a vector.
+
+
+
Accessing and changing names
+
If you are only interested in the names, use the names()
+function:
+
+
R
+
+
+names(pizza_price)
+
+
+
OUTPUT
+
+
[1] "pizzasubito" "pizzafresh" "callapizza"
+
+
We have seen how to access and change single elements of a vector.
+The same is possible for names:
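As a sketch, the third name is replaced below by indexing into names(). The vector is rebuilt here so the example runs on its own; two of its prices and the new name "call-a-pizza" are assumptions for illustration.

```r
# Rebuild the named vector (two of the prices are assumed values)
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)

# Access one name, then overwrite it
names(pizza_price)[3]
names(pizza_price)[3] <- "call-a-pizza"
names(pizza_price)
```

The values themselves are untouched; only the label attached to the third element changes.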
What is the data type of the names of pizza_price? You
+can find out using the str() or typeof()
+functions.
+
+
+
+
+
+
+
+
+
You get the names of an object by wrapping the object name inside
+names(...). Similarly, you get the data type of the names
+by again wrapping the whole code in typeof(...):
+
typeof(names(pizza_price))
+
Alternatively, use a new variable if this is easier for you to
+read:
+
n <- names(pizza_price)
+typeof(n)
+
+
+
+
+
+
+
+
+
+
Challenge 4
+
+
+
Instead of just changing some of the names a vector/list already has,
+you can also set all names of an object by writing code like (replace
+ALL CAPS text):
+
names(OBJECT) <- CHARACTER_VECTOR
+
Create a vector that gives the number for each letter in the
+alphabet!
+
Generate a vector called letter_no with the sequence of
+numbers from 1 to 26!
+
R has a built-in object called LETTERS. It is a
+26-character vector, from A to Z. Set the names of the number sequence
+to these 26 letters.
+
Test yourself by calling letter_no["B"], which should
+give you the number 2!
+
+
+
+
+
+
+
+
+
letter_no <- 1:26 # or seq(1,26)
+names(letter_no) <- LETTERS
+letter_no["B"]
+
+
+
+
+
+
Data frames
+
+
We saw data frames at the very beginning of this lesson; they
+represent a table of data. We didn’t go much further into detail with
+our example cat data frame:
We can now understand something a bit surprising in our data.frame;
+what happens if we run:
+
+
R
+
+
+typeof(cats)
+
+
+
OUTPUT
+
+
[1] "list"
+
+
We see that data.frames look like lists ‘under the hood’. Think again about
+what we learned lists can be used for:
+
+
Lists organize data of different types
+
+
Columns of a data frame are vectors of different types, that are
+organized by belonging to the same table.
+
A data.frame is really a list of vectors. It is a special list in
+which all the vectors must have the same length.
+
How is this “special”-ness written into the object, so that R does
+not treat it like any other list, but as a table?
+
+
R
+
+
+class(cats)
+
+
+
OUTPUT
+
+
[1] "data.frame"
+
+
A class, just like names, is an attribute attached
+to the object. It tells us what this object means for humans.
+
You might wonder: Why do we need another
+what-type-of-object-is-this-function? We already have
+typeof()? That function tells us how the object is
+constructed in the computer. The class is
+the meaning of the object for humans. Consequently,
+what typeof() returns is fixed in R (mainly the
+five data types), whereas the output of class() is
+diverse and extendable by R packages.
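The difference is easiest to see on objects where the two functions disagree. The sketch below uses a date, a factor and a small data frame; all three are made-up examples.

```r
# A date is stored as a double but means "Date" to humans
today <- as.Date("2024-01-31")
typeof(today)
class(today)

# A factor is stored as integer codes
coat <- factor(c("calico", "black"))
typeof(coat)
class(coat)

# A data frame is stored as a list
df <- data.frame(x = 1:3)
typeof(df)
class(df)
```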
+
In our cats example, we have a character, a double and a
+logical variable. As we have seen already, each column of data.frame is
+a vector.
+
+
R
+
+
+cats$coat
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
+
R
+
+
+cats[,1]
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
+
R
+
+
+typeof(cats[,1])
+
+
+
OUTPUT
+
+
[1] "character"
+
+
+
R
+
+
+str(cats[,1])
+
+
+
OUTPUT
+
+
chr [1:3] "calico" "black" "tabby"
+
+
Each row is an observation of different variables, itself a
+data.frame, and thus can be composed of elements of different types.
There are several subtly different ways to call variables,
+observations and elements from data.frames:
+
cats[1]
+
cats[[1]]
+
cats$coat
+
cats["coat"]
+
cats[1, 1]
+
cats[, 1]
+
cats[1, ]
+
Try out these examples and explain what is returned by each one.
+
Hint: Use the function typeof() to examine what
+is returned in each case.
+
+
+
+
+
+
+
+
+
+
R
+
+
+cats[1]
+
+
+
OUTPUT
+
+
coat
+1 calico
+2 black
+3 tabby
+
+
We can think of a data frame as a list of vectors. The single bracket
+[1] returns the first slice of the list, as another list.
+In this case it is the first column of the data frame.
+
+
R
+
+
+cats[[1]]
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
The double bracket [[1]] returns the contents of the list
+item. In this case it is the contents of the first column, a
+vector of type character.
+
+
R
+
+
+cats$coat
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
This example uses the $ character to address items by
+name. coat is the first column of the data frame, again a
+vector of type character.
+
+
R
+
+
+cats["coat"]
+
+
+
OUTPUT
+
+
coat
+1 calico
+2 black
+3 tabby
+
+
Here we are using a single bracket ["coat"], replacing the
+index number with the column name. Like example 1, the returned object
+is a list.
+
+
R
+
+
+cats[1, 1]
+
+
+
OUTPUT
+
+
[1] "calico"
+
+
This example uses a single bracket, but this time we provide row and
+column coordinates. The returned object is the value in row 1, column 1.
+The object is a vector of type character.
+
+
R
+
+
+cats[, 1]
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
Like the previous example, we use single brackets and provide row and
+column coordinates. Because the row coordinate is not specified, R interprets
+this missing value as all the elements in this column and
+returns them as a vector.
+
+
R
+
+
+cats[1, ]
+
+
+
OUTPUT
+
+
coat weight likes_string
+1 calico 2.1 TRUE
+
+
Again we use the single bracket with row and column coordinates. The
+column coordinate is not specified. The return value is a list
+containing all the values in the first row.
+
+
+
+
+
+
+
+
+
+
Tip: Renaming data frame columns
+
+
+
Data frames have column names, which can be accessed with the
+names() function.
+
+
R
+
+
+names(cats)
+
+
+
OUTPUT
+
+
[1] "coat" "weight" "likes_string"
+
+
If you want to rename the second column of cats, you can
+assign a new name to the second element of names(cats).
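A sketch of that renaming, with the cats data rebuilt by hand. The new column name `weight_kg` is chosen here for illustration.

```r
# Three-row cats data, rebuilt by hand for this sketch
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(TRUE, FALSE, TRUE))

# Assign a new name to the second element of names(cats)
names(cats)[2] <- "weight_kg"
names(cats)
```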
Because a matrix is a vector with added dimension attributes,
+length gives you the total number of elements in the
+matrix.
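A quick sketch with a matrix of zeros; the name `matrix_example` and the chosen size are assumptions for illustration.

```r
# A matrix with 3 rows and 6 columns, filled with zeros
matrix_example <- matrix(0, ncol = 6, nrow = 3)

# dim() reports rows and columns
dim(matrix_example)

# length() reports the total number of elements, rows times columns
length(matrix_example)
```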
+
+
+
+
+
+
+
+
+
+
Challenge 7
+
+
+
Make another matrix, this time containing the numbers 1:50, with 5
+columns and 10 rows. Did the matrix function fill your
+matrix by column, or by row, as its default behaviour? See if you can
+figure out how to change this. (hint: read the documentation for
+matrix!)
+
+
+
+
+
+
+
+
+
+
+
R
+
+
+x <- matrix(1:50, ncol = 5, nrow = 10)
+x <- matrix(1:50, ncol = 5, nrow = 10, byrow = TRUE) # to fill by row
+
+
+
+
+
+
+
+
+
+
+
Challenge 8
+
+
+
Create a list of length two containing a character vector for each of
+the sections in this part of the workshop:
+
Data types
+
Data structures
+
Populate each character vector with the names of the data types and
+data structures we’ve seen so far.
Note: it’s nice to make a list in big writing on the board or taped
+to the wall listing all of these types and structures - leave it up for
+the rest of the workshop to remind people of the importance of these
+basics.
+
+
+
+
+
+
+
+
+
+
Challenge 9
+
+
+
Consider the R output of the matrix below:
+
+
OUTPUT
+
+
[,1] [,2]
+[1,] 4 1
+[2,] 9 5
+[3,] 10 7
+
+
What was the correct command used to write this matrix? Examine each
+command and try to figure out the correct one before typing them. Think
+about what matrices the other commands will produce.
Display basic properties of data frames including size and class of
+the columns, names, and first few rows.
+
+
+
+
+
+
At this point, you’ve seen it all: in the last lesson, we toured all
+the basic data types and data structures in R. Everything you do will be
+a manipulation of those tools. But most of the time, the star of the
+show is the data frame—the table that we created by loading information
+from a csv file. In this lesson, we’ll learn a few more things about
+working with data frames.
+
Adding columns and rows in data frames
+
+
We already learned that the columns of a data frame are vectors, so
+that our data are consistent in type throughout the columns. As such, if
+we want to add a new column, we can start by making a new vector:
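A minimal sketch, assuming the three-row cats data and the ages that appear in the table printed below. The new vector must have one value per row, otherwise cbind() stops with an error.

```r
# Three-row cats data, rebuilt by hand for this sketch
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1))

# One age per cat, in the same order as the rows
age <- c(2, 3, 5)

# cbind() attaches the vector as a new column named after the variable
cats <- cbind(cats, age)
cats
```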
coat weight likes_string age
+1 calico 2.1 1 2
+2 black 5.0 0 3
+3 tabby 3.2 1 5
+
+
Notice the comma with nothing after it to indicate that we want to
+drop the entire fourth row.
+
Note: we could also remove several rows at once by putting the row
+numbers inside of a vector, for example:
+cats[c(-3,-4), ]
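Both forms of negative indexing are sketched below on a four-row version of the cats data. The fourth row is a made-up entry for illustration.

```r
# Four-row cats data; the last row is a hypothetical extra cat
cats <- data.frame(coat = c("calico", "black", "tabby", "tortoiseshell"),
                   weight = c(2.1, 5.0, 3.2, 3.3),
                   likes_string = c(TRUE, FALSE, TRUE, TRUE))

# Drop the fourth row; the empty slot after the comma keeps all columns
without_fourth <- cats[-4, ]

# Drop the third and fourth rows at once
without_last_two <- cats[c(-3, -4), ]

nrow(without_fourth)
nrow(without_last_two)
```

Neither command changes cats itself; assign the result back to cats to make the removal permanent.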
+
Removing columns
+
+
We can also remove columns in our data frame. What if we want to
+remove the column “age”? We can remove it in two ways, by column
+number or by column name.
Notice the comma with nothing before it, indicating we want to keep
+all of the rows.
+
Alternatively, we can drop the column by using its name and the
+%in% operator. The %in% operator goes through
+each element of its left argument, in this case the names of
+cats, and asks, “Does this element occur in the second
+argument?”
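Both ways of removing the column can be sketched as follows, with the cats data including the age column rebuilt by hand. The names `by_number`, `by_name` and `drop` are chosen here for illustration.

```r
# Cats data including the age column, rebuilt by hand for this sketch
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(TRUE, FALSE, TRUE),
                   age = c(2, 3, 5))

# By column number: a negative index drops the fourth column
by_number <- cats[, -4]

# By column name: %in% marks the matching names, ! inverts the selection
drop <- names(cats) %in% c("age")
drop
by_name <- cats[, !drop]

names(by_name)
```

Dropping by name keeps working if the column order changes later, which makes it the more robust choice in scripts.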
The key to remember when adding data to a data frame is that
+columns are vectors and rows are lists. We can also glue two
+data frames together with rbind:
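A sketch of both ideas: a new row is supplied as a list because it mixes types, and two data frames with the same columns are stacked with rbind(). The values in `new_row` are made up, and stringsAsFactors = FALSE is set so the example behaves the same in older R versions.

```r
# Three-row cats data, rebuilt by hand for this sketch
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(TRUE, FALSE, TRUE),
                   stringsAsFactors = FALSE)

# A row mixes types, so it is supplied as a list (hypothetical values)
new_row <- list("tortoiseshell", 3.3, TRUE)
cats <- rbind(cats, new_row)

# Glue the data frame onto itself, one copy below the other
cats_doubled <- rbind(cats, cats)
nrow(cats_doubled)
```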
You can create a new data frame right from within R with the
+following syntax:
+
+
R
+
+
+df <- data.frame(id = c("a", "b", "c"),
+                 x = 1:3,
+                 y = c(TRUE, TRUE, FALSE))
+
+
Make a data frame that holds the following information for
+yourself:
+
first name
+
last name
+
lucky number
+
Then use rbind to add an entry for the people sitting
+beside you. Finally, use cbind to add a column with each
+person’s answer to the question, “Is it time for coffee break?”
So far, you have seen the basics of manipulating data frames with our
+cat data; now let’s use those skills to digest a more realistic dataset.
+Let’s read in the gapminder dataset that we downloaded
+previously:
+
+
R
+
+
+gapminder <- read.csv("data/gapminder_data.csv")
+
+
+
+
+
+
+
Miscellaneous Tips
+
+
+
Another type of file you might encounter is the tab-separated value
+file (.tsv). To specify a tab as a separator, use sep = "\t" or
+read.delim().
+
Files can also be downloaded directly from the Internet into a
+local folder of your choice onto your computer using the
+download.file function. The read.csv function
+can then be executed to read the downloaded file from the download
+location, for example,
Alternatively, you can read files directly into R from the
+Internet by replacing the file path with a web address in
+read.csv. Note that in doing this no local copy
+of the csv file is saved onto your computer.
You can read directly from Excel spreadsheets without converting
+them to plain text first by using the readxl
+package.
+
The argument “stringsAsFactors” tells R whether to
+read strings as factors or as character strings. In R 4.0.0 and
+later, strings are read in as characters by default, but in
+earlier versions of R, strings are read in as factors by default. For
+more information, see the call-out in the
+previous episode.
+
+
+
+
Let’s investigate gapminder a bit; the first thing we should always
+do is check out what the data looks like with str:
+
+
R
+
+
+str(gapminder)
+
+
+
OUTPUT
+
+
'data.frame': 1704 obs. of 6 variables:
+ $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp : num 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num 779 821 853 836 740 ...
+
+
An additional method for examining the structure of gapminder is to
+use the summary function. This function can be used on
+various objects in R. For data frames, summary yields a
+numeric, tabular, or descriptive summary of each column. Numeric or
+integer columns are described by the descriptive statistics (quartiles
+and mean), and character columns by their length, class, and mode.
+
+
R
+
+
+summary(gapminder)
+
+
+
OUTPUT
+
+
country year pop continent
+ Length:1704 Min. :1952 Min. :6.001e+04 Length:1704
+ Class :character 1st Qu.:1966 1st Qu.:2.794e+06 Class :character
+ Mode :character Median :1980 Median :7.024e+06 Mode :character
+ Mean :1980 Mean :2.960e+07
+ 3rd Qu.:1993 3rd Qu.:1.959e+07
+ Max. :2007 Max. :1.319e+09
+ lifeExp gdpPercap
+ Min. :23.60 Min. : 241.2
+ 1st Qu.:48.20 1st Qu.: 1202.1
+ Median :60.71 Median : 3531.8
+ Mean :59.47 Mean : 7215.3
+ 3rd Qu.:70.85 3rd Qu.: 9325.5
+ Max. :82.60 Max. :113523.1
+
+
Along with the str and summary functions,
+we can examine individual columns of the data frame with our
+typeof function:
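For instance, using a small stand-in with the same column types as gapminder (the name `gapminder_small` and its values are assumptions for illustration), individual columns can be inspected like this:

```r
# Stand-in data frame with a character, an integer, and a double column
gapminder_small <- data.frame(
  country = rep("Afghanistan", 3),
  year = c(1952L, 1957L, 1962L),
  lifeExp = c(28.8, 30.3, 32.0),
  stringsAsFactors = FALSE
)

typeof(gapminder_small$country)
typeof(gapminder_small$year)
typeof(gapminder_small$lifeExp)
```

With the full dataset, the same calls would be `typeof(gapminder$year)` and so on.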
We can also interrogate the data frame for information about its
+dimensions; remembering that str(gapminder) said there were
+1704 observations of 6 variables in gapminder, what do you think the
+following will produce, and why?
+
+
R
+
+
+length(gapminder)
+
+
+
OUTPUT
+
+
[1] 6
+
+
A fair guess would have been to say that the length of a data frame
+would be the number of rows it has (1704), but this is not the case;
+remember, a data frame is a list of vectors and factors:
+
+
R
+
+
+typeof(gapminder)
+
+
+
OUTPUT
+
+
[1] "list"
+
+
When length gave us 6, it’s because gapminder is built
+out of a list of 6 columns. To get the number of rows and columns in our
+dataset, try:
+
+
R
+
+
+nrow(gapminder)
+
+
+
OUTPUT
+
+
[1] 1704
+
+
+
R
+
+
+ncol(gapminder)
+
+
+
OUTPUT
+
+
[1] 6
+
+
Or, both at once:
+
+
R
+
+
+dim(gapminder)
+
+
+
OUTPUT
+
+
[1] 1704 6
+
+
We’ll also likely want to know what the titles of all the columns
+are, so we can ask for them later:
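A minimal sketch, again assuming a one-row stand-in named `gapminder_small` with the same column names as gapminder (the values are illustrative):

```r
# One-row stand-in with the six gapminder column names
gapminder_small <- data.frame(country = "Afghanistan", year = 1952L,
                              pop = 8425333, continent = "Asia",
                              lifeExp = 28.8, gdpPercap = 779.4)

colnames(gapminder_small)
```

On the full dataset the call would be `colnames(gapminder)`.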
At this stage, it’s important to ask ourselves if the structure R is
+reporting matches our intuition or expectations; do the basic data types
+reported for each column make sense? If not, we need to sort any
+problems out now before they turn into bad surprises down the road,
+using what we’ve learned about how R interprets data, and the importance
+of strict consistency in how we record our data.
+
Once we’re happy that the data types and structures seem reasonable,
+it’s time to start digging into our data proper. Check out the first few
+lines:
+
+
R
+
+
+head(gapminder)
+
+
+
OUTPUT
+
+
country year pop continent lifeExp gdpPercap
+1 Afghanistan 1952 8425333 Asia 28.801 779.4453
+2 Afghanistan 1957 9240934 Asia 30.332 820.8530
+3 Afghanistan 1962 10267083 Asia 31.997 853.1007
+4 Afghanistan 1967 11537966 Asia 34.020 836.1971
+5 Afghanistan 1972 13079460 Asia 36.088 739.9811
+6 Afghanistan 1977 14880372 Asia 38.438 786.1134
+
+
+
+
+
+
+
Challenge 2
+
+
+
It’s good practice to also check the last few lines of your data and
+some in the middle. How would you do this?
+
Searching for ones specifically in the middle isn’t too hard, but we
+could ask for a few lines at random. How would you code this?
+
+
+
+
+
+
+
+
+
Checking the last few lines is relatively simple, as R already has a
+function for this:
+
+
R
+
+
+tail(gapminder)
+tail(gapminder, n = 15)
+
+
What about a few arbitrary rows just in case something is odd in the
+middle?
+
+
Tip: There are several ways to achieve this.
+
The solution here presents one form of using nested functions, i.e. a
+function passed as an argument to another function. This might sound
+like a new concept, but you are already using it! Remember
+my_dataframe[rows, cols] will print to screen your data frame with the
+number of rows and columns you asked for (although you might have asked
+for a range or named columns for example). How would you get the last
+row if you don’t know how many rows your data frame has? R has a
+function for this. What about getting a (pseudorandom) sample? R also
+has a function for this.
+
+
R
+
+
+gapminder[sample(nrow(gapminder), 5), ]
+
+
+
+
+
+
+
To make sure our analysis is reproducible, we should put the code
+into a script file so we can come back to it later.
+
+
+
+
+
+
Challenge 3
+
+
+
Go to File -> New File -> R Script, and write an R script to
+load in the gapminder dataset. Put it in the scripts/
+directory and add it to version control.
+
Run the script using the source function, using the file
+path as its argument (or by pressing the “source” button in
+RStudio).
+
+
+
+
+
+
+
+
+
The source function can be used to run a script from within another
+script. Suppose you load the same type of file over and
+over again and therefore need to specify the same arguments to fit the
+needs of your file each time. Instead of writing the necessary arguments again and
+again, you could write the loading code once and save it as a script. Then, you
+can use source("Your_Script_containing_the_load_function")
+in a new script to reuse that code without writing
+everything again. Check out ?source to find out more.
To run the script and load the data into the gapminder
+variable:
+
+
R
+
+
+source(file ="scripts/load-gapminder.R")
+
+
+
+
+
+
+
+
+
+
+
Challenge 4
+
+
+
Read the output of str(gapminder) again; this time, use
+what you’ve learned about lists and vectors, as well as the output of
+functions like colnames and dim to explain
+what everything that str prints out for gapminder means. If
+there are any parts you can’t interpret, discuss with your
+neighbors!
+
+
+
+
+
+
+
+
+
The object gapminder is a data frame with columns
+
+country and continent are character
+strings.
+
+year is an integer vector.
+
+pop, lifeExp, and gdpPercap
+are numeric vectors.
+
+
+
+
+
+
+
+
+
+
Keypoints
+
+
+
Use cbind() to add a new column to a data frame.
+
Use rbind() to add a new row to a data frame.
+
Remove rows from a data frame.
+
Use str(), summary(), nrow(),
+ncol(), dim(), colnames(),
+rownames(), head(), and typeof()
+to understand the structure of a data frame.
+
Read in a csv file using read.csv().
+
Understand what length() of a data frame
+represents.
In R, simple vectors containing character strings, numbers, or
+logical values are called atomic vectors because they can’t be
+further simplified.
+
+
+
+
So now that we’ve created a dummy vector to play with, how do we get
+at its contents?
+
Accessing elements using their indices
+
+
To extract elements of a vector we can give their corresponding
+index, starting from one:
+
+
R
+
+
+x[1]
+
+
+
OUTPUT
+
+
a
+5.4
+
+
+
R
+
+
+x[4]
+
+
+
OUTPUT
+
+
d
+4.8
+
+
It may look different, but the square brackets operator is a
+function. For vectors (and matrices), it means “get me the nth
+element”.
+
We can ask for multiple elements at once:
+
+
R
+
+
+x[c(1, 3)]
+
+
+
OUTPUT
+
+
a c
+5.4 7.1
+
+
Or slices of the vector:
+
+
R
+
+
+x[1:4]
+
+
+
OUTPUT
+
+
a b c d
+5.4 6.2 7.1 4.8
+
+
The : operator creates a sequence of numbers from the
+left element to the right.
+
+
R
+
+
+1:4
+
+
+
OUTPUT
+
+
[1] 1 2 3 4
+
+
+
R
+
+
+c(1, 2, 3, 4)
+
+
+
OUTPUT
+
+
[1] 1 2 3 4
+
+
We can ask for the same element multiple times:
+
+
R
+
+
+x[c(1,1,3)]
+
+
+
OUTPUT
+
+
a a c
+5.4 5.4 7.1
+
+
If we ask for an index beyond the length of the vector, R will return
+a missing value:
+
+
R
+
+
+x[6]
+
+
+
OUTPUT
+
+
<NA>
+ NA
+
+
This is a vector of length one containing an NA, whose
+name is also NA.
+
If we ask for the 0th element, we get an empty vector:
+
+
R
+
+
+x[0]
+
+
+
OUTPUT
+
+
named numeric(0)
+
+
+
+
+
+
+
Vector numbering in R starts at 1
+
+
+
In many programming languages (C and Python, for example), the first
+element of a vector has an index of 0. In R, the first element is 1.
+
+
+
+
Skipping and removing elements
+
+
If we use a negative number as the index of a vector, R will return
+every element except for the one specified:
+
+
R
+
+
+x[-2]
+
+
+
OUTPUT
+
+
a c d e
+5.4 7.1 4.8 7.5
+
+
We can skip multiple elements:
+
+
R
+
+
+x[c(-1, -5)]  # or x[-c(1,5)]
+
+
+
OUTPUT
+
+
b c d
+6.2 7.1 4.8
+
+
+
+
+
+
+
Tip: Order of operations
+
+
+
A common trip up for novices occurs when trying to skip slices of a
+vector. It’s natural to try to negate a sequence like so:
+
+
R
+
+
+x[-1:3]
+
+
This gives a somewhat cryptic error:
+
+
ERROR
+
+
Error in x[-1:3]: only 0's may be mixed with negative subscripts
+
+
But remember the order of operations. : is really a
+function. It takes its first argument as -1, and its second as 3, so
+it generates the sequence of numbers: c(-1, 0, 1, 2, 3).
+
The correct solution is to wrap that function call in brackets, so
+that the - operator applies to the result:
+
+
R
+
+
+x[-(1:3)]
+
+
+
OUTPUT
+
+
d e
+4.8 7.5
+
+
+
+
+
To remove elements from a vector, we need to assign the result back
+into the variable:
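A minimal sketch, using the named vector `x` from this episode:

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# Drop the fourth element and overwrite x with the result
x <- x[-4]
x
```

Without the assignment, `x[-4]` only prints the shortened vector and leaves `x` unchanged.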
Come up with at least 2 different commands that will produce the
+following output:
+
+
OUTPUT
+
+
b c d
+6.2 7.1 4.8
+
+
After you find 2 different commands, compare notes with your
+neighbour. Did you have different strategies?
+
+
+
+
+
+
+
+
+
+
R
+
+
+x[2:4]
+
+
+
OUTPUT
+
+
b c d
+6.2 7.1 4.8
+
+
+
R
+
+
+x[-c(1,5)]
+
+
+
OUTPUT
+
+
b c d
+6.2 7.1 4.8
+
+
+
R
+
+
+x[c(2,3,4)]
+
+
+
OUTPUT
+
+
b c d
+6.2 7.1 4.8
+
+
+
+
+
+
Subsetting by name
+
+
We can extract elements by using their name, instead of extracting by
+index:
+
+
R
+
+
+x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5)  # we can name a vector 'on the fly'
+x[c("a", "c")]
+
+
+
OUTPUT
+
+
a c
+5.4 7.1
+
+
This is usually a much more reliable way to subset objects: the
+position of various elements can often change when chaining together
+subsetting operations, but the names will always remain the same!
+
Subsetting through other logical operations
+
+
We can also use any logical vector to subset:
+
+
R
+
+
+x[c(FALSE, FALSE, TRUE, FALSE, TRUE)]
+
+
+
OUTPUT
+
+
c e
+7.1 7.5
+
+
Since comparison operators (e.g. >,
+<, ==) evaluate to logical vectors, we can
+also use them to succinctly subset vectors: the following statement
+gives the same result as the previous one.
+
+
R
+
+
+x[x>7]
+
+
+
OUTPUT
+
+
c e
+7.1 7.5
+
+
Breaking it down, this statement first evaluates x>7,
+generating a logical vector
+c(FALSE, FALSE, TRUE, FALSE, TRUE), and then selects the
+elements of x corresponding to the TRUE
+values.
+
We can use == to mimic the previous method of indexing
+by name (remember you have to use == rather than
+= for comparisons):
+
+
R
+
+
+x[names(x)=="a"]
+
+
+
OUTPUT
+
+
a
+5.4
+
+
+
+
+
+
+
Tip: Combining logical conditions
+
+
+
We often want to combine multiple logical criteria. For example, we
+might want to find all the countries that are located in Asia
+or Europe and have life expectancies
+within a certain range. Several operations for combining logical vectors
+exist in R:
+
+&, the “logical AND” operator: returns
+TRUE if both the left and right are TRUE.
+
+|, the “logical OR” operator: returns
+TRUE, if either the left or right (or both) are
+TRUE.
+
You may sometimes see && and ||
+instead of & and |. These two-character
+operators work on single values rather than whole vectors: older versions of R
+looked only at the first element of each vector and ignored the
+remaining elements, while R 4.3.0 and later raise an error if either side has
+more than one element. In general you should not use the two-character
+operators in data analysis; save them for programming, i.e. deciding
+whether to execute a statement.
+
+!, the “logical NOT” operator: converts
+TRUE to FALSE and FALSE to
+TRUE. It can negate a single logical condition (e.g.
+!TRUE becomes FALSE), or a whole vector of
+conditions (e.g. !c(TRUE, FALSE) becomes
+c(FALSE, TRUE)).
+
Additionally, you can compare the elements within a single vector
+using the all function (which returns TRUE if
+every element of the vector is TRUE) and the
+any function (which returns TRUE if one or
+more elements of the vector are TRUE).
Write a subsetting command to return the values in x that are greater
+than 4 and less than 7.
+
+
+
+
+
+
+
+
+
+
R
+
+
+x_subset <- x[x < 7 & x > 4]
+print(x_subset)
+
+
+
OUTPUT
+
+
a b d
+5.4 6.2 4.8
+
+
+
+
+
+
+
+
+
+
+
Tip: Non-unique names
+
+
+
You should be aware that it is possible for multiple elements in a
+vector to have the same name. (For a data frame, columns can have the
+same name — although R tries to avoid this — but row names must be
+unique.) Consider these examples:
+
+
R
+
+
+x<-1:3
+x
+
+
+
OUTPUT
+
+
[1] 1 2 3
+
+
+
R
+
+
+names(x)<-c('a', 'a', 'a')
+x
+
+
+
OUTPUT
+
+
a a a
+1 2 3
+
+
+
R
+
+
+x['a']# only returns first value
+
+
+
OUTPUT
+
+
a
+1
+
+
+
R
+
+
+x[names(x)=='a']# returns all three values
+
+
+
OUTPUT
+
+
a a a
+1 2 3
+
+
+
+
+
+
+
+
+
+
Tip: Getting help for operators
+
+
+
Remember you can search for help on operators by wrapping them in
+quotes: help("%in%") or ?"%in%".
+
+
+
+
Skipping named elements
+
+
Skipping or removing named elements is a little harder. If we try to
+skip one named element by negating the string, R complains (slightly
+obscurely) that it doesn’t know how to take the negative of a
+string:
+
+
R
+
+
+x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5)  # we start again by naming a vector 'on the fly'
+x[-"a"]
+
+
+
ERROR
+
+
Error in -"a": invalid argument to unary operator
+
+
However, we can use the != (not-equals) operator to
+construct a logical vector that will do what we want:
+
+
R
+
+
+x[names(x)!="a"]
+
+
+
OUTPUT
+
+
b c d e
+6.2 7.1 4.8 7.5
+
+
Skipping multiple named indices is a little bit harder still. Suppose
+we want to drop the "a" and "c" elements, so
+we try this:
+
+
R
+
+
+x[names(x)!=c("a","c")]
+
+
+
WARNING
+
+
Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length
+
+
+
OUTPUT
+
+
b c d e
+6.2 7.1 4.8 7.5
+
+
R did something, but it gave us a warning that we ought to
+pay attention to - and it apparently gave us the wrong answer
+(the "c" element is still included in the vector)!
+
So what does != actually do in this case? That’s an
+excellent question.
+
+
Recycling
+
Let’s take a look at the comparison component of this code:
+
+
R
+
+
+names(x)!=c("a", "c")
+
+
+
WARNING
+
+
Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length
+
+
+
OUTPUT
+
+
[1] FALSE TRUE TRUE TRUE TRUE
+
+
Why does R give TRUE as the third element of this
+vector, when names(x)[3] != "c" is obviously false? When
+you use !=, R tries to compare each element of the left
+argument with the corresponding element of its right argument. What
+happens when you compare vectors of different lengths?
+
When one vector is shorter than the other, it gets
+recycled:
+
In this case R repeats c("a", "c") as
+many times as necessary to match names(x), i.e. we get
+c("a","c","a","c","a"). Since the recycled "a"
+doesn’t match the third element of names(x), the value of
+!= is TRUE. Because in this case the longer
+vector length (5) isn’t a multiple of the shorter vector length (2), R
+printed a warning message. If we had been unlucky and
+names(x) had contained six elements, R would
+silently have done the wrong thing (i.e., not what we intended
+it to do). This recycling rule can introduce hard-to-find and subtle
+bugs!
+
The way to get R to do what we really want (match each
+element of the left argument with all of the elements of the
+right argument) is to use the %in% operator. The
+%in% operator goes through each element of its left
+argument, in this case the names of x, and asks, “Does this
+element occur in the second argument?”. Here, since we want to
+exclude values, we also need a ! operator to
+change “in” to “not in”:
+
+
R
+
+
+x[!names(x)%in%c("a","c")]
+
+
+
OUTPUT
+
+
b d e
+6.2 4.8 7.5
+
+
+
+
+
+
+
Challenge 3
+
+
+
Selecting elements of a vector that match any of a list of components
+is a very common data analysis task. For example, the gapminder data set
+contains country and continent variables, but
+no information between these two scales. Suppose we want to pull out
+information from southeast Asia: how do we set up an operation to
+produce a logical vector that is TRUE for all of the
+countries in southeast Asia and FALSE otherwise?
+
Suppose you have these data:
+
+
R
+
+
+seAsia <- c("Myanmar", "Thailand", "Cambodia", "Vietnam", "Laos")
+## read in the gapminder data that we downloaded in episode 2
+gapminder <- read.csv("data/gapminder_data.csv", header = TRUE)
+## extract the `country` column from a data frame (we'll see this later);
+## convert from a factor to a character;
+## and get just the non-repeated elements
+countries <- unique(as.character(gapminder$country))
+
+
There’s a wrong way (using only ==), which will give you
+a warning; a clunky way (using the logical operators == and
+|); and an elegant way (using %in%). See
+whether you can come up with all three and explain how they (don’t)
+work.
+
+
+
+
+
+
+
+
+
The wrong way to do this problem is
+countries==seAsia. This gives a warning
+("In countries == seAsia : longer object length is not a multiple of shorter object length")
+and the wrong answer (a vector of all FALSE values),
+because none of the recycled values of seAsia happen to
+line up correctly with matching values in countries.
+
The clunky (but technically correct) way to do this
+problem is to compare countries against each name in turn and join the
+comparisons with |
(countries==seAsia[1] | countries==seAsia[2] | ...).
+This gives the correct values, but hopefully you can see how awkward it
+is (what if we wanted to select countries from a much longer list?).
+
The best way to do this problem is
+countries %in% seAsia, which is both correct and easy to
+type (and read).
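The clunky and elegant approaches can be compared side by side. This is a sketch under one assumption: `countries` below is a short illustrative vector rather than the full list extracted from gapminder.

```r
seAsia <- c("Myanmar", "Thailand", "Cambodia", "Vietnam", "Laos")

# Short stand-in for unique(gapminder$country); illustrative only
countries <- c("Norway", "Thailand", "Senegal", "Vietnam",
               "Japan", "Laos", "Chile")

# Clunky: one comparison per country, joined with |
clunky <- countries == "Myanmar" | countries == "Thailand" |
  countries == "Cambodia" | countries == "Vietnam" | countries == "Laos"

# Elegant: %in% tests each element against the whole set
elegant <- countries %in% seAsia

countries[elegant]
```

Both produce the same logical vector, but only the `%in%` version scales to a longer list of names.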
+
+
+
+
+
+
Handling special values
+
+
At some point you will encounter functions in R that cannot handle
+missing, infinite, or undefined data.
+
There are a number of special functions you can use to filter out
+this data:
+
+is.na returns TRUE for every position in a vector, matrix, or
+data.frame that contains NA (or NaN).
+
Likewise, is.nan and is.infinite do
+the same for NaN and Inf.
+
+is.finite returns TRUE for every position in a vector,
+matrix, or data.frame that does not contain NA,
+NaN or Inf.
+
+na.omit will filter out all missing values from a
+vector.
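A short sketch of these functions on a numeric vector; the vector `v` is a made-up example:

```r
# Made-up vector containing each kind of special value
v <- c(1.5, NA, Inf, NaN, -2)

is.na(v)        # TRUE for NA and NaN
is.nan(v)       # TRUE only for NaN
is.infinite(v)  # TRUE only for Inf and -Inf
is.finite(v)    # TRUE only for ordinary numbers

v[is.finite(v)] # keep only the ordinary numbers
na.omit(v)      # drop NA and NaN, keep Inf
```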
+
Factor subsetting
+
+
Now that we’ve explored the different ways to subset vectors, how do
+we subset the other data structures?
+
Factor subsetting works the same way as vector subsetting.
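A brief sketch with a made-up factor `f` shows the same index, logical, and negative subsetting applied to a factor:

```r
# Made-up factor with four levels
f <- factor(c("a", "a", "b", "c", "c", "d"))

f[f == "a"]  # logical subsetting
f[1:3]       # subsetting by position
f[-3]        # skipping an element
```

Note that skipping the only "b" element does not remove "b" from the levels of the result.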
Unlike vectors, if we try to access a row or column outside of the
+matrix, R will throw an error:
+
+
R
+
+
+m[, c(3,6)]
+
+
+
ERROR
+
+
Error in m[, c(3, 6)]: subscript out of bounds
+
+
+
+
+
+
+
Tip: Higher dimensional arrays
+
+
+
When dealing with multi-dimensional arrays, each argument to
+[ corresponds to a dimension. For example, for a 3D array, the
+three arguments correspond to the rows, columns, and depth
+dimension.
+
+
+
+
Because matrices are vectors, we can also subset using only one
+argument:
+
+
R
+
+
+m[5]
+
+
+
OUTPUT
+
+
[1] 0.3295078
+
+
This usually isn’t useful, and is often confusing to read. However, it is
+useful to note that matrices are laid out in column-major
+format by default. That is, the elements of the vector are arranged
+column-wise:
+
+
R
+
+
+matrix(1:6, nrow=2, ncol=3)
+
+
+
OUTPUT
+
+
[,1] [,2] [,3]
+[1,] 1 3 5
+[2,] 2 4 6
+
+
If you wish to populate the matrix by row, use
+byrow=TRUE:
+
+
R
+
+
+matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
+
+
+
OUTPUT
+
+
[,1] [,2] [,3]
+[1,] 1 2 3
+[2,] 4 5 6
+
+
Matrices can also be subsetted using their rownames and column names
+instead of their row and column indices.
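As a sketch, assuming a small matrix with hypothetical row names "r1", "r2" and column names "a", "b", "c" (these names are not from the lesson):

```r
# 2 x 3 matrix with illustrative row and column names
m <- matrix(1:6, nrow = 2, ncol = 3,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

m["r1", c("a", "c")]  # one row, two named columns
m[, "b"]              # every row of column "b"
```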
Which of the following commands will extract the values 11 and
+14?
+
A. m[2,4,2,5]
+
B. m[2:5]
+
C. m[4:5,2]
+
D. m[2,c(4,5)]
+
+
+
+
+
+
+
+
+
D
+
+
+
+
+
List subsetting
+
+
Now we’ll introduce some new subsetting operators. There are three
+functions used to subset lists. We’ve already seen these when learning
+about atomic vectors and matrices: [, [[, and
+$.
+
Using [ will always return a list. If you want to
+subset a list, but not extract an element, then you
+will likely use [.
+
+
R
+
+
+xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))
+xlist[1]
+
+
+
OUTPUT
+
+
$a
+[1] "Software Carpentry"
+
+
This returns a list with one element.
+
We can subset elements of a list exactly the same way as atomic
+vectors using [. Comparison operations, however, won’t work
+because they’re not recursive: they will try to condition on the data
+structures in each element of the list, not the individual elements
+within those data structures.
+xlist<-list(a ="Software Carpentry", b =1:10, data =head(mtcars))
+
+
Using your knowledge of both list and vector subsetting, extract the
+number 2 from xlist. Hint: the number 2 is contained within the “b” item
+in the list.
+
+
+
+
+
+
+
+
+
+
R
+
+
+xlist$b[2]
+
+
+
OUTPUT
+
+
[1] 2
+
+
+
R
+
+
+xlist[[2]][2]
+
+
+
OUTPUT
+
+
[1] 2
+
+
+
R
+
+
+xlist[["b"]][2]
+
+
+
OUTPUT
+
+
[1] 2
+
+
+
+
+
+
+
+
+
+
+
Challenge 6
+
+
+
Given a linear model:
+
+
R
+
+
+mod<-aov(pop~lifeExp, data=gapminder)
+
+
Extract the residual degrees of freedom (hint:
+attributes() will help you)
+
+
+
+
+
+
+
+
+
+
R
+
+
+attributes(mod)  ## `df.residual` is one of the names of `mod`
+
+
+
R
+
+
+mod$df.residual
+
+
+
+
+
+
Data frames
+
+
Remember that data frames are lists under the hood, so similar
+rules apply. However, they are also two-dimensional objects:
+
[ with one argument will act the same way as for lists,
+where each list element corresponds to a column. The resulting object
+will be a data frame:
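A minimal sketch, assuming a small stand-in called `gapminder_small` with illustrative values (on the full data the call would be `gapminder[3]`):

```r
# Stand-in data frame with three columns
gapminder_small <- data.frame(year = c(1952L, 1957L),
                              pop = c(8425333, 9240934),
                              lifeExp = c(28.8, 30.3))

# One argument selects columns, list-style, and keeps the data frame class
one_col <- gapminder_small[3]
class(one_col)
names(one_col)
```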
Similarly, [[ will act to extract a single
+column:
+
+
R
+
+
+head(gapminder[["lifeExp"]])
+
+
+
OUTPUT
+
+
[1] 28.801 30.332 31.997 34.020 36.088 38.438
+
+
And $ provides a convenient shorthand to extract columns
+by name:
+
+
R
+
+
+head(gapminder$year)
+
+
+
OUTPUT
+
+
[1] 1952 1957 1962 1967 1972 1977
+
+
With two arguments, [ behaves the same way as for
+matrices:
+
+
R
+
+
+gapminder[1:3,]
+
+
+
OUTPUT
+
+
country year pop continent lifeExp gdpPercap
+1 Afghanistan 1952 8425333 Asia 28.801 779.4453
+2 Afghanistan 1957 9240934 Asia 30.332 820.8530
+3 Afghanistan 1962 10267083 Asia 31.997 853.1007
+
+
If we subset a single row, the result will be a data frame (because
+the elements are mixed types):
+
+
R
+
+
+gapminder[3,]
+
+
+
OUTPUT
+
+
country year pop continent lifeExp gdpPercap
+3 Afghanistan 1962 10267083 Asia 31.997 853.1007
+
+
But for a single column the result will be a vector (this can be
+changed with the third argument, drop = FALSE).
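The difference can be seen in a short sketch; `gapminder_small` is an assumed stand-in with illustrative values:

```r
# Stand-in data frame with three columns
gapminder_small <- data.frame(year = c(1952L, 1957L),
                              pop = c(8425333, 9240934),
                              lifeExp = c(28.8, 30.3))

as_vector <- gapminder_small[, 3]                # simplified to a vector
as_frame <- gapminder_small[, 3, drop = FALSE]   # stays a data frame

class(as_vector)
class(as_frame)
```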
+
+
+
+
+
+
Challenge 7
+
+
+
Fix each of the following common data frame subsetting errors:
+
Extract observations collected for the year 1957
+
+
R
+
+
gapminder[gapminder$year =1957,]
+
+
Extract all columns except 1 through to 4
+
+
R
+
+
+gapminder[,-1:4]
+
+
Extract the rows where the life expectancy is longer than 80
+years
+
+
R
+
+
+gapminder[gapminder$lifeExp>80]
+
+
Extract the first row, and the fourth and fifth columns
+(continent and lifeExp).
+
+
R
+
+
+gapminder[1, 4, 5]
+
+
Advanced: extract rows that contain information for the years 2002
+and 2007
+
+
R
+
+
+gapminder[gapminder$year==2002|2007,]
+
+
+
+
+
+
+
+
+
+
Fix each of the following common data frame subsetting errors:
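One possible set of fixes is sketched below. To keep the example self-contained it runs on `gapminder_small`, an assumed four-row stand-in with illustrative values; with the real data, replace `gapminder_small` with `gapminder`.

```r
# Four-row stand-in with the gapminder column layout (illustrative values)
gapminder_small <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Japan", "Japan"),
  year = c(1957L, 2002L, 2002L, 2007L),
  pop = c(9.2e6, 2.5e7, 1.27e8, 1.27e8),
  continent = "Asia",
  lifeExp = c(30.3, 42.1, 82.0, 82.6),
  gdpPercap = c(820, 727, 28600, 31700)
)

# 1. Use == for comparison, not =
year1957 <- gapminder_small[gapminder_small$year == 1957, ]

# 2. Wrap the sequence in parentheses before negating it
not_first4 <- gapminder_small[, -(1:4)]

# 3. Add the comma so the condition selects rows
long_lived <- gapminder_small[gapminder_small$lifeExp > 80, ]

# 4. Combine the column indices with c()
first_row <- gapminder_small[1, c(4, 5)]

# 5. Repeat the comparison on both sides of |, or use %in%
recent <- gapminder_small[gapminder_small$year == 2002 |
                            gapminder_small$year == 2007, ]
recent_in <- gapminder_small[gapminder_small$year %in% c(2002, 2007), ]
```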
Write conditional statements with if...else statements
+and ifelse().
+
Write and understand for() loops.
+
+
+
+
+
+
Often when we’re coding we want to control the flow of our actions.
+This can be done by setting actions to occur only if a condition or a
+set of conditions are met. Alternatively, we can also set an action to
+occur a particular number of times.
+
There are several ways you can control flow in R. For conditional
+statements, the most commonly used approaches are the constructs:
+
+
R
+
+
# if
+if (condition is true) {
+ perform action
+}
+
+# if ... else
+if (condition is true) {
+ perform action
+} else { # that is, if the condition is false,
+ perform alternative action
+}
+
+
Say, for example, that we want R to print a message if a variable
+x has a particular value:
+
+
R
+
+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+}
+
+x
+
+
+
OUTPUT
+
+
[1] 8
+
+
The print statement does not appear in the console because x is not
+greater than or equal to 10. To print a different message for numbers less than 10,
+we can add an else statement.
+
+
R
+
+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else {
+  print("x is less than 10")
+}
+
+
+
OUTPUT
+
+
[1] "x is less than 10"
+
+
You can also test multiple conditions by using
+else if.
+
+
R
+
+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else if (x > 5) {
+  print("x is greater than 5, but less than 10")
+} else {
+  print("x is less than or equal to 5")
+}
+
+
+
OUTPUT
+
+
[1] "x is greater than 5, but less than 10"
+
+
Important: when R evaluates the condition inside
+if() statements, it is looking for a logical element, i.e.,
+TRUE or FALSE. This can cause some headaches
+for beginners. For example:
+
+
R
+
+
+x <- 4 == 3
+if (x) {
+  "4 equals 3"
+} else {
+  "4 does not equal 3"
+}
+
+
+
OUTPUT
+
+
[1] "4 does not equal 3"
+
+
As we can see, the “does not equal” message was printed because the vector x
+is FALSE:
+
+
R
+
+
+x<-4==3
+x
+
+
+
OUTPUT
+
+
[1] FALSE
+
+
+
+
+
+
+
Challenge 1
+
+
+
Use an if() statement to print a suitable message
+reporting whether there are any records from 2002 in the
+gapminder dataset. Now do the same for 2012.
+
+
+
+
+
+
+
+
+
We will first see a solution to Challenge 1 which does not use the
+any() function. We first select the rows of gapminder
+for which gapminder$year is equal to
+2002:
+
+
R
+
+
+gapminder[(gapminder$year==2002),]
+
+
Then, we count the number of rows of the data.frame
+gapminder that correspond to the year 2002:
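A sketch of that count, run on an assumed three-row stand-in (`gapminder_small`, illustrative values) so that it stands alone; the variable name `rows2002_number` matches the one used in the next step:

```r
# Stand-in with a year column (illustrative values)
gapminder_small <- data.frame(country = c("Japan", "Japan", "Chile"),
                              year = c(2002L, 2007L, 2002L))

# Count the rows whose year is 2002
rows2002_number <- nrow(gapminder_small[(gapminder_small$year == 2002), ])
rows2002_number
```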
The presence of any record for the year 2002 is equivalent to the
+request that rows2002_number is one or more:
+
+
R
+
+
+rows2002_number>=1
+
+
Putting all together, we obtain:
+
+
R
+
+
+if (nrow(gapminder[(gapminder$year == 2002), ]) >= 1) {
+  print("Record(s) for the year 2002 found.")
+}
+
+
All this can be done more quickly with any(). The
+logical condition can be expressed as:
+
+
R
+
+
+if (any(gapminder$year == 2002)) {
+  print("Record(s) for the year 2002 found.")
+}
+
+
+
+
+
+
Did anyone get an error message like this?
+
+
ERROR
+
+
Error in if (gapminder$year == 2012) {: the condition has length > 1
+
+
The if() function only accepts singular (of length 1)
+inputs. In R 4.2.0 and later it returns an error when you use it with a
+longer vector; earlier versions issued a warning and evaluated only
+the condition in the first element of the vector. Therefore, to use the
+if() function, you need to make sure your input is singular
+(of length 1).
+
+
+
+
+
+
Tip: Built in ifelse()
+function
+
+
+
R accepts both if() and
+else if() statements structured as outlined above, but also
+statements using R’s built-in ifelse()
+function. This function accepts both singular and vector inputs and is
+structured as follows:
+
+
R
+
+
# ifelse function
+ifelse(condition is true, perform action, perform alternative action)
+
+
where the first argument is the condition or a set of conditions to
+be met, the second argument is the statement that is evaluated when the
+condition is TRUE, and the third argument is the statement
+that is evaluated when the condition is FALSE.
+
+
R
+
+
+y <- -3
+ifelse(y < 0, "y is a negative number", "y is either positive or zero")
+
+
+
OUTPUT
+
+
[1] "y is a negative number"
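Because ifelse() also accepts vector inputs, the same test can be applied to several values at once. A small sketch with a made-up vector:

```r
# Made-up vector with a negative, a zero, and a positive value
y <- c(-3, 0, 2)

# The condition is checked element by element
signs <- ifelse(y < 0, "negative", "non-negative")
signs
```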
+
+
+
+
+
+
+
+
+
+
Tip: any() and
+all()
+
+
+
The any() function will return TRUE if at
+least one TRUE value is found within a vector, otherwise it
+will return FALSE. This can be used in a similar way to the
+%in% operator. The function all(), as the name
+suggests, will only return TRUE if all values in the vector
+are TRUE.
+
+
+
+
Repeating operations
+
+
If you want to iterate over a set of values, when the order of
+iteration is important, and perform the same operation on each, a
+for() loop will do the job. We saw for() loops
+in the shell
+lessons earlier. This is the most flexible of looping operations,
+but therefore also the hardest to use correctly. In general, the advice
+of many R users would be to learn about for()
+loops, but to avoid using for() loops unless the order of
+iteration is important: i.e. the calculation at each iteration depends
+on the results of previous iterations. If the order of iteration is not
+important, then you should learn about vectorized alternatives, such as
+the purrr package, as they pay off in computational
+efficiency.
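As a minimal sketch, a for() loop that prints each value of a sequence looks like this (the iterator name `i` is arbitrary):

```r
# Visit the values 1 to 10 in order and print each one
for (i in 1:10) {
  print(i)
}
```

After the loop finishes, `i` still holds the last value it was given.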
We notice in the output that when the first index (i) is
+set to 1, the second index (j) iterates through its full
+set of indices. Once the indices of j have been iterated
+through, then i is incremented. This process continues
+until the last index has been used for each for() loop.
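A nested loop with the behaviour just described can be sketched as follows; the specific index values are assumptions for illustration:

```r
# The outer loop runs over i; for each i the inner loop runs over all j
for (i in 1:5) {
  for (j in c("a", "b", "c", "d", "e")) {
    print(paste(i, j))
  }
}
```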
+
Rather than printing the results, we could write the loop output to a
+new object.
This approach can be useful, but ‘growing your results’ (building the
+result object incrementally) is computationally inefficient, so avoid it
+when you are iterating through a lot of values.
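The pattern in question, writing loop output to an object that grows on every iteration, might look like the sketch below. The name `output_vector` is taken from the challenge later in this episode; the loop bounds are assumed.

```r
# Start from an empty object and append one result per iteration
output_vector <- c()
for (i in 1:5) {
  for (j in c("a", "b", "c", "d", "e")) {
    temp_output <- paste(i, j)
    output_vector <- c(output_vector, temp_output)  # copies the vector each time
  }
}
output_vector
```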
+
+
+
+
+
+
Tip: don’t grow your results
+
+
+
One of the biggest things that trips up novices and experienced R
+users alike is building a results object (vector, list, matrix, data
+frame) as your for loop progresses. Computers are very bad at handling
+this, so your calculations can very quickly slow to a crawl. It’s much
+better to define an empty results object of appropriate
+dimensions beforehand, rather than initializing an empty object without dimensions.
+So if you know the end result will be stored in a matrix like above,
+create an empty matrix with 5 rows and 5 columns, then at each iteration
+store the results in the appropriate location.
+
+
+
+
A better way is to define your (empty) output object before filling
+in the values. For this example, it looks more involved, but is still
+more efficient.
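A sketch of the preallocated version follows. The names `output_matrix`, `output_vector2`, and `temp_output` appear in the challenge below; `j_vector` and `temp_j_value` are assumed helper names.

```r
# Create the 5 x 5 result object up front, then fill it in place
output_matrix <- matrix(nrow = 5, ncol = 5)
j_vector <- c("a", "b", "c", "d", "e")
for (i in 1:5) {
  for (j in 1:5) {
    temp_j_value <- j_vector[j]
    temp_output <- paste(i, temp_j_value)
    output_matrix[i, j] <- temp_output
  }
}

# Flatten the matrix into a vector (column by column)
output_vector2 <- as.vector(output_matrix)
output_vector2
```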
Sometimes you will find yourself needing to repeat an operation as
+long as a certain condition is met. You can do this with a
+while() loop.
+
+
R
+
+
while(this condition is true){
+ do a thing
+}
+
+
R will interpret a condition being met as “TRUE”.
+
As an example, here’s a while loop that generates random numbers from
+a uniform distribution (the runif() function) between 0 and
+1 until it gets one that’s less than 0.1.
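Such a loop could be sketched like this. The variable name `z` and the fixed seed are assumptions; the seed is only there to make the run repeatable.

```r
set.seed(42)  # assumption: fixed seed so repeated runs print the same numbers

z <- 1
while (z > 0.1) {
  z <- runif(1)   # draw one number between 0 and 1
  cat(z, "\n")
}
```

The loop stops as soon as a draw is no longer greater than 0.1.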
while() loops will not always be appropriate. You have
+to be particularly careful that you don’t end up stuck in an infinite
+loop because your condition is always met and hence the while statement
+never terminates.
+
+
+
+
+
+
+
+
+
Challenge 2
+
+
+
Compare the objects output_vector and
+output_vector2. Are they the same? If not, why not? How
+would you change the last block of code to make
+output_vector2 the same as output_vector?
+
+
+
+
+
+
+
+
+
We can check whether the two vectors are identical using the
+all() function:
+
+
R
+
+
+all(output_vector==output_vector2)
+
+
However, all the elements of output_vector can be found
+in output_vector2:
+
+
R
+
+
+all(output_vector%in%output_vector2)
+
+
and vice versa:
+
+
R
+
+
+all(output_vector2%in%output_vector)
+
+
therefore, the elements in output_vector and
+output_vector2 are just sorted in a different order. This
+is because as.vector() outputs the elements of an input
+matrix column by column. Taking a look at
+output_matrix, we can see that we want its elements by
+rows. The solution is to transpose output_matrix. We
+can do it either by calling the transpose function t() or
+by inputting the elements in the right order. The first solution
+requires changing the original
+
+
R
+
+
+output_vector2<-as.vector(output_matrix)
+
+
into
+
+
R
+
+
+output_vector2<-as.vector(t(output_matrix))
+
+
The second solution requires changing
+
+
R
+
+
+output_matrix[i, j]<-temp_output
+
+
into
+
+
R
+
+
+output_matrix[j, i]<-temp_output
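The column-by-column behaviour of as.vector() is easy to see on a small matrix. The sketch below uses a hypothetical 2 x 3 matrix m rather than the lesson's output_matrix.

```r
# A small matrix filled row by row:
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
m <- matrix(1:6, nrow = 2, byrow = TRUE)

by_column <- as.vector(m)     # walks down each column in turn: 1 4 2 5 3 6
by_row    <- as.vector(t(m))  # transposing first yields row order: 1 2 3 4 5 6

by_column
by_row
```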
+
+
+
+
+
+
+
+
+
+
+
Challenge 3
+
+
+
Write a script that loops through the gapminder data by
+continent and prints out whether the mean life expectancy is smaller or
+larger than 50 years.
+
+
+
+
+
+
+
+
+
Step 1: We want to make sure we can extract all the
+unique values of the continent vector
Step 2: We also need to loop over each of these
+continents and calculate the average life expectancy for each
+subset of data. We can do that as follows:
+
Loop over each of the unique values of ‘continent’
+
For each value of continent, create a temporary variable storing
+that subset
+
Return the calculated life expectancy to the user by printing the
+output:
Step 3: The exercise only wants the output printed
+if the average life expectancy is less than 50 or greater than 50. So we
+need to add an if() condition before printing, which
+evaluates whether the calculated average life expectancy is above or
+below a threshold, and prints an output conditional on the result. We
+need to amend (3) from above:
+
3a. If the calculated life expectancy is less than some threshold (50
+years), return the continent and a statement that life expectancy is
+less than threshold, otherwise return the continent and a statement that
+life expectancy is greater than threshold:
+
+
R
+
+
+thresholdValue <- 50
+
+for (iContinent in unique(gapminder$continent)) {
+  tmp <- mean(gapminder[gapminder$continent == iContinent, "lifeExp"])
+
+  if (tmp < thresholdValue) {
+    cat("Average Life Expectancy in", iContinent, "is less than", thresholdValue, "\n")
+  } else {
+    cat("Average Life Expectancy in", iContinent, "is greater than", thresholdValue, "\n")
+  } # end if else condition
+  rm(tmp)
+} # end for loop
+
+
+
+
+
+
+
+
+
+
+
Challenge 4
+
+
+
Modify the script from Challenge 3 to loop over each country. This
+time print out whether the life expectancy is smaller than 50, between
+50 and 70, or greater than 70.
+
+
+
+
+
+
+
+
+
We modify our solution to Challenge 3 by now adding two thresholds,
+lowerThreshold and upperThreshold and
+extending our if-else statements:
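One way this could look is sketched below. To keep the block self-contained it runs on a tiny made-up data frame called toy (fictional countries and values) instead of gapminder, and it stores each message in a hypothetical pre-allocated vector named verdict before printing it.

```r
# Toy stand-in for the gapminder data frame; countries and values are made up.
toy <- data.frame(
  country = rep(c("Aland", "Borduria", "Cascadia"), each = 2),
  lifeExp = c(40, 44, 55, 65, 72, 78),
  stringsAsFactors = FALSE
)

lowerThreshold <- 50
upperThreshold <- 70

countries <- unique(toy$country)
verdict <- character(length(countries))  # pre-allocated results object
names(verdict) <- countries

for (iCountry in countries) {
  tmp <- mean(toy[toy$country == iCountry, "lifeExp"])

  if (tmp < lowerThreshold) {
    verdict[iCountry] <- paste("is less than", lowerThreshold)
  } else if (tmp < upperThreshold) {
    verdict[iCountry] <- paste("is between", lowerThreshold, "and", upperThreshold)
  } else {
    verdict[iCountry] <- paste("is greater than", upperThreshold)
  }
  cat("Average Life Expectancy in", iCountry, verdict[iCountry], "\n")
  rm(tmp)
}
```

With the real data, replacing toy by gapminder should be the only change needed, assuming the same column names.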
Write a script that loops over each country in the
+gapminder dataset, tests whether the country starts with a
+‘B’, and graphs life expectancy against time as a line graph if the mean
+life expectancy is under 50 years.
+
+
+
+
+
+
+
+
+
We will use the grep() command that was introduced in
+the Unix Shell lesson to find countries that start with “B”. Let’s
+understand how to do this first. Following from the Unix shell section,
+we may be tempted to try the following:
+
+
R
+
+
+grep("^B", unique(gapminder$country))
+
+
But when we evaluate this command it returns the indices of the
+factor variable country that start with “B.” To get the
+values, we must add the value=TRUE option to the
+grep() command:
+
+
R
+
+
+grep("^B", unique(gapminder$country), value = TRUE)
+
+
We will now store these countries in a variable called
+candidateCountries, and then loop over each entry in the variable.
+Inside the loop, we evaluate the average life expectancy for each
+country, and if the average life expectancy is less than 50 we use
+base-plot to plot the evolution of average life expectancy using
+with() and subset():
+
+
R
+
+
+thresholdValue <- 50
+candidateCountries <- grep("^B", unique(gapminder$country), value = TRUE)
+
+for (iCountry in candidateCountries) {
+  tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"])
+
+  if (tmp < thresholdValue) {
+    cat("Average Life Expectancy in", iCountry, "is less than", thresholdValue, "plotting life expectancy graph... \n")
+
+    with(subset(gapminder, country == iCountry),
+         plot(year, lifeExp,
+              type = "o",
+              main = paste("Life Expectancy in", iCountry, "over time"),
+              ylab = "Life Expectancy",
+              xlab = "Year"
+         ) # end plot
+    ) # end with
+  } # end if
+  rm(tmp)
+} # end for loop
Today we’ll be learning about the ggplot2 package, because it is the
+most effective for creating publication-quality graphics.
+
ggplot2 is built on the grammar of graphics, the idea that any plot
+can be built from the same set of components: a data
+set, mapping aesthetics, and graphical
+layers:
+
Data sets are the data that you, the user,
+provide.
+
Mapping aesthetics are what connect the data to
+the graphics. They tell ggplot2 how to use your data to affect how the
+graph looks, such as changing what is plotted on the X or Y axis, or the
+size or color of different data points.
+
Layers are the actual graphical output from
+ggplot2. Layers determine what kinds of plot are shown (scatterplot,
+histogram, etc.), the coordinate system used (rectangular, polar,
+others), and other important aspects of the plot. The idea of layers of
+graphics may be familiar to you if you have used image editing programs
+like Photoshop, Illustrator, or Inkscape.
+
Let’s start off building an example using the gapminder data from
+earlier. The most basic function is ggplot, which lets R
+know that we’re creating a new plot. Any of the arguments we give the
+ggplot function are the global options for the
+plot: they apply to all layers on the plot.
+
+
R
+
+
+library("ggplot2")
+ggplot(data = gapminder)
+
+
Here we called ggplot and told it what data we want to
+show on our figure. This is not enough information for
+ggplot to actually draw anything. It only creates a blank
+slate for other elements to be added to.
+
Now we’re going to add in the mapping aesthetics
+using the aes function. aes tells
+ggplot how variables in the data map to
+aesthetic properties of the figure, such as which columns of
+the data should be used for the x and
+y locations.
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
+
+
Here we told ggplot we want to plot the “gdpPercap”
+column of the gapminder data frame on the x-axis, and the “lifeExp”
+column on the y-axis. Notice that we didn’t need to explicitly pass
+aes these columns
+(e.g. x = gapminder[, "gdpPercap"]). This is because
+ggplot is smart enough to know to look in the
+data for that column!
+
The final part of making our plot is to tell ggplot how
+we want to visually represent the data. We do this by adding a new
+layer to the plot using one of the
+geom functions.
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point()
+
+
Here we used geom_point, which tells ggplot
+we want to visually represent the relationship between
+x and y as a scatterplot of
+points.
+
+
+
+
+
+
Challenge 1
+
+
+
Modify the example so that the figure shows how life expectancy has
+changed over time:
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point()
+
+
Hint: the gapminder dataset has a column called “year”, which should
+appear on the x-axis.
+
+
+
+
+
+
+
+
+
Here is one possible solution:
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) + geom_point()
+
+
+
+
+
+
+
+
+
+
+
Challenge 2
+
+
+
In the previous examples and challenge we’ve used the
+aes function to tell the scatterplot geom
+about the x and y locations of each
+point. Another aesthetic property we can modify is the point
+color. Modify the code from the previous challenge to
+color the points by the “continent” column. What trends
+do you see in the data? Are they what you expected?
+
+
+
+
+
+
+
+
+
The solution presented below adds color=continent to the
+call of the aes function. The general trend seems to
+indicate an increased life expectancy over the years. On continents with
+stronger economies we find a longer life expectancy.
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, color = continent)) +
+  geom_point()
+
+
+
+
+
+
Layers
+
+
Using a scatterplot probably isn’t the best for visualizing change
+over time. Instead, let’s tell ggplot to visualize the data
+as a line plot:
Instead of adding a geom_point layer, we’ve added a
+geom_line layer.
+
However, the result doesn’t look quite as we might have expected: it
+seems to be jumping around a lot in each continent. Let’s try to
+separate the data by country, plotting one line for each country:
It’s important to note that each layer is drawn on top of the
+previous layer. In this example, the points have been drawn on top
+of the lines. Here’s a demonstration:
In this example, the aesthetic mapping of
+color has been moved from the global plot options in
+ggplot to the geom_line layer so it no longer
+applies to the points. Now we can clearly see that the points are drawn
+on top of the lines.
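The three line-plot variants described in this section can be sketched as follows. The block assumes the ggplot2 package is installed and, to stay self-contained, uses a small made-up data frame named toy with the same column names as gapminder; the object names p_continent, p_country and p_layers are hypothetical.

```r
library("ggplot2")

# Toy stand-in for the gapminder data frame; all values are made up.
toy <- data.frame(
  country   = rep(c("A", "B", "C", "D"), each = 3),
  continent = rep(c("North", "South"), each = 6),
  year      = rep(c(1952, 1977, 2007), times = 4),
  lifeExp   = c(40, 50, 60, 45, 52, 58, 60, 68, 75, 65, 70, 80)
)

# 1. Line plot colored by continent: all countries of a continent are joined
#    into a single line, which is why it appears to jump around.
p_continent <- ggplot(data = toy,
                      mapping = aes(x = year, y = lifeExp, color = continent)) +
  geom_line()

# 2. Adding group = country draws one line per country.
p_country <- ggplot(data = toy,
                    mapping = aes(x = year, y = lifeExp,
                                  group = country, color = continent)) +
  geom_line()

# 3. color is mapped only in the line layer; the point layer is added last,
#    so the (uncolored) points are drawn on top of the lines.
p_layers <- ggplot(data = toy,
                   mapping = aes(x = year, y = lifeExp, group = country)) +
  geom_line(mapping = aes(color = continent)) +
  geom_point()

p_layers  # printing a plot object draws it
```

Swapping toy for gapminder should give the corresponding plots for the real data.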
+
+
+
+
+
+
Tip: Setting an aesthetic to a value instead
+of a mapping
+
+
+
So far, we’ve seen how to use an aesthetic (such as
+color) as a mapping to a variable in the data.
+For example, when we use
+geom_line(mapping = aes(color=continent)), ggplot will give
+a different color to each continent. But what if we want to change the
+color of all lines to blue? You may think that
+geom_line(mapping = aes(color="blue")) should work, but it
+doesn’t. Since we don’t want to create a mapping to a specific variable,
+we can move the color specification outside of the aes()
+function, like this: geom_line(color="blue").
+
+
+
+
+
+
+
+
+
Challenge 3
+
+
+
Switch the order of the point and line layers from the previous
+example. What happened?
ggplot2 also makes it easy to overlay statistical models over the
+data. To demonstrate we’ll go back to our first example:
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point()
+
+
Currently it’s hard to see the relationship between the points due to
+some strong outliers in GDP per capita. We can change the scale of units
+on the x axis using the scale functions. These control the
+mapping between the data values and visual values of an aesthetic. We
+can also modify the transparency of the points, using the alpha
+argument, which is especially helpful when you have a large amount of
+data which is very clustered.
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10()
+
+
The scale_x_log10 function applied a log transformation to
+the x-axis scale of the plot, so that each multiple of 10 is evenly
+spaced from left to right. For example, a GDP per capita of 1,000 is the
+same horizontal distance away from a value of 10,000 as the 10,000 value
+is from 100,000. This helps to visualize the spread of the data along
+the x-axis.
+
+
+
+
+
+
Tip Reminder: Setting an aesthetic to a value
+instead of a mapping
+
+
+
Notice that we used geom_point(alpha = 0.5). As the
+previous tip mentioned, using a setting outside of the
+aes() function will cause this value to be used for all
+points, which is what we want in this case. But just like any other
+aesthetic setting, alpha can also be mapped to a variable in
+the data. For example, we can give a different transparency to each
+continent with
+geom_point(mapping = aes(alpha = continent)).
+
+
+
+
We can fit a simple relationship to the data by adding another layer,
+geom_smooth:
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method = "lm")
+
+
+
OUTPUT
+
+
`geom_smooth()` using formula = 'y ~ x'
+
+
We can make the line thicker by setting the
+linewidth aesthetic in the geom_smooth
+layer:
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
+  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method = "lm", linewidth = 1.5)
+
+
+
OUTPUT
+
+
`geom_smooth()` using formula = 'y ~ x'
+
+
There are two ways an aesthetic can be specified. Here we
+set the linewidth aesthetic by passing it as an
+argument to geom_smooth. (Since ggplot2 3.4.0, using size for the
+thickness of lines is deprecated in favor of linewidth.) Previously in
+the lesson we’ve used the aes function to define a mapping between
+data variables and their visual representation.
+
+
+
+
+
+
Challenge 4a
+
+
+
Modify the color and size of the points on the point layer in the
+previous example.
+
Hint: do not use the aes function.
+
+
+
+
+
+
+
+
+
Here is a possible solution: Notice that the color argument
+is supplied outside of the aes() function. This means that
+it applies to all data points on the graph and is not related to a
+specific variable.
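A sketch of such a solution is shown below. It assumes ggplot2 3.4.0 or later (for the linewidth argument) and uses a made-up data frame toy in place of gapminder; the size and color values are arbitrary choices.

```r
library("ggplot2")

# Toy stand-in for the gapminder data frame; all values are made up.
toy <- data.frame(
  gdpPercap = c(500, 1000, 5000, 10000, 20000, 40000),
  lifeExp   = c(42, 50, 61, 68, 74, 80)
)

# size and color are set outside aes(), so they apply to every point.
p_4a <- ggplot(data = toy, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(size = 3, color = "orange") +
  scale_x_log10() +
  geom_smooth(method = "lm", linewidth = 1.5)
```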
Modify your solution to Challenge 4a so that the points are now a
+different shape and are colored by continent with new trendlines. Hint:
+The color argument can be used inside the aesthetic.
+
+
+
+
+
+
+
+
+
Here is a possible solution: Notice that supplying the
+color argument inside the aes() functions
+enables you to connect it to a certain variable. The shape
+argument, as you can see, modifies all data points the same way (it is
+outside the aes() call) while the color
+argument which is placed inside the aes() call modifies a
+point’s color based on its continent value.
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
+  geom_point(size = 3, shape = 17) + scale_x_log10() +
+  geom_smooth(method = "lm", linewidth = 1.5)
+
+
+
OUTPUT
+
+
`geom_smooth()` using formula = 'y ~ x'
+
+
+
+
+
+
Multi-panel figures
+
+
Earlier we visualized the change in life expectancy over time across
+all countries in one plot. Alternatively, we can split this out over
+multiple panels by adding a layer of facet panels.
+
+
+
+
+
+
Tip
+
+
+
We start by making a subset of data including only countries located
+in the Americas. This includes 25 countries, which will begin to clutter
+the figure. Note that we apply a “theme” definition to rotate the x-axis
+labels to maintain readability. Nearly everything in ggplot2 is
+customizable.
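The subset-then-facet approach described here might be sketched as below. The block assumes ggplot2 is installed and uses a made-up data frame toy with fictional countries; the rotation angle of 45 degrees is an arbitrary choice for illustration.

```r
library("ggplot2")

# Toy stand-in for the gapminder data frame; countries and values are made up.
toy <- data.frame(
  country   = rep(c("Country A", "Country B", "Country C"), each = 3),
  continent = rep(c("Americas", "Americas", "Africa"), each = 3),
  year      = rep(c(1952, 1977, 2007), times = 3),
  lifeExp   = c(68, 74, 80, 50, 65, 76, 42, 56, 54)
)

# Keep only the rows for the Americas.
americas <- toy[toy$continent == "Americas", ]

# One panel per country, with rotated x-axis labels for readability.
p_facets <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
  geom_line() +
  facet_wrap(~ country) +
  theme(axis.text.x = element_text(angle = 45))
```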
The facet_wrap layer took a “formula” as its argument,
+denoted by the tilde (~). This tells R to draw a panel for each unique
+value in the country column of the gapminder dataset.
+
Modifying text
+
+
To clean this figure up for a publication we need to change some of
+the text elements. The x-axis is too cluttered, and the y axis should
+read “Life expectancy”, rather than the column name in the data
+frame.
+
We can do this by adding a couple of different layers. The
+theme layer controls the axis text, and overall text
+size. Labels for the axes, plot title and any legend can be set using
+the labs function. Legend titles are set using the same
+names we used in the aes specification. Thus below the
+color legend title is set using color = "Continent", while
+the title of a fill legend would be set using
+fill = "MyTitle".
+
+
R
+
+
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color = continent)) +
+  geom_line() + facet_wrap(~ country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+
+
Exporting the plot
+
+
The ggsave() function allows you to export a plot
+created with ggplot. You can specify the dimension and resolution of
+your plot by adjusting the appropriate arguments (width,
+height and dpi) to create high quality
+graphics for publication. In order to save the plot from above, we first
+assign it to a variable lifeExp_plot, then tell
+ggsave to save that plot in png format to a
+directory called results. (Make sure you have a
+results/ folder in your working directory.)
+
+
R
+
+
+lifeExp_plot <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color = continent)) +
+  geom_line() + facet_wrap(~ country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+
+ggsave(filename = "results/lifeExp.png", plot = lifeExp_plot, width = 12, height = 10, dpi = 300, units = "cm")
+
+
There are two nice things about ggsave. First, it
+defaults to the last plot, so if you omit the plot argument
+it will automatically save the last plot you created with
+ggplot. Secondly, it tries to determine the format you want
+to save your plot in from the file extension you provide for the
+filename (for example .png or .pdf). If you
+need to, you can specify the format explicitly in the
+device argument.
+
This is a taste of what you can do with ggplot2. RStudio provides a
+really useful cheat sheet of the different layers available, and more
+extensive documentation is available on the ggplot2 website. Finally,
+if you have no idea how to change something, a quick Google search will
+usually send you to a relevant question and answer on Stack Overflow
+with reusable code to modify!
+
+
+
+
+
+
Challenge 5
+
+
+
Generate boxplots to compare life expectancy between the different
+continents during the available years.
+
Advanced:
+
Rename y axis as Life Expectancy.
+
Remove x axis labels.
+
+
+
+
+
+
+
+
+
Here is a possible solution: xlab() and ylab()
+set labels for the x and y axes, respectively. The axis title, text and
+ticks are attributes of the theme and must be modified within a
+theme() call.
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = continent, y = lifeExp, fill = continent)) +
+  geom_boxplot() + facet_wrap(~ year) +
+  ylab("Life Expectancy") +
+  theme(axis.title.x = element_blank(),
+        axis.text.x = element_blank(),
+        axis.ticks.x = element_blank())
+
+
+
+
+
+
+
+
+
+
+
Keypoints
+
+
+
Use ggplot2 to create plots.
+
Think about graphics in layers: aesthetics, geometry, statistics,
+scale transformation, and grouping.
How can I operate on all the elements of a vector at once?
+
+
+
+
+
+
+
Objectives
+
To understand vectorized operations in R.
+
+
+
+
+
+
Most of R’s functions are vectorized, meaning that the function will
+operate on all elements of a vector without needing to loop through and
+act on each element one at a time. This makes writing code more concise,
+easy to read, and less error prone.
+
+
R
+
+
+x<-1:4
+x*2
+
+
+
OUTPUT
+
+
[1] 2 4 6 8
+
+
The multiplication happened to each element of the vector.
+
We can also add two vectors together:
+
+
R
+
+
+y<-6:9
+x+y
+
+
+
OUTPUT
+
+
[1] 7 9 11 13
+
+
Each element of x was added to its corresponding element
+of y:
+
+
R
+
+
x:  1  2  3  4
+    +  +  +  +
+y:  6  7  8  9
+---------------
+    7  9 11 13
+
+
Here is how we would add two vectors together using a for loop:
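A minimal sketch of such a loop is given below. It pre-allocates the result with integer(), which is a choice made for this illustration; the name output_vector is likewise only an assumption.

```r
x <- 1:4
y <- 6:9

# Pre-allocate the result, then fill it one element at a time.
output_vector <- integer(length(x))
for (i in seq_along(x)) {
  output_vector[i] <- x[i] + y[i]
}
output_vector
```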
Compare this to the output using vectorised operations.
+
+
R
+
+
+sum_xy<-x+y
+sum_xy
+
+
+
OUTPUT
+
+
[1] 7 9 11 13
+
+
+
+
+
+
+
Challenge 1
+
+
+
Let’s try this on the pop column of the
+gapminder dataset.
+
Make a new column in the gapminder data frame that
+contains population in units of millions of people. Check the head or
+tail of the data frame to make sure it worked.
+
+
+
+
+
+
+
+
+
Let’s try this on the pop column of the
+gapminder dataset.
+
Make a new column in the gapminder data frame that
+contains population in units of millions of people. Check the head or
+tail of the data frame to make sure it worked.
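One possible solution is sketched below. To stay self-contained it works on a three-row stand-in called toy, whose population values are copied from the 2007 rows printed later in this lesson; the column name pop_millions is a hypothetical choice.

```r
# Three rows standing in for gapminder (population values from the 2007 rows
# shown in the calcGDP output of this lesson).
toy <- data.frame(
  country = c("Afghanistan", "Albania", "Algeria"),
  pop     = c(31889923, 3600523, 33333216)
)

# The division is vectorized: every element of pop is divided by one million.
toy$pop_millions <- toy$pop / 1e6

head(toy)
```

On the full dataset the same line would read gapminder$pop_millions <- gapminder$pop / 1e6.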
Operations can also be performed on vectors of unequal length,
+through a process known as recycling. This process
+automatically repeats the shorter vector until it matches the length of
+the longer vector. R will provide a warning if the length of the longer
+vector is not a multiple of the length of the shorter vector.
+
+
R
+
+
+x<-c(1, 2, 3)
+y<-c(1, 2, 3, 4, 5, 6, 7)
+x+y
+
+
+
WARNING
+
+
Warning in x + y: longer object length is not a multiple of shorter object
+length
+
+
+
OUTPUT
+
+
[1] 2 4 6 5 7 9 8
+
+
Vector x was recycled to match the length of vector
+y.
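To make the recycling step visible, the sketch below builds the recycled vector by hand with rep_len() and compares it with what R does implicitly. The names x_recycled and manual_sum are hypothetical.

```r
x <- c(1, 2, 3)
y <- c(1, 2, 3, 4, 5, 6, 7)

# Repeat x until it is as long as y: 1 2 3 1 2 3 1
x_recycled <- rep_len(x, length(y))

# Element-wise sum of two equal-length vectors: 2 4 6 5 7 9 8
manual_sum <- x_recycled + y
manual_sum

# Same values as the implicit recycling (the warning is silenced here).
identical(manual_sum, suppressWarnings(x + y))
```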
Check argument conditions with stopifnot() in
+functions.
+
Test a function.
+
Set default values for function arguments.
+
Explain why we should divide programs into small, single-purpose
+functions.
+
+
+
+
+
+
If we only had one data set to analyze, it would probably be faster
+to load the file into a spreadsheet and use that to plot simple
+statistics. However, the gapminder data is updated periodically, and we
+may want to pull in that new information later and re-run our analysis
+again. We may also obtain similar data from a different source in the
+future.
+
In this lesson, we’ll learn how to write a function so that we can
+repeat several operations with a single command.
+
+
+
+
+
+
What is a function?
+
+
+
Functions gather a sequence of operations into a whole, preserving it
+for ongoing use. Functions provide:
+
a name we can remember and invoke it by
+
relief from the need to remember the individual operations
+
a defined set of inputs and expected outputs
+
rich connections to the larger programming environment
+
As the basic building block of most programming languages,
+user-defined functions constitute “programming” as much as any single
+abstraction can. If you have written a function, you are a computer
+programmer.
+
+
+
+
Defining a function
+
+
Let’s open a new R script file in the functions/
+directory and call it functions-lesson.R.
+
The general structure of a function is:
+
+
R
+
+
+my_function<-function(parameters){
+# perform action
+# return value
+}
+
+
Let’s define a function fahr_to_kelvin() that converts
+temperatures from Fahrenheit to Kelvin:
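A definition along these lines, using the same conversion formula that appears later in this lesson, could be:

```r
fahr_to_kelvin <- function(temp) {
  # Convert Fahrenheit to Celsius, then shift to Kelvin.
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}
```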
We define fahr_to_kelvin() by assigning it to the output
+of function. The list of argument names is contained
+within parentheses. Next, the body of
+the function (the statements that are executed when it runs) is contained
+within curly braces ({}). The statements in the body are
+indented by two spaces. This makes the code easier to read but does not
+affect how the code operates.
+
It is useful to think of creating functions like writing a cookbook.
+First you define the “ingredients” that your function needs. In this
+case, we only need one ingredient to use our function: “temp”. After we
+list our ingredients, we then say what we will do with them. In this
+case, we are taking our ingredient and applying a set of mathematical
+operators to it.
+
When we call the function, the values we pass to it as arguments are
+assigned to those variables so that we can use them inside the function.
+Inside the function, we use a return statement to send a
+result back to whoever asked for it.
+
+
+
+
+
+
Tip
+
+
+
One feature of R that may surprise you is that the return statement is
+not required. R automatically returns the value of the last expression
+evaluated in the body of the function. But for clarity, we will
+explicitly define the return statement.
+
+
+
+
Let’s try running our function. Calling our own function is no
+different from calling any other function:
+
+
R
+
+
+# freezing point of water
+fahr_to_kelvin(32)
+
+
+
OUTPUT
+
+
[1] 273.15
+
+
+
R
+
+
+# boiling point of water
+fahr_to_kelvin(212)
+
+
+
OUTPUT
+
+
[1] 373.15
+
+
+
+
+
+
+
Challenge 1
+
+
+
Write a function called kelvin_to_celsius() that takes a
+temperature in Kelvin and returns that temperature in Celsius.
+
Hint: To convert from Kelvin to Celsius you subtract 273.15
+
+
+
+
+
+
+
+
+
Write a function called kelvin_to_celsius that takes a
+temperature in Kelvin and returns that temperature in Celsius
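A possible solution, following the hint that the conversion subtracts 273.15 (the local variable name celsius is an arbitrary choice):

```r
kelvin_to_celsius <- function(temp) {
  celsius <- temp - 273.15
  return(celsius)
}

# absolute zero expressed in Celsius
kelvin_to_celsius(0)
```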
Now that we’ve begun to appreciate how writing functions provides an
+efficient way to make R code re-usable and modular, we should note that
+it is important to ensure that functions only work in their intended
+use-cases. Checking function parameters is related to the concept of
+defensive programming. Defensive programming encourages us to
+frequently check conditions and throw an error if something is wrong.
+These checks are referred to as assertion statements because we want to
+assert some condition is TRUE before proceeding. They make
+it easier to debug because they give us a better idea of where the
+errors originate.
+
+
Checking conditions with stopifnot()
+
+
Let’s start by re-examining fahr_to_kelvin(), our
+function for converting temperatures from Fahrenheit to Kelvin. It was
+defined like so:
For this function to work as intended, the argument temp
+must be a numeric value; otherwise, the mathematical
+procedure for converting between the two temperature scales will not
+work. To create an error, we can use the function stop().
+For example, since the argument temp must be a
+numeric vector, we could check for this condition with an
+if statement and throw an error if the condition was
+violated. We could augment our function above like so:
+
+
R
+
+
+fahr_to_kelvin <- function(temp) {
+  if (!is.numeric(temp)) {
+    stop("temp must be a numeric vector.")
+  }
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+
If we had multiple conditions or arguments to check, it would take
+many lines of code to check all of them. Luckily R provides the
+convenience function stopifnot(). We can list as many
+requirements as we like, each of which should evaluate to TRUE;
+stopifnot() throws an error if it finds one that is
+FALSE. Listing these conditions also serves a secondary
+purpose as extra documentation for the function.
+
Let’s try out defensive programming with stopifnot() by
+adding assertions to check the input to our function
+fahr_to_kelvin().
+
We want to assert the following: temp is a numeric
+vector. We may do that like so:
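A sketch of the function with this single assertion added, reusing the conversion formula shown above:

```r
fahr_to_kelvin <- function(temp) {
  # Stop with an error unless temp is numeric.
  stopifnot(is.numeric(temp))
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}
```

Further conditions could be added as extra arguments to the same stopifnot() call.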
It still works when given proper input.
+
+
R
+
+
+# freezing point of water
+fahr_to_kelvin(temp = 32)
+
+
+
OUTPUT
+
+
[1] 273.15
+
+
But fails instantly if given improper input.
+
+
R
+
+
+# temp is a factor instead of numeric
+fahr_to_kelvin(temp = as.factor(32))
+
+
+
ERROR
+
+
Error in fahr_to_kelvin(temp = as.factor(32)): is.numeric(temp) is not TRUE
+
+
+
+
+
+
+
Challenge 3
+
+
+
Use defensive programming to ensure that our
+fahr_to_celsius() function throws an error immediately if
+the argument temp is specified inappropriately.
+
+
+
+
+
+
+
+
+
Extend our previous definition of the function by adding in an
+explicit call to stopifnot(). Since
+fahr_to_celsius() is a composition of two other functions,
+checking inside here makes adding checks to the two component functions
+redundant.
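One way to write this is sketched below. The two helper functions are repeated so that the block runs on its own; the intermediate variable names temp_k and result are assumptions made for this illustration.

```r
fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

kelvin_to_celsius <- function(temp) {
  celsius <- temp - 273.15
  return(celsius)
}

fahr_to_celsius <- function(temp) {
  # A single check here guards both component functions.
  stopifnot(is.numeric(temp))
  temp_k <- fahr_to_kelvin(temp)
  result <- kelvin_to_celsius(temp_k)
  return(result)
}

# freezing point of water in Celsius
fahr_to_celsius(32)
```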
Now, we’re going to define a function that calculates the Gross
+Domestic Product of a nation from the data available in our dataset:
+
+
R
+
+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat) {
+  gdp <- dat$pop * dat$gdpPercap
+  return(gdp)
+}
+
+
We define calcGDP() by assigning it to the output of
+function. The list of argument names is contained within
+parentheses. Next, the body of the function (the statements executed
+when you call the function) is contained within curly braces
+({}).
+
We’ve indented the statements in the body by two spaces. This makes
+the code easier to read but does not affect how it operates.
+
When we call the function, the values we pass to it are assigned to
+the arguments, which become variables inside the body of the
+function.
+
Inside the function, we use the return() function to
+send back the result. This return() function is optional: R
+will automatically return the results of whatever command is executed on
+the last line of the function.
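To see what this first version returns, the sketch below calls it on a three-row stand-in named toy. The values are copied from the 2007 rows printed later in this lesson, so the products are only approximate because the printed gdpPercap values are rounded.

```r
calcGDP <- function(dat) {
  gdp <- dat$pop * dat$gdpPercap
  return(gdp)
}

# Three rows standing in for gapminder (values from this lesson's 2007 output).
toy <- data.frame(
  country   = c("Afghanistan", "Albania", "Algeria"),
  pop       = c(31889923, 3600523, 33333216),
  gdpPercap = c(974.5803, 5937.0295, 6223.3675)
)

# Returns a bare numeric vector, one GDP value per row.
toy_gdp <- calcGDP(toy)
toy_gdp
```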
That’s not very informative. Let’s add some more arguments so we can
+extract that per year and country.
+
+
R
+
+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year = NULL, country = NULL) {
+  if (!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country, ]
+  }
+  gdp <- dat$pop * dat$gdpPercap
+
+  new <- cbind(dat, gdp = gdp)
+  return(new)
+}
+
+
If you’ve been writing these functions down into a separate R script
+(a good idea!), you can load the functions into your R session by
+using the source() function:
+
+
R
+
+
+source("functions/functions-lesson.R")
+
+
Ok, so there’s a lot going on in this function now. In plain English,
+the function now subsets the provided data by year if the year argument
+isn’t empty, then subsets the result by country if the country argument
+isn’t empty. Then it calculates the GDP for whatever subset emerges from
+the previous two steps. The function then adds the GDP as a new column
+to the subsetted data and returns this as the final result. You can see
+that the output is much more informative than a vector of numbers.
+
Let’s take a look at what happens when we specify the year:
+
+
R
+
+
+head(calcGDP(gapminder, year=2007))
+
+
+
OUTPUT
+
+
country year pop continent lifeExp gdpPercap gdp
+12 Afghanistan 2007 31889923 Asia 43.828 974.5803 31079291949
+24 Albania 2007 3600523 Europe 76.423 5937.0295 21376411360
+36 Algeria 2007 33333216 Africa 72.301 6223.3675 207444851958
+48 Angola 2007 12420476 Africa 42.731 4797.2313 59583895818
+60 Argentina 2007 40301927 Americas 75.320 12779.3796 515033625357
+72 Australia 2007 20434176 Oceania 81.235 34435.3674 703658358894
+
+
Or for a specific country:
+
+
R
+
+
+calcGDP(gapminder, country="Australia")
+
+
+
OUTPUT
+
+
country year pop continent lifeExp gdpPercap gdp
+61 Australia 1952 8691212 Oceania 69.120 10039.60 87256254102
+62 Australia 1957 9712569 Oceania 70.330 10949.65 106349227169
+63 Australia 1962 10794968 Oceania 70.930 12217.23 131884573002
+64 Australia 1967 11872264 Oceania 71.100 14526.12 172457986742
+65 Australia 1972 13177000 Oceania 71.930 16788.63 221223770658
+66 Australia 1977 14074100 Oceania 73.490 18334.20 258037329175
+67 Australia 1982 15184200 Oceania 74.740 19477.01 295742804309
+68 Australia 1987 16257249 Oceania 76.320 21888.89 355853119294
+69 Australia 1992 17481977 Oceania 77.560 23424.77 409511234952
+70 Australia 1997 18565243 Oceania 78.830 26997.94 501223252921
+71 Australia 2002 19546792 Oceania 80.370 30687.75 599847158654
+72 Australia 2007 20434176 Oceania 81.235 34435.37 703658358894
Here we’ve added two arguments, year, and
+country. We’ve set default arguments for both as
+NULL using the = operator in the function
+definition. This means that those arguments will take on those values
+unless the user specifies otherwise.
Here, we check whether each additional argument is set to
+NULL, and whenever an argument is not NULL we overwrite
+the dataset stored in dat with the subset selected by that
+argument.
+
Building these conditionals into the function makes it more flexible
+for later. Now, we can use it to calculate the GDP for:
+
The whole dataset;
+
A single year;
+
A single country;
+
A single combination of year and country.
+
By using %in% instead of ==, we can also give multiple years
+or countries to those arguments.
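The difference between the two operators shows up as soon as more than one value is requested. The sketch below uses a made-up data frame toy and a hypothetical vector wanted.

```r
# Made-up data: one row per year.
toy <- data.frame(
  year = c(1952, 1957, 1962, 1967),
  pop  = c(10, 11, 12, 13)
)
wanted <- c(1957, 1952)

# %in% asks, for each row, "is this year one of the wanted years?"
with_in <- toy[toy$year %in% wanted, ]

# == compares element by element, recycling wanted as 1957, 1952, 1957, 1952,
# so here no row matches at all.
with_eq <- toy[toy$year == wanted, ]

with_in
with_eq
```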
+
+
+
+
+
+
Tip: Pass by value
+
+
+
Functions in R almost always make copies of the data to operate on
+inside of a function body. When we modify dat inside the
+function we are modifying the copy of the gapminder dataset stored in
+dat, not the original variable we gave as the first
+argument.
+
This is called “pass-by-value” and it makes writing code much safer:
+you can always be sure that whatever changes you make within the body of
+the function stay inside the body of the function.
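A small sketch of this behaviour, using a hypothetical function double_pop() and a made-up data frame:

```r
# Hypothetical function that modifies its argument.
double_pop <- function(dat) {
  dat$pop <- dat$pop * 2   # changes only the function's own copy
  return(dat)
}

original <- data.frame(country = c("A", "B"), pop = c(1, 2))
modified <- double_pop(original)

original$pop   # unchanged by the call
modified$pop   # the doubled values live only in the returned copy
```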
+
+
+
+
+
+
+
+
+
Tip: Function scope
+
+
+
Another important concept is scoping: any variables (or functions!)
+you create or modify inside the body of a function only exist for the
+lifetime of the function’s execution. When we call
+calcGDP(), the variables dat, gdp
+and new only exist inside the body of the function. Even if
+we have variables of the same name in our interactive R session, they
+are not modified in any way when executing a function.
+
+
+
+
+
R
+
+
gdp <- dat$pop * dat$gdpPercap
+ new <-cbind(dat, gdp=gdp)
+return(new)
+}
+
+
Finally, we calculated the GDP on our new subset, and created a new
+data frame with that column added. This means when we call the function
+later we can see the context for the returned GDP values, which is much
+better than in our first attempt where we got a vector of numbers.
+
+
+
+
+
+
Challenge 4
+
+
+
Test out your GDP function by calculating the GDP for New Zealand in
+1987. How does this differ from New Zealand’s GDP in 1952?
+
+
+
+
+
+
+
+
+
+
R
+
+
+calcGDP(gapminder, year =c(1952, 1987), country ="New Zealand")
+
+
GDP for New Zealand in 1987: 65050008703
+
GDP for New Zealand in 1952: 21058193787
+
+
+
+
+
+
+
+
+
+
Challenge 5
+
+
+
The paste() function can be used to combine text
+together, e.g:
Write a function called fence() that takes two vectors
+as arguments, called text and wrapper, and
+prints out the text wrapped with the wrapper:
+
+
R
+
+
+fence(text=best_practice, wrapper="***")
+
+
Note: the paste() function has an argument
+called sep, which specifies the separator between text. The
+default is a space: " ". The default for paste0() is no
+space: "".
+
+
+
+
+
+
+
+
+
Write a function called fence() that takes two vectors
+as arguments, called text and wrapper, and
+prints out the text wrapped with the wrapper:
[1] "*** Write programs for people not computers ***"
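One possible fence() that produces output of this shape is sketched below. The contents of best_practice are an assumption, reconstructed from the printed result above; other solutions are equally valid.

```r
# Assumed input vector, inferred from the expected output
best_practice <- c("Write", "programs", "for", "people", "not", "computers")

fence <- function(text, wrapper) {
  # Place the wrapper before and after the text, then join with spaces
  text <- c(wrapper, text, wrapper)
  result <- paste(text, collapse = " ")
  return(result)
}

fence(text = best_practice, wrapper = "***")
```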
+
+
+
+
+
+
+
+
+
+
+
Tip
+
+
+
R has some unique aspects that can be exploited when performing more
+complicated operations. We will not be writing anything that requires
+knowledge of these more advanced concepts. In the future when you are
+comfortable writing functions in R, you can learn more by reading the R
+Language Manual or this chapter from Advanced R Programming by Hadley
+Wickham.
+
+
+
+
+
+
+
+
+
Tip: Testing and documenting
+
+
+
It’s important to both test functions and document them:
+documentation helps you, and others, understand what the purpose of your
+function is and how to use it, and it’s important to make sure that your
+function actually does what you think.
+
When you first start out, your workflow will probably look a lot like
+this:
+
Write a function
+
Comment parts of the function to document its behaviour
+
Load in the source file
+
Experiment with it in the console to make sure it behaves as you
+expect
+
Make any necessary bug fixes
+
Rinse and repeat.
+
Formal documentation for functions, written in separate
+.Rd files, gets turned into the documentation you see in
+help files. The roxygen2
+package allows R coders to write documentation alongside the function
+code and then process it into the appropriate .Rd files.
+You will want to switch to this more formal method of writing
+documentation when you start writing more complicated R projects. In
+fact, packages are, in essence, bundles of functions with this formal
+documentation. Loading your own functions through
+source("functions.R") is equivalent to loading someone
+else’s functions (or your own one day!) through
+library("package").
+
Formal automated tests can be written using the testthat package.
+
+
+
+
+
+
+
+
+
Keypoints
+
+
+
Use function to define a new function in R.
+
Use parameters to pass values into functions.
+
Use stopifnot() to flexibly check function arguments in
+R.
You have already seen how to save the most recent plot you create in
+ggplot2, using the command ggsave. As a
+refresher:
+
+
R
+
+
+ggsave("My_most_recent_plot.pdf")
+
+
You can save a plot from within RStudio using the ‘Export’ button in
+the ‘Plot’ window. This will give you the option of saving as a .pdf or
+as .png, .jpg or other image formats.
+
Sometimes you will want to save plots without creating them in the
+‘Plot’ window first. Perhaps you want to make a pdf document with
+multiple pages: each one a different plot, for example. Or perhaps
+you’re looping through multiple subsets of a file, plotting data from
+each subset, and you want to save each plot, but obviously can’t stop
+the loop to click ‘Export’ for each one.
+
In this case you can use a more flexible approach. The function
+pdf creates a new pdf device. You can control the size and
+resolution using the arguments to this function.
+
+
R
+
+
+pdf("Life_Exp_vs_time.pdf", width=12, height=4)
+ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=country))+
+geom_line()+
+theme(legend.position ="none")
+
+# You then have to make sure to turn off the pdf device!
+
+dev.off()
+
+
Open up this document and have a look.
+
+
+
+
+
+
Challenge 1
+
+
+
Rewrite your ‘pdf’ command to print a second page in the pdf, showing
+a facet plot (hint: use facet_grid) of the same data with
+one panel per continent.
+
+
diff --git a/instructor/12-plyr.html b/instructor/12-plyr.html
new file mode 100644
index 000000000..77fa8c1cf
--- /dev/null
+++ b/instructor/12-plyr.html
@@ -0,0 +1,1012 @@
+
+R for Reproducible Scientific Analysis: Splitting and Combining Data Frames with plyr
+
How can I do different calculations on different sets of data?
+
+
+
+
+
+
+
Objectives
+
To be able to use the split-apply-combine strategy for data
+analysis.
+
+
+
+
+
+
Previously we looked at how you can use functions to simplify your
+code. We defined the calcGDP function, which takes the
+gapminder dataset, and multiplies the population and GDP per capita
+column. We also defined additional arguments so we could filter by
+year and country:
+
+
R
+
+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP<-function(dat, year=NULL, country=NULL){
+if(!is.null(year)){
+dat<-dat[dat$year%in%year, ]
+}
+if(!is.null(country)){
+dat<-dat[dat$country%in%country,]
+}
+gdp<-dat$pop*dat$gdpPercap
+
+new<-cbind(dat, gdp=gdp)
+return(new)
+}
+
+
A common task you’ll encounter when working with data is that you’ll
+want to run calculations on different groups within the data. In the
+above, we were calculating the GDP by multiplying two columns together.
+But what if we wanted to calculate the mean GDP per continent?
+
We could run calcGDP and then take the mean of each
+continent:
But this isn’t very nice. Yes, by using a function, you have
+reduced a substantial amount of repetition. That is
+nice. But there is still repetition. Repeating yourself will cost you
+time, both now and later, and potentially introduce some nasty bugs.
+
We could write a new function that is flexible like
+calcGDP, but this also takes a substantial amount of effort
+and testing to get right.
+
The abstract problem we’re encountering here is known as
+“split-apply-combine”:
+
We want to split our data into groups, in this case
+continents, apply some calculations on that group, then
+optionally combine the results together afterwards.
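The three steps can be seen with base R alone. This is a minimal sketch on a made-up data frame called toy (hypothetical values, not the gapminder data):

```r
# Hypothetical toy data with a precomputed gdp column
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  gdp       = c(1, 3, 10, 30)
)

pieces <- split(toy$gdp, toy$continent)  # split: one numeric vector per continent
means  <- sapply(pieces, mean)           # apply mean() to each piece and
                                         # combine into one named vector
means
```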
+
The plyr package
+
+
For those of you who have used R before, you might be familiar with
+the apply family of functions. While R’s built in functions
+do work, we’re going to introduce you to another method for solving the
+“split-apply-combine” problem. The plyr package provides a set of
+functions that we find more user friendly for solving this problem.
+
We installed this package in an earlier challenge. Let us load it
+now:
+
+
R
+
+
+library("plyr")
+
+
Plyr has functions for operating on lists,
+data.frames and arrays (matrices, or
+n-dimensional vectors). Each function performs:
+
A splitting operation
+
+Apply a function on each split in turn.
+
Recombine output data as a single data object.
+
The functions are named based on the data structure they expect as
+input, and the data structure you want returned as output: [a]rray,
+[l]ist, or [d]ata.frame. The first letter corresponds to the input data
+structure, the second letter to the output data structure, and then the
+rest of the function is named “ply”.
+
This gives us 9 core functions **ply. There are an additional three
+functions which will only perform the split and apply steps, and not any
+combine step. They’re named by their input data type and represent null
+output by a _ (see table)
+
Note here that plyr’s use of “array” is different to R’s: an array in
+plyr can include a vector or matrix.
+
Each of the xxply functions (daply, ddply,
+llply, laply, …) has the same structure and
+four key features:
+
+
R
+
+
+xxply(.data, .variables, .fun)
+
+
The first letter of the function name gives the input type and the
+second gives the output type.
+
.data - gives the data object to be processed
+
.variables - identifies the splitting variables
+
.fun - gives the function to be called on each piece
+
Now we can quickly calculate the mean GDP per continent:
continent V1
+1 Africa 20904782844
+2 Americas 379262350210
+3 Asia 227233738153
+4 Europe 269442085301
+5 Oceania 188187105354
+
+
Let us walk through the previous code:
+
The ddply function feeds in a data.frame
+(function starts with d) and returns another
+data.frame (2nd letter is a d)
+
The first argument we gave was the data.frame we wanted to operate
+on: in this case the gapminder data. We called calcGDP on
+it first so that it would have the additional gdp column
+added to it.
+
The second argument indicated our split criteria: in this case the
+“continent” column. Note that we gave the name of the column, not the
+values of the column like we had done previously with subsetting. Plyr
+takes care of these implementation details for you.
+
The third argument is the function we want to apply to each grouping
+of the data. We had to define our own short function here: each subset
+of the data gets stored in x, the first argument of our
+function. This is an anonymous function: we haven’t defined it
+elsewhere, and it has no name. It only exists in the scope of our call
+to ddply.
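A call of the kind described in this walk-through might look like the sketch below. It assumes the plyr package is installed and uses a small made-up data frame called toy that already has a gdp column, in place of calcGDP(gapminder).

```r
library("plyr")

# Hypothetical toy data standing in for calcGDP(gapminder)
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  gdp       = c(1, 3, 10, 30)
)

mean_gdp <- ddply(
  .data      = toy,                      # data.frame in ...
  .variables = "continent",              # split by the column name
  .fun       = function(x) mean(x$gdp)   # anonymous function run on each piece
)
mean_gdp   # ... data.frame out, with the result in a column named V1
```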
+
+
+
+
+
+
Challenge 1
+
+
+
Calculate the average life expectancy per continent. Which has the
+longest? Which has the shortest?
year
+continent 1952 1957 1962 1967 1972
+ Africa 5992294608 7359188796 8784876958 11443994101 15072241974
+ Americas 117738997171 140817061264 169153069442 217867530844 268159178814
+ Asia 34095762661 47267432088 60136869012 84648519224 124385747313
+ Europe 84971341466 109989505140 138984693095 173366641137 218691462733
+ Oceania 54157223944 66826828013 82336453245 105958863585 134112109227
+ year
+continent 1977 1982 1987 1992 1997
+ Africa 18694898732 22040401045 24107264108 26256977719 30023173824
+ Americas 324085389022 363314008350 439447790357 489899820623 582693307146
+ Asia 159802590186 194429049919 241784763369 307100497486 387597655323
+ Europe 255367522034 279484077072 316507473546 342703247405 383606933833
+ Oceania 154707711162 176177151380 209451563998 236319179826 289304255183
+ year
+continent 2002 2007
+ Africa 35303511424 45778570846
+ Americas 661248623419 776723426068
+ Asia 458042336179 627513635079
+ Europe 436448815097 493183311052
+ Oceania 345236880176 403657044512
+
+
You can use these functions in place of for loops (and
+it is usually faster to do so). To replace a for loop, put the code that
+was in the body of the for loop inside an anonymous
+function.
+
+
R
+
+
+d_ply(
+ .data=gapminder,
+ .variables ="continent",
+ .fun =function(x){
+meanGDPperCap<-mean(x$gdpPercap)
+print(paste(
+"The mean GDP per capita for", unique(x$continent),
+"is", format(meanGDPperCap, big.mark=",")
+))
+}
+)
+
+
+
OUTPUT
+
+
[1] "The mean GDP per capita for Africa is 2,193.755"
+[1] "The mean GDP per capita for Americas is 7,136.11"
+[1] "The mean GDP per capita for Asia is 7,902.15"
+[1] "The mean GDP per capita for Europe is 14,469.48"
+[1] "The mean GDP per capita for Oceania is 18,621.61"
+
+
+
+
+
+
+
Tip: printing numbers
+
+
+
The format function can be used to make numeric values
+“pretty” for printing out in messages.
+
+
+
+
+
+
+
+
+
Challenge 2
+
+
+
Calculate the average life expectancy per continent and year. Which
+had the longest and shortest in 2007? Which had the greatest change in
+between 1952 and 2007?
How can I manipulate data frames without repeating myself?
+
+
+
+
+
+
+
Objectives
+
To be able to use the six main data frame manipulation ‘verbs’ with
+pipes in dplyr.
+
To understand how group_by() and
+summarize() can be combined to summarize datasets.
+
Be able to analyze a subset of data using logical filtering.
+
+
+
+
+
+
Manipulation of data frames means many things to many researchers: we
+often select certain observations (rows) or variables (columns), we
+often group the data by a certain variable(s), or we even calculate
+summary statistics. We can do these operations using the normal base R
+operations:
But this isn’t very nice because there is a fair bit of
+repetition. Repeating yourself will cost you time, both now and later,
+and potentially introduce some nasty bugs.
+
The dplyr package
+
+
Luckily, the dplyr
+package provides a number of very useful functions for manipulating data
+frames in a way that will reduce the above repetition, reduce the
+probability of making errors, and probably even save you some typing. As
+an added bonus, you might even find the dplyr grammar
+easier to read.
+
+
+
+
+
+
Tip: Tidyverse
+
+
+
The dplyr package belongs to a broader family of opinionated
+R packages designed for data science called the “Tidyverse”. These
+packages are specifically designed to work harmoniously together. Some
+of these packages will be covered during this course, but you can find
+more complete information here: https://www.tidyverse.org/.
+
+
+
+
Here we’re going to cover 5 of the most commonly used functions as
+well as using pipes (%>%) to combine them.
+
select()
+
filter()
+
group_by()
+
summarize()
+
mutate()
+
If you have not installed this package earlier, please do
+so:
+
+
R
+
+
+install.packages('dplyr')
+
+
Now let’s load the package:
+
+
R
+
+
+library("dplyr")
+
+
Using select()
+
+
If, for example, we wanted to move forward with only a few of the
+variables in our data frame we could use the select()
+function. This will keep only the variables you select.
If we open up year_country_gdp we’ll see that it only
+contains the year, country and gdpPercap. Above we used ‘normal’
+grammar, but the strengths of dplyr lie in combining
+several functions using pipes. Since the pipes grammar is unlike
+anything we’ve seen in R before, let’s repeat what we’ve done above
+using pipes.
To help you understand why we wrote that in that way, let’s walk
+through it step by step. First we summon the gapminder data frame and
+pass it on, using the pipe symbol %>%, to the next step,
+which is the select() function. In this case we don’t
+specify which data object we use in the select() function
+since it gets that from the previous pipe. Fun Fact:
+There is a good chance you have encountered pipes before in the shell.
+In R, a pipe symbol is %>% while in the shell it is
+| but the concept is the same!
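The two styles can be compared side by side. This sketch assumes dplyr is installed and uses a made-up data frame called toy (hypothetical values) in place of gapminder:

```r
library("dplyr")

# Hypothetical toy data standing in for gapminder
toy <- data.frame(
  country   = c("A", "B"),
  year      = c(1952, 1952),
  pop       = c(10, 30),
  gdpPercap = c(1, 3)
)

# 'Normal' grammar: the data frame is the first argument
direct <- select(toy, year, country, gdpPercap)

# Pipe grammar: the data frame is passed in from the left
piped <- toy %>% select(year, country, gdpPercap)

identical(direct, piped)
```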
+
+
+
+
+
+
Tip: Renaming data frame columns in dplyr
+
+
+
In Chapter 4 we covered how you can rename columns with base R by
+assigning a value to the output of the names() function.
+Just like select, this is a bit cumbersome, but thankfully dplyr has a
+rename() function.
+
Within a pipeline, the syntax is
+rename(new_name = old_name). For example, we may want to
+rename the gdpPercap column name from our select()
+statement above.
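A minimal sketch of rename() in a pipeline, assuming dplyr is installed. The data frame toy and the new column name gdp_per_capita are illustrative choices, not taken from the lesson:

```r
library("dplyr")

# Hypothetical toy data standing in for gapminder
toy <- data.frame(
  country   = c("A", "B"),
  year      = c(1952, 1952),
  pop       = c(10, 30),
  gdpPercap = c(1, 3)
)

renamed <- toy %>%
  select(year, country, gdpPercap) %>%
  rename(gdp_per_capita = gdpPercap)   # new_name = old_name

names(renamed)
```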
Write a single command (which can span multiple lines and includes
+pipes) that will produce a data frame that has the African values for
+lifeExp, country and year, but
+not for other Continents. How many rows does your data frame have and
+why?
As with last time, first we pass the gapminder data frame to the
+filter() function, then we pass the filtered version of the
+gapminder data frame to the select() function.
+Note: The order of operations is very important in this
+case. If we used ‘select’ first, filter would not be able to find the
+variable continent since we would have removed it in the previous
+step.
+
Using group_by()
+
+
Now, we were supposed to be reducing the error prone repetitiveness
+of what can be done with base R, but up to now we haven’t done that
+since we would have to repeat the above for each continent. Instead of
+filter(), which will only pass observations that meet your
+criteria (in the above: continent=="Europe"), we can use
+group_by(), which will essentially use every unique
+criterion that you could have used in filter().
+
+
R
+
+
+str(gapminder)
+
+
+
OUTPUT
+
+
'data.frame': 1704 obs. of 6 variables:
+ $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp : num 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num 779 821 853 836 740 ...
You will notice that the structure of the data frame where we used
+group_by() (grouped_df) is not the same as the
+original gapminder (data.frame). A
+grouped_df can be thought of as a list where
+each item in the list is a data.frame which
+contains only the rows that correspond to a particular value of
+continent (at least in the example above).
+
Using summarize()
+
+
The above was a bit on the uneventful side but
+group_by() is much more exciting in conjunction with
+summarize(). This will allow us to create new variable(s)
+by using functions that repeat for each of the continent-specific data
+frames. That is to say, using the group_by() function, we
+split our original data frame into multiple pieces, then we can run
+functions (e.g. mean() or sd()) within
+summarize().
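The combination can be sketched on a small made-up data frame called toy (hypothetical values), assuming dplyr is installed. The output column name mean_gdpPercap is an illustrative choice:

```r
library("dplyr")

# Hypothetical toy data standing in for gapminder
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  gdpPercap = c(1, 3, 10, 30)
)

gdp_bycontinents <- toy %>%
  group_by(continent) %>%                        # split into one group per continent
  summarize(mean_gdpPercap = mean(gdpPercap))    # one summary row per group

gdp_bycontinents
```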
# A tibble: 2 × 2
+ country mean_lifeExp
+ <chr> <dbl>
+1 Iceland 76.5
+2 Sierra Leone 36.8
+
+
Another way to do this is to use the dplyr function
+arrange(), which arranges the rows in a data frame
+according to the order of one or more variables from the data frame. It
+has similar syntax to other functions from the dplyr
+package. You can use desc() inside arrange()
+to sort in descending order.
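A short sketch of arrange() and desc(), assuming dplyr is installed and using a made-up summary table called toy (hypothetical countries and values):

```r
library("dplyr")

# Hypothetical summary table; values are invented for illustration
toy <- data.frame(
  country      = c("A", "B", "C"),
  mean_lifeExp = c(60, 75, 40)
)

ascending  <- toy %>% arrange(mean_lifeExp)         # smallest value first
descending <- toy %>% arrange(desc(mean_lifeExp))   # largest value first
```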
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
count() and n()
+
+
A very common operation is to count the number of observations for
+each group. The dplyr package comes with two related
+functions that help with this.
+
For instance, if we wanted to check the number of countries included
+in the dataset for the year 2002, we can use the count()
+function. It takes the name of one or more columns that contain the
+groups we are interested in, and we can optionally sort the results in
+descending order by adding sort=TRUE:
continent n
+1 Africa 52
+2 Asia 33
+3 Europe 30
+4 Americas 25
+5 Oceania 2
+
+
If we need to use the number of observations in calculations, the
+n() function is useful. It will return the total number of
+observations in the current group rather than counting the number of
+observations in each group within a specific column. For instance, if we
+wanted to get the standard error of the life expectancy per
+continent:
# A tibble: 5 × 2
+ continent se_le
+ <chr> <dbl>
+1 Africa 0.366
+2 Americas 0.540
+3 Asia 0.596
+4 Europe 0.286
+5 Oceania 0.775
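Both functions can be tried on a small made-up data frame called toy (hypothetical values), assuming dplyr is installed. The standard error is computed here as the standard deviation divided by the square root of the group size:

```r
library("dplyr")

# Hypothetical toy data standing in for gapminder
toy <- data.frame(
  continent = c("Africa", "Africa", "Africa", "Asia", "Asia"),
  lifeExp   = c(50, 52, 54, 70, 74)
)

# count(): number of rows per continent, largest group first
counts <- toy %>% count(continent, sort = TRUE)

# n(): the size of the current group, usable inside a calculation
se <- toy %>%
  group_by(continent) %>%
  summarize(se_le = sd(lifeExp) / sqrt(n()))
```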
+
+
You can also chain together several summary operations; in this case
+calculating the minimum, maximum,
+mean and se of each continent’s per-country
+life-expectancy:
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
Connect mutate with logical filtering: ifelse
+
+
When creating new variables, we can combine this with a logical
+condition. A simple combination of mutate() and
+ifelse() facilitates filtering right where it is needed: in
+the moment of creating something new. This easy-to-read statement is a
+fast and powerful way of discarding certain data (even though the
+overall dimension of the data frame will not change) or for updating
+values depending on this given condition.
+
+
R
+
+
+## keep all rows but "filter" on a certain condition
+# calculate GDP only for observations with a life expectancy above 25
+gdp_pop_bycontinents_byyear_above25<-gapminder%>%
+mutate(gdp_billion =ifelse(lifeExp>25, gdpPercap*pop/10^9, NA))%>%
+group_by(continent, year)%>%
+summarize(mean_gdpPercap =mean(gdpPercap),
+ sd_gdpPercap =sd(gdpPercap),
+ mean_pop =mean(pop),
+ sd_pop =sd(pop),
+ mean_gdp_billion =mean(gdp_billion),
+ sd_gdp_billion =sd(gdp_billion))
+
+
+
OUTPUT
+
+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
+
R
+
+
+## update values only if a certain condition is fulfilled
+# for life expectancies above 40 years, the GDP per capita expected in the future is scaled by 1.5
+gdp_future_bycontinents_byyear_high_lifeExp<-gapminder%>%
+mutate(gdp_futureExpectation =ifelse(lifeExp>40, gdpPercap*1.5, gdpPercap))%>%
+group_by(continent, year)%>%
+summarize(mean_gdpPercap =mean(gdpPercap),
+ mean_gdpPercap_expected =mean(gdp_futureExpectation))
+
+
+
OUTPUT
+
+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
Combining dplyr and ggplot2
+
+
First install and load ggplot2:
+
+
R
+
+
+install.packages('ggplot2')
+
+
+
R
+
+
+library("ggplot2")
+
+
In the plotting lesson we looked at how to make a multi-panel figure
+by adding a layer of facet panels using ggplot2. Here is
+the code we used (with some extra comments):
+
+
R
+
+
+# Filter countries located in the Americas
+americas<-gapminder[gapminder$continent=="Americas", ]
+# Make the plot
+ggplot(data =americas, mapping =aes(x =year, y =lifeExp))+
+geom_line()+
+facet_wrap(~country)+
+theme(axis.text.x =element_text(angle =45))
+
+
This code makes the right plot but it also creates an intermediate
+variable (americas) that we might not have any other uses
+for. Just as we used %>% to pipe data along a chain of
+dplyr functions we can use it to pass data to
+ggplot(). Because %>% replaces the first
+argument in a function we don’t need to specify the data =
+argument in the ggplot() function. By combining
+dplyr and ggplot2 functions we can make the
+same figure without creating any new variables or modifying the
+data.
+
+
R
+
+
+gapminder%>%
+# Filter countries located in the Americas
+filter(continent=="Americas")%>%
+# Make the plot
+ggplot(mapping =aes(x =year, y =lifeExp))+
+geom_line()+
+facet_wrap(~country)+
+theme(axis.text.x =element_text(angle =45))
+
+
More examples of using the function mutate() and the
+ggplot2 package.
+
+
R
+
+
+gapminder%>%
+# extract first letter of country name into new column
+mutate(startsWith =substr(country, 1, 1))%>%
+# only keep countries starting with A or Z
+filter(startsWith%in%c("A", "Z"))%>%
+# plot lifeExp into facets
+ggplot(aes(x =year, y =lifeExp, colour =continent))+
+geom_line()+
+facet_wrap(vars(country))+
+theme_minimal()
+
+
+
+
+
+
+
Advanced Challenge
+
+
+
Calculate the average life expectancy in 2002 of 2 randomly selected
+countries for each continent. Then arrange the continent names in
+reverse order. Hint: Use the dplyr
+functions arrange() and sample_n(), they have
+similar syntax to other dplyr functions.
To understand the concepts of ‘longer’ and ‘wider’ data frame
+formats and be able to convert between them with
+tidyr.
+
+
+
+
+
+
Researchers often want to reshape their data frames from ‘wide’ to
+‘longer’ layouts, or vice-versa. The ‘long’ layout or format is
+where:
+
each column is a variable
+
each row is an observation
+
In the purely ‘long’ (or ‘longest’) format, you usually have 1 column
+for the observed variable and the other columns are ID variables.
+
For the ‘wide’ format each row is often a site/subject/patient and
+you have multiple observation variables containing the same type of
+data. These can be either repeated observations over time, or
+observation of multiple variables (or a mix of both). You may find data
+input may be simpler or some other applications may prefer the ‘wide’
+format. However, many of R’s functions have been designed
+assuming you have ‘longer’ formatted data. This tutorial will help you
+efficiently transform your data shape regardless of original format.
+
Long and wide data frame layouts mainly affect readability. For
+humans, the wide format is often more intuitive since we can often see
+more of the data on the screen due to its shape. However, the long
+format is more machine readable and is closer to the formatting of
+databases. The ID variables in our data frames are similar to the fields
+in a database and observed variables are like the database values.
+
Getting started
+
+
First install the packages if you haven’t already done so (you
+probably installed dplyr in the previous lesson):
First, let’s look at the structure of our original gapminder data
+frame:
+
+
R
+
+
+str(gapminder)
+
+
+
OUTPUT
+
+
'data.frame': 1704 obs. of 6 variables:
+ $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp : num 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num 779 821 853 836 740 ...
+
+
+
+
+
+
+
Challenge 1
+
+
+
Is gapminder a purely long, purely wide, or some intermediate
+format?
+
+
+
+
+
+
+
+
+
The original gapminder data.frame is in an intermediate format. It is
+not purely long since it has multiple observation variables
+(pop, lifeExp, gdpPercap).
+
+
+
+
+
Sometimes, as with the gapminder dataset, we have multiple types of
+observed data. It is somewhere in between the purely ‘long’ and ‘wide’
+data formats. We have 3 “ID variables” (continent,
+country, year) and 3 “Observation variables”
+(pop,lifeExp,gdpPercap). This
+intermediate format can be preferred despite not having ALL observations
+in 1 column given that all 3 observation variables have different units.
+There are few operations that would need us to make this data frame any
+longer (i.e. 4 ID variables and 1 Observation variable).
+
While using many of the functions in R, which are often vector based,
+you usually do not want to do mathematical operations on values with
+different units. For example, using the purely long format, a single
+mean for all of the values of population, life expectancy, and GDP would
+not be meaningful since it would return the mean of values with 3
+incompatible units. The solution is that we first manipulate the data
+either by grouping (see the lesson on dplyr), or we change
+the structure of the data frame. Note: Some plotting
+functions in R actually work better in the wide format data.
+
From wide to long format with pivot_longer()
+
+
Until now, we’ve been using the nicely formatted original gapminder
+dataset, but ‘real’ data (i.e. our own research data) will never be so
+well organized. Here let’s start with the wide formatted version of the
+gapminder dataset.
+
+
Download the wide version of the gapminder data from here and save it in your data
+folder.
+
+
We’ll load the data file and look at it. Note: we don’t want our
+continent and country columns to be factors, so we use the
+stringsAsFactors argument for read.csv() to disable
+that.
To change this very wide data frame layout back to our nice,
+intermediate (or longer) layout, we will use one of the two available
+pivot functions from the tidyr package. To
+convert from wide to a longer format, we will use the
+pivot_longer() function. pivot_longer() makes
+datasets longer by increasing the number of rows and decreasing the
+number of columns, or ‘lengthening’ your observation variables into a
+single variable.
Here we have used piping syntax which is similar to what we were
+doing in the previous lesson with dplyr. In fact, these are compatible
+and you can use a mix of tidyr and dplyr functions by piping them
+together.
+
We first provide to pivot_longer() a vector of column
+names that will be pivoted into longer format. We could type out all the
+observation variables, but as in the select() function (see
+dplyr lesson), we can use the starts_with()
+argument to select all variables that start with the desired character
+string. pivot_longer() also allows the alternative syntax
+of using the - symbol to identify which variables are not
+to be pivoted (i.e. ID variables).
+
The next arguments to pivot_longer() are
+names_to for naming the column that will contain the new ID
+variable (obstype_year) and values_to for
+naming the new amalgamated observation variable
+(obs_value). We supply these new column names as
+strings.
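These arguments can be sketched on a tiny made-up wide table called toy_wide (hypothetical values), assuming tidyr and dplyr are installed. The column names obstype_year and obs_value follow the text above:

```r
library("tidyr")
library("dplyr")

# Hypothetical wide table: one row per country, one column per year
toy_wide <- data.frame(
  country  = c("A", "B"),
  pop_1952 = c(10, 30),
  pop_1957 = c(20, 40)
)

# Name the columns to pivot with starts_with() ...
toy_long <- toy_wide %>%
  pivot_longer(
    cols      = starts_with("pop"),
    names_to  = "obstype_year",
    values_to = "obs_value"
  )

# ... or name the ID column that should NOT be pivoted
toy_long_alt <- toy_wide %>%
  pivot_longer(
    cols      = -country,
    names_to  = "obstype_year",
    values_to = "obs_value"
  )
```

Both calls give a table with one row per country and year combination.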
That may seem trivial with this particular data frame, but sometimes
+you have 1 ID variable and 40 observation variables with irregular
+variable names. The flexibility is a huge time saver!
+
Now obstype_year actually contains 2 pieces of
+information, the observation type
+(pop, lifeExp, or gdpPercap) and
+the year. We can use the separate() function
+to split the character strings into multiple variables.
+
+
R
+
+
+gap_long<-gap_long%>%separate(obstype_year, into =c('obs_type', 'year'), sep ="_")
+gap_long$year<-as.integer(gap_long$year)
+
+
+
+
+
+
+
Challenge 2
+
+
+
Using gap_long, calculate the mean life expectancy,
+population, and gdpPercap for each continent. Hint: use
+the group_by() and summarize() functions we
+learned in the dplyr lesson
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
+
OUTPUT
+
+
# A tibble: 15 × 3
+# Groups: continent [5]
+ continent obs_type means
+ <chr> <chr> <dbl>
+ 1 Africa gdpPercap 2194.
+ 2 Africa lifeExp 48.9
+ 3 Africa pop 9916003.
+ 4 Americas gdpPercap 7136.
+ 5 Americas lifeExp 64.7
+ 6 Americas pop 24504795.
+ 7 Asia gdpPercap 7902.
+ 8 Asia lifeExp 60.1
+ 9 Asia pop 77038722.
+10 Europe gdpPercap 14469.
+11 Europe lifeExp 71.9
+12 Europe pop 17169765.
+13 Oceania gdpPercap 18622.
+14 Oceania lifeExp 74.3
+15 Oceania pop 8874672.
+
+
+
+
+
+
From long to intermediate format with pivot_wider()
+
+
It is always good to check work. So, let’s use the second
+pivot function, pivot_wider(), to ‘widen’ our
+observation variables back out. pivot_wider() is the
+opposite of pivot_longer(), making a dataset wider by
+increasing the number of columns and decreasing the number of rows. We
+can use pivot_wider() to pivot or reshape our
+gap_long to the original intermediate format or the widest
+format. Let’s start with the intermediate format.
+
The pivot_wider() function takes names_from
+and values_from arguments.
+
To names_from we supply the column name whose contents
+will be pivoted into new output columns in the widened data frame. The
+corresponding values will be added from the column named in the
+values_from argument.
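A minimal sketch of these two arguments, assuming tidyr is installed and using a made-up long table called toy_long (hypothetical values) rather than gap_long:

```r
library("tidyr")

# Hypothetical long table: one row per country and observation type
toy_long <- data.frame(
  country   = c("A", "A", "B", "B"),
  obs_type  = c("pop", "lifeExp", "pop", "lifeExp"),
  obs_value = c(10, 50, 30, 70)
)

toy_normal <- pivot_wider(
  toy_long,
  names_from  = obs_type,    # contents become the new column names
  values_from = obs_value    # contents fill those new columns
)

toy_normal
```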
Now we’ve got an intermediate data frame gap_normal with
+the same dimensions as the original gapminder, but the
+order of the variables is different. Let’s fix that before checking if
+they are all.equal().
That’s great! We’ve gone from the longest format back to the
+intermediate and we didn’t introduce any errors in our code.
+
Now let’s convert the long all the way back to the wide. In the wide
+format, we will keep country and continent as ID variables and pivot the
+observations across the 3 metrics
+(pop,lifeExp,gdpPercap) and time
+(year). First we need to create appropriate labels for all
+our new variables (time*metric combinations) and we also need to unify
+our ID variables to simplify the process of defining
+gap_wide.
Using unite() we now have a single ID variable which is
+a combination of continent and country, and we have
+defined variable names. We’re now ready to pipe in
+pivot_wider():
Take this one step further and create a
+gap_ludicrously_wide format data frame by pivoting over
+countries, year and the 3 metrics. Hint: this new data
+frame should only have 5 rows.
Understand the value of writing reproducible reports
+
Learn how to recognise and compile the basic components of an R
+Markdown file
+
Become familiar with R code chunks, and understand their purpose,
+structure and options
+
Demonstrate the use of inline chunks for weaving R outputs into text
+blocks, for example when discussing the results of some
+calculations
+
Be aware of alternative output formats to which an R Markdown file
+can be exported
+
+
+
+
+
+
Data analysis reports
+
+
Data analysts tend to write a lot of reports, describing their
+analyses and results, for their collaborators or to document their work
+for future reference.
+
Many new users begin by first writing a single R script containing
+all of their work, and then share the analysis by emailing the script
+and various graphs as attachments. But this can be cumbersome, requiring
+a lengthy discussion to explain which attachment was which result.
+
Writing formal reports with Word or LaTeX can simplify this
+process by incorporating both the analysis report and output graphs into
+a single document. But tweaking formatting to make figures look correct
+and fixing obnoxious page breaks can be tedious and lead to a lengthy
+“whack-a-mole” game of fixing new mistakes resulting from a single
+formatting change.
+
Creating a report as a web page (which is an html file) using R
+Markdown makes things easier. The report can be one long stream, so tall
+figures that wouldn’t ordinarily fit on one page can be kept at full
+size and remain easy to read, since the reader can simply keep scrolling.
+Additionally, the formatting of an R Markdown document is simple and
+easy to modify, allowing you to spend more time on your analyses instead
+of writing reports.
+
Literate programming
+
+
Ideally, such analysis reports are reproducible documents:
+If an error is discovered, or if some additional subjects are added to
+the data, you can just re-compile the report and get the new or
+corrected results rather than having to reconstruct figures, paste them
+into a Word document, and hand-edit various detailed results.
+
The key R package here is knitr. It allows you
+to create a document that is a mixture of text and chunks of code. When
+the document is processed by knitr, chunks of code will be
+executed, and graphs or other results will be inserted into the final
+document.
+
This sort of idea has been called “literate programming”.
+
knitr allows you to mix basically any type of text with
+code from different programming languages, but we recommend that you use
+R Markdown, which mixes Markdown with R. Markdown is a light-weight
+mark-up language for creating web pages.
+
Creating an R Markdown file
+
+
Within RStudio, click File → New File → R Markdown and you’ll get a
+dialog box like this:
+
You can stick with the default (HTML output), but give it a
+title.
+
Basic components of R Markdown
+
+
The initial chunk of text (header) contains instructions for R to
+specify what kind of document will be created, and the options chosen.
+You can use the header to give your document a title, author, date, and
+tell it what type of output you want to produce. In this case, we’re
+creating an html document.
You can delete any of those fields (title, author, date) if you don’t
+want them included. The double-quotes around their values aren’t strictly
+necessary in this case. They’re mostly needed if you want to include a
+colon in the title.
+
RStudio creates the document with some example text to get you
+started. Note below that there are chunks like
+
+```{r}
+summary(cars)
+```
+
+
These are chunks of R code that will be executed by
+knitr and replaced by their results. More on this
+later.
+
Markdown
+
+
Markdown is a system for writing web pages by marking up the text
+much as you would in an email rather than writing html code. The
+marked-up text gets converted to html, replacing the marks with
+the proper html code.
+
For now, let’s delete all of the stuff that’s there and write a bit
+of markdown.
+
You make things bold using two asterisks, like this:
+**bold**, and you make things italics by using
+underscores, like this: _italics_.
+
You can make a bulleted list by writing a list with hyphens or
+asterisks, leaving a blank line between the list and other text, like this:
+
A list:
+
+* bold with double-asterisks
+* italics with underscores
+* code-type font with backticks
+
or like this:
+
A second list:
+
+- bold with double-asterisks
+- italics with underscores
+- code-type font with backticks
+
Each will appear as:
+
bold with double-asterisks
+
italics with underscores
+
code-type font with backticks
+
You can use whatever method you prefer, but be consistent.
+This maintains the readability of your code.
+
You can make a numbered list by just using numbers. You can even use
+the same number over and over if you want:
+
1. bold with double-asterisks
+1. italics with underscores
+1. code-type font with backticks
+
This will appear as:
+
bold with double-asterisks
+
italics with underscores
+
code-type font with backticks
+
You can make section headers of different sizes by initiating a line
+with some number of # symbols:
+
# Title
+## Main section
+### Sub-section
+#### Sub-sub section
+
You compile the R Markdown document to an html webpage by
+clicking the “Knit” button in the upper-left.
+
+
+
+
+
+
Challenge 1
+
+
+
Create a new R Markdown document. Delete all of the R code chunks and
+write a bit of Markdown (some sections, some italicized text, and an
+itemized list).
+
Convert the document to a webpage.
+
+
+
+
+
+
+
+
+
In RStudio, select File > New file > R Markdown…
+
Delete the placeholder text and add the following:
+
# Introduction
+
+## Background on Data
+
+This report uses the *gapminder* dataset, which has columns that include:
+
+* country
+* continent
+* year
+* lifeExp
+* pop
+* gdpPercap
+
+## Background on Methods
+
+
Then click the ‘Knit’ button on the toolbar to generate an html
+document (webpage).
+
+
+
+
+
A bit more Markdown
+
+
You can make a hyperlink like this:
+[Carpentries Home Page](https://carpentries.org/).
+
You can include an image file like this:
+![The Carpentries Logo](https://carpentries.org/assets/img/TheCarpentries.svg)
+
You can do subscripts (e.g., F₂) with F~2~
+and superscripts (e.g., F²) with F^2^.
+
If you know how to write equations in LaTeX, you can use
+$ $ and $$ $$ to insert math equations, like
+$E = mc^2$ and
+
$$y = \mu + \sum_{i=1}^p \beta_i x_i + \epsilon$$
+
You can review Markdown syntax by navigating to the “Markdown Quick
+Reference” under the “Help” field in the toolbar at the top of
+RStudio.
+
R code chunks
+
+
The real power of Markdown comes from mixing markdown with chunks of
+code. This is R Markdown. When processed, the R code will be executed;
+if it produces figures, the figures will be inserted in the final
+document.
+
The main code chunks look like this:
+
+```{r load_data}
+gapminder <- read.csv("gapminder.csv")  # file path is an assumption; point it at your copy of the data
+```
+
That is, you place a chunk of R code between ```{r
+chunk_name} and ```. You should give each chunk a
+unique name: the names help you to locate errors and, if any graphs are
+produced, their file names are based on the name of the code chunk that
+produced them. You can create code chunks quickly in RStudio using the
+shortcuts Ctrl+Alt+I on Windows and
+Linux, or Cmd+Option+I on Mac.
+
+
+
+
+
+
Challenge 2
+
+
+
Add code chunks to:
+
Load the ggplot2 package
+
Read the gapminder data
+
Create a plot
+
+
+
+
+
+
+
+
+
+```{r load-ggplot2}
+library("ggplot2")
+```
+
+
+```{r read-gapminder-data}
+gapminder <- read.csv("gapminder.csv")  # file path is an assumption; point it at your copy of the data
+```
+
+```{r make-plot}
+plot(lifeExp ~ year, data = gapminder)
+```
+
+
+
+
+
+
+
How things get compiled
+
+
When you press the “Knit” button, the R Markdown document is
+processed by knitr
+and a plain Markdown document is produced (as well as, potentially, a
+set of figure files): the R code is executed and replaced by both the
+input and the output; if figures are produced, links to those figures
+are included.
+
The Markdown and figure documents are then processed by the tool pandoc, which converts the
+Markdown file into an html file, with the figures embedded.
+
Chunk options
+
+
There are a variety of options to affect how the code chunks are
+treated. Here are some examples:
+
Use echo=FALSE to avoid having the code itself
+shown.
+
Use results="hide" to avoid having any results
+printed.
+
Use eval=FALSE to have the code shown but not
+evaluated.
+
Use warning=FALSE and message=FALSE to
+hide any warnings or messages produced.
+
Use fig.height and fig.width to control
+the size of the figures produced (in inches).
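+
Options that you want to apply to every chunk can be set once, near the top
+of the document, with knitr::opts_chunk$set(). The sketch below is one
+possible setup; the particular values (including the Figs/ folder name and
+the figure width) are illustrative assumptions rather than recommended
+settings.
+
```r
# Typically placed in a setup chunk at the top of the R Markdown file.
knitr::opts_chunk$set(
  fig.path  = "Figs/",  # trailing slash: figures are saved inside a Figs/ folder
  echo      = FALSE,    # do not show the code in the report
  warning   = FALSE,    # hide warnings
  message   = FALSE,    # hide messages
  fig.width = 8         # default figure width in inches (assumed value)
)
```
+
Individual chunks can still override any of these defaults in their own
+chunk header.
+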
The fig.path option defines where the figures will be
+saved. The / here is really important; without it, the
+figures would be saved in the standard place but just with names that
+begin with Figs.
+
If you have multiple R Markdown files in a common directory, you
+might want to use fig.path to define separate prefixes for
+the figure file names, like fig.path="Figs/cleaning-" and
+fig.path="Figs/analysis-".
+
+
+
+
+
+
Challenge 3
+
+
+
Use chunk options to control the size of a figure and to hide the
+code.
You can review all of the R chunk options by navigating
+to the “R Markdown Cheat Sheet” under the “Cheatsheets” section of the
+“Help” field in the toolbar at the top of RStudio.
+
Inline R code
+
+
You can make every number in your report reproducible. Use
+`r and ` for an in-line code chunk, like so:
+`r round(some_value, 2)`. The code will be executed and
+replaced with the value of the result.
+
Don’t let these in-line chunks get split across lines.
+
Perhaps precede the paragraph with a larger code chunk that does
+calculations and defines variables, with include=FALSE for
+that larger chunk (which is the same as echo=FALSE and
+results="hide").
+
Rounding can produce differences in output in such situations. You
+may want 2.0, but round(2.03, 1) will give
+just 2.
+
The myround
+function in the R/broman
+package handles this.
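+
If you would rather not install an extra package, base R can also keep the
+trailing zero. The snippet below is a small sketch of two such approaches;
+the variable names are made up for the example.
+
```r
value <- 2.03

# round() returns a number, and the number 2 prints without a trailing zero.
rounded_text <- as.character(round(value, 1))

# format() with nsmall keeps at least one digit after the decimal point.
formatted_text <- format(round(value, 1), nsmall = 1)

# sprintf() rounds and formats in a single step.
sprintf_text <- sprintf("%.1f", value)

c(rounded_text, formatted_text, sprintf_text)
```
+
Both format() and sprintf() return character strings, which is what you
+want when the result is inserted into a sentence by in-line code.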
+
+
+
+
+
+
Challenge 4
+
+
+
Try out a bit of in-line R code.
+
+
+
+
+
+
+
+
+
Here’s some inline code to determine that 2 + 2 = 4.
+
+
+
+
+
Other output options
+
+
You can also convert R Markdown to a PDF or a Word document. Click
+the little triangle next to the “Knit” button to get a drop-down menu.
+Or you could put pdf_document or word_document
+in the initial header of the file.
+
+
+
+
+
+
Tip: Creating PDF documents
+
+
+
Creating .pdf documents may require installation of some extra
+software. The R package tinytex provides some tools to help
+make this process easier for R users. With tinytex
+installed, run tinytex::install_tinytex() to install the
+required software (you’ll only need to do this once) and then when you
+knit to pdf tinytex will automatically detect and install
+any additional LaTeX packages that are needed to produce the pdf
+document. Visit the tinytex
+website for more information.
+
+
+
+
+
+
+
+
+
Tip: Visual markdown editing in RStudio
+
+
+
RStudio versions 1.4 and later include visual markdown editing mode.
+In visual editing mode, markdown expressions (like
+**bold words**) are transformed to the formatted appearance
+(bold words) as you type. This mode also includes a
+toolbar at the top with basic formatting buttons, similar to what you
+might see in common word processing software programs. You can turn
+visual editing on and off by pressing the button in the top right corner of your
+R Markdown document.
How can I write software that other people can use?
+
+
+
+
+
+
+
Objectives
+
Describe best practices for writing R and explain the justification
+for each.
+
+
+
+
+
+
Structure your project folder
+
+
Keep your project folder structured, organized and tidy, by creating
+subfolders for your code files, manuals, data, binaries, output plots,
+etc. It can be done completely manually, or with the help of RStudio’s
+New Project functionality, or a designated package, such as
+ProjectTemplate.
+
+
+
+
+
+
Tip: ProjectTemplate - a possible
+solution
+
+
+
One way to automate the management of projects is to install the
+third-party package, ProjectTemplate. This package will set
+up an ideal directory structure for project management. This is very
+useful as it enables you to have your analysis pipeline/workflow
+organised and structured. Together with the default RStudio project
+functionality and Git you will be able to keep track of your work as
+well as be able to share your work with collaborators.
For more information on ProjectTemplate and its functionality visit
+the home page ProjectTemplate
+
+
+
+
Make code readable
+
+
The most important part of writing code is making it readable and
+understandable. You want someone else to be able to pick up your code
+and be able to understand what it does: more often than not this someone
+will be you, 6 months down the line, who will otherwise be cursing
+your past self.
+
Documentation: tell us what and why, not how
+
+
When you first start out, your comments will often describe what a
+command does, since you’re still learning yourself and it can help to
+clarify concepts and remind you later. However, these comments aren’t
+particularly useful later on when you don’t remember what problem your
+code is trying to solve. Try to also include comments that tell you
+why you’re solving a problem, and what problem that
+is. The how can come after that: it’s an implementation detail
+you ideally shouldn’t have to worry about.
+
Keep your code modular
+
+
Our recommendation is that you should separate your functions from
+your analysis scripts, and store them in a separate file that you
+source when you open the R session in your project. This
+approach is nice because it leaves you with an uncluttered analysis
+script, and a repository of useful functions that can be loaded into any
+analysis script in your project. It also lets you group related
+functions together easily.
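+
A minimal sketch of this layout is shown below. So that the example is
+self-contained it writes a small functions file to a temporary folder
+first; in a real project that file would already exist. The file name
+functions.R and the function cm_to_m are hypothetical.
+
```r
# Create a small functions file (in a real project this file already exists).
functions_file <- file.path(tempdir(), "functions.R")
writeLines(c(
  "# Convert a length in centimetres to metres.",
  "cm_to_m <- function(cm) {",
  "  cm / 100",
  "}"
), functions_file)

# At the top of an analysis script: load every function defined in the file.
source(functions_file)

# The analysis script itself stays short and readable.
height_m <- cm_to_m(175)
height_m
```
+
Any other analysis script in the project can load the same functions with
+the same source() call.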
+
Break down problems into bite-size pieces
+
+
When you first start out, problem solving and function writing can be
+daunting tasks, and hard to separate from code inexperience. Try to
+break down your problem into digestible chunks and worry about the
+implementation details later: keep breaking down the problem into
+smaller and smaller functions until you reach a point where you can code
+a solution, and build back up from there.
+
Know that your code is doing the right thing
+
+
Make sure to test your functions!
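+
One lightweight way to do this, sketched below, is to check a function
+against inputs whose answers you already know. The function fahr_to_kelvin
+is only an illustration; dedicated testing packages exist, but base R's
+stopifnot() is enough to get started.
+
```r
# Convert a temperature from degrees Fahrenheit to kelvin.
fahr_to_kelvin <- function(temp) {
  ((temp - 32) * (5 / 9)) + 273.15
}

# Known reference points: water freezes at 32 F and boils at 212 F.
# stopifnot() is silent when every check passes and stops with an error otherwise.
stopifnot(
  isTRUE(all.equal(fahr_to_kelvin(32), 273.15)),
  isTRUE(all.equal(fahr_to_kelvin(212), 373.15))
)
```
+
Keeping checks like these in a script means they can be re-run every time
+the function changes.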
+
Don’t repeat yourself
+
+
Functions enable easy reuse within a project. If you see blocks of
+similar lines of code through your project, those are usually candidates
+for being moved into functions.
+
If your calculations are performed through a series of functions,
+then the project becomes more modular and easier to change. This is
+especially the case when a particular input always gives a
+particular output.
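+
The sketch below shows the idea on made-up measurements: the same rescaling
+is first typed out twice, then moved into a single function. The data and
+the name rescale01 are hypothetical.
+
```r
# Hypothetical measurements, used only for illustration.
heights_cm <- c(150, 160, 170)
weights_kg <- c(50, 60, 70)

# Repetitive version: the same calculation is copied for each variable.
heights_scaled <- (heights_cm - min(heights_cm)) / (max(heights_cm) - min(heights_cm))
weights_scaled <- (weights_kg - min(weights_kg)) / (max(weights_kg) - min(weights_kg))

# Reusable version: the calculation lives in one place.
rescale01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

heights_scaled_fn <- rescale01(heights_cm)
weights_scaled_fn <- rescale01(weights_kg)
```
+
If the calculation ever needs to change, only the function has to be
+edited.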
+
Remember to be stylish
+
+
Apply consistent style to your code.
+
+
+
+
+
+
Keypoints
+
+
+
Keep your project folder structured, organized and tidy.
+
Document what and why, not how.
+
Break programs into short single-purpose functions.
+
Write re-runnable tests.
+
Don’t repeat yourself.
+
Be consistent in naming, indentation, and other aspects of
+style.
+
+
diff --git a/instructor/404.html b/instructor/404.html
new file mode 100644
index 000000000..fc2ef6605
--- /dev/null
+++ b/instructor/404.html
@@ -0,0 +1,451 @@
+
+R for Reproducible Scientific Analysis: Page not found
+ Skip to main content
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ R for Reproducible Scientific Analysis
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Page not found
+
+
Our apologies!
+
+
We cannot seem to find the page you are looking for. Here are some
+tips that may help:
to Share—copy and redistribute the material in any
+medium or format
+
to Adapt—remix, transform, and build upon the
+material
+
for any purpose, even commercially.
+
The licensor cannot revoke these freedoms as long as you follow the
+license terms.
+
Under the following terms:
+
Attribution—You must give appropriate credit
+(mentioning that your work is derived from work that is Copyright (c)
+The Carpentries and, where practical, linking to https://carpentries.org/), provide a link to the
+license, and indicate if changes were made. You may do so in any
+reasonable manner, but not in any way that suggests the licensor
+endorses you or your use.
+
No additional restrictions—You may not apply
+legal terms or technological measures that legally restrict others from
+doing anything the license permits. With the understanding
+that:
+
Notices:
+
You do not have to comply with the license for elements of the
+material in the public domain or where your use is permitted by an
+applicable exception or limitation.
+
No warranties are given. The license may not give you all of the
+permissions necessary for your intended use. For example, other rights
+such as publicity, privacy, or moral rights may limit how you use the
+material.
+
Software
+
+
Except where otherwise noted, the example programs and other software
+provided by The Carpentries are made available under the OSI-approved MIT
+license.
+
Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+“Software”), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
Trademark
+
+
“The Carpentries”, “Software Carpentry”, “Data Carpentry”, and
+“Library Carpentry” and their respective logos are registered trademarks
+of Community Initiatives.
Describe the purpose and use of each pane in the RStudio IDE
+
Locate buttons and options in the RStudio IDE
+
Define a variable
+
Assign data to a variable
+
Manage a workspace in an interactive R session
+
Use mathematical and comparison operators
+
Call functions
+
Manage packages
+
+
+
+
+
+
+
Motivation
+
+
+
Science is a multi-step process: once you’ve designed an experiment
+and collected data, the real fun begins! This lesson will teach you how
+to start this process using R and RStudio. We will begin with raw data,
+perform exploratory analyses, and learn how to plot results graphically.
+This example starts with a dataset from gapminder.org containing population
+information for many countries through time. Can you read the data into
+R? Can you plot the population for Senegal? Can you calculate the
+average income for countries on the continent of Asia? By the end of
+these lessons you will be able to do things like plot the populations
+for all of these countries in under a minute!
+
Before Starting The Workshop
+
+
+
Please ensure you have the latest version of R and RStudio installed
+on your machine. This is important, as some packages used in the
+workshop may not install correctly (or at all) if R is not up to
+date.
Welcome to the R portion of the Software Carpentry workshop.
+
Throughout this lesson, we’re going to teach you some of the
+fundamentals of the R language as well as some best practices for
+organizing code for scientific projects that will make your life
+easier.
+
We’ll be using RStudio: a free, open-source R Integrated Development
+Environment (IDE). It provides a built-in editor, works on all platforms
+(including on servers) and provides many advantages such as integration
+with version control and project management.
+
Basic layout
+
When you first open RStudio, you will be greeted by three panels:
+
+
The interactive R console/Terminal (entire left)
+
Environment/History/Connections (tabbed in upper right)
+
Files/Plots/Packages/Help/Viewer (tabbed in lower right)
+
+
Once you open files, such as R scripts, an editor panel will also
+open in the top left.
+
+
+
+
+
+
R scripts
+
+
+
Any commands that you write in the R console can be saved to a file
+to be run again later. Files containing R code to be run in this way are
+called R scripts. R scripts have .R at the end of their
+names to let you know what they are.
+
+
+
+
Workflow within RStudio
+
+
+
There are two main ways one can work within RStudio:
+
+
Test and play within the interactive R console then copy code into a
+.R file to run later.
+
+
+
This works well when doing small tests and initially starting
+off.
+
It quickly becomes laborious
+
+
+
Start writing in a .R file and use RStudio’s shortcut keys for the
+Run command to push the current line, selected lines or modified lines
+to the interactive R console.
+
+
+
This is a great way to start; all your code is saved for later
+
You will be able to run the file you create from within RStudio or
+using R’s source() function.
+
+
+
+
+
+
+
Tip: Running segments of your code
+
+
+
RStudio offers you great flexibility in running code from within the
+editor window. There are buttons, menu choices, and keyboard shortcuts.
+To run the current line, you can
+
+
click on the Run button above the editor panel, or
+
select “Run Lines” from the “Code” menu, or
+
hit Ctrl+Return in Windows or Linux or
+⌘+Return on OS X. (This shortcut can also be seen
+by hovering the mouse over the button.) To run a block of code, select
+it and then Run. If you have modified a line of code within
+a block of code you have just run, there is no need to reselect the
+section and Run; instead you can use the next button along,
+Re-run the previous region. This will run the previous code
+block including the modifications you have made.
+
+
+
+
+
Introduction to R
+
+
+
Much of your time in R will be spent in the R interactive console.
+This is where you will run all of your code, and it can be a useful
+environment to try out ideas before adding them to an R script file.
+This console in RStudio is the same as the one you would get if you
+typed in R in your command-line environment.
+
The first thing you will see in the R interactive session is a bunch
+of information, followed by a “>” and a blinking cursor. In many ways
+this is similar to the shell environment you learned about during the
+shell lessons: it operates on the same idea of a “Read, evaluate, print
+loop”: you type in commands, R tries to execute them, and then returns a
+result.
+
Using R as a calculator
+
+
+
The simplest thing you could do with R is to do arithmetic:
+
+
R
+
+
+1+100
+
+
+
OUTPUT
+
+
[1] 101
+
+
And R will print out the answer, with a preceding “[1]”. [1] is the
+index of the first element of the line being printed in the console. For
+more information on indexing vectors, see Episode
+6: Subsetting Data.
+
If you type in an incomplete command, R will wait for you to complete
+it. If you are familiar with Unix Shell’s bash, you may recognize
+this behavior from bash.
+
+
R
+
+
>1+
+
+
+
OUTPUT
+
+
+
+
+
Any time you hit return and the R session shows a “+” instead of a
+“>”, it means it’s waiting for you to complete the command. If you
+want to cancel a command you can hit Esc and RStudio will
+give you back the “>” prompt.
+
+
+
+
+
+
Tip: Canceling commands
+
+
+
If you’re using R from the command line instead of from within
+RStudio, you need to use Ctrl+C instead of
+Esc to cancel the command. This applies to Mac users as
+well!
+
Canceling a command isn’t only useful for killing incomplete
+commands: you can also use it to tell R to stop running code (for
+example if it’s taking much longer than you expect), or to get rid of
+the code you’re currently writing.
+
+
+
+
When using R as a calculator, the order of operations is the same as
+you would have learned back in school.
+
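From highest to lowest precedence (multiplication and division share a
+level, as do addition and subtraction; those operators are evaluated
+left to right):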
+
+
Parentheses: (, )
+
+
Exponents: ^ or **
+
+
Multiply: *
+
+
Divide: /
+
+
Add: +
+
+
Subtract: -
+
+
+
+
R
+
+
+3+5*2
+
+
+
OUTPUT
+
+
[1] 13
+
+
Use parentheses to group operations in order to force the order of
+evaluation if it differs from the default, or to make clear what you
+intend.
+
+
R
+
+
+(3+5)*2
+
+
+
OUTPUT
+
+
[1] 16
+
+
This can get unwieldy when not needed, but clarifies your intentions.
+Remember that others may later read your code.
+
+
R
+
+
+(3 + (5 * (2 ^ 2)))  # hard to read
+3 + 5 * 2 ^ 2        # clear, if you remember the rules
+3 + 5 * (2 ^ 2)      # if you forget some rules, this might help
+
+
The text after each line of code is called a “comment”. Anything that
+follows after the hash (or octothorpe) symbol # is ignored
+by R when it executes code.
+
Really small or large numbers get a scientific notation:
+
+
R
+
+
+2/10000
+
+
+
OUTPUT
+
+
[1] 2e-04
+
+
The e is shorthand for “multiplied by 10^XX”. So
+2e-4 is shorthand for 2 * 10^(-4).
+
You can write numbers in scientific notation too:
+
+
R
+
+
+5e3  # Note the lack of minus here
+
+
+
OUTPUT
+
+
[1] 5000
+
+
Mathematical functions
+
+
+
R has many built in mathematical functions. To call a function, we
+can type its name, followed by open and closing parentheses. Functions
+take arguments as inputs; anything we type inside the parentheses of a
+function is considered an argument. Depending on the function, the
+number of arguments can vary from none to multiple. For example:
+
+
R
+
+
+getwd()  # returns an absolute filepath
+
+
doesn’t require an argument, whereas for the next set of mathematical
+functions we will need to supply the function a value in order to
+compute the result.
+
+
R
+
+
+sin(1)  # trigonometry functions
+
+
+
OUTPUT
+
+
[1] 0.841471
+
+
+
R
+
+
+log(1)  # natural logarithm
+
+
+
OUTPUT
+
+
[1] 0
+
+
+
R
+
+
+log10(10)  # base-10 logarithm
+
+
+
OUTPUT
+
+
[1] 1
+
+
+
R
+
+
+exp(0.5)  # e^(1/2)
+
+
+
OUTPUT
+
+
[1] 1.648721
+
+
Don’t worry about trying to remember every function in R. You can
+look them up on Google, or if you can remember the start of the
+function’s name, use the tab completion in RStudio.
+
This is one advantage that RStudio has over R on its own: it has
+auto-completion abilities that allow you to more easily look up
+functions, their arguments, and the values that they take.
+
Typing a ? before the name of a command will open the
+help page for that command. When using RStudio, this will open the
+‘Help’ pane; if using R in the terminal, the help page will open in the
+terminal’s pager or in your browser, depending on your platform and
+settings. The help page will include a detailed description of the
+command and how it works. Scrolling to the bottom of the help page will
+usually show a collection of code examples which illustrate command
+usage. We’ll go through an example later.
+
Comparing things
+
+
+
We can also do comparisons in R:
+
+
R
+
+
+1 == 1  # equality (note two equals signs, read as "is equal to")
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
R
+
+
+1 != 2  # inequality (read as "is not equal to")
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
R
+
+
+1 < 2  # less than
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
R
+
+
+1 <= 1  # less than or equal to
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
R
+
+
+1 > 0  # greater than
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
R
+
+
+1 >= -9  # greater than or equal to
+
+
+
OUTPUT
+
+
[1] TRUE
+
+
+
+
+
+
+
Tip: Comparing Numbers
+
+
+
A word of warning about comparing numbers: you should never use
+== to compare two numbers unless they are integers (a data
+type which can specifically represent only whole numbers).
+
Computers may only represent decimal numbers with a certain degree of
+precision, so two numbers which look the same when printed out by R, may
+actually have different underlying representations and therefore be
+different by a small margin of error (called Machine numeric
+tolerance).
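+
A short sketch of the problem, and of one common way around it, is shown
+below. It uses base R's all.equal(), which compares numbers up to a small
+tolerance; the variable names are made up for the example.
+
```r
a <- 0.1 + 0.2
b <- 0.3

# Exact comparison: the two values differ in their last binary digits.
exact_match <- a == b

# Tolerant comparison: all.equal() treats tiny differences as equal.
# Wrapping it in isTRUE() gives a plain TRUE or FALSE.
close_match <- isTRUE(all.equal(a, b))

c(exact = exact_match, close = close_match)
```
+
The isTRUE() wrapper matters because all.equal() returns a description of
+the difference, rather than FALSE, when the values do not match.
+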
We can store values in variables using the assignment operator
+<-, like this:
+
+
R
+
+
+x <- 1/40
+
+
Notice that assignment does not print a value. Instead, we stored it
+for later in something called a variable.
+x now contains the value 0.025:
+
+
R
+
+
+x
+
+
+
OUTPUT
+
+
[1] 0.025
+
+
More precisely, the stored value is a decimal approximation
+of this fraction called a floating point
+number.
+
Look for the Environment tab in the top right panel of
+RStudio, and you will see that x and its value have
+appeared. Our variable x can be used in place of a number
+in any calculation that expects a number:
+
+
R
+
+
+log(x)
+
+
+
OUTPUT
+
+
[1] -3.688879
+
+
Notice also that variables can be reassigned:
+
+
R
+
+
+x <- 100
+
+
x used to contain the value 0.025 and now it has the
+value 100.
+
Assignment values can contain the variable being assigned to:
+
+
R
+
+
+x <- x + 1  # notice how RStudio updates its description of x on the top right tab
+y <- x * 2
+
+
The right hand side of the assignment can be any valid R expression.
+The right hand side is fully evaluated before the assignment
+occurs.
+
Variable names can contain letters, numbers, underscores and periods
+but no spaces. They must start with a letter or a period followed by a
+letter (they cannot start with a number nor an underscore). Variables
+beginning with a period are hidden variables. Different people use
+different conventions for long variable names. These include
+
+
periods.between.words
+
underscores_between_words
+
camelCaseToSeparateWords
+
+
What you use is up to you, but be consistent.
+
It is also possible to use the = operator for
+assignment:
+
+
R
+
+
+x = 1/40
+
+
But this is much less common among R users. The most important thing
+is to be consistent with the operator you use. There
+are occasionally places where it is less confusing to use
+<- than =, and it is the most common symbol
+used in the community. So the recommendation is to use
+<-.
+
+
+
+
+
+
Challenge 1
+
+
+
Which of the following are valid R variable names?
The following cannot be used as variable names:
+
+
R
+
+
_age
+min-length
+2widths
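+
If you are unsure about a name, R can check it for you. The sketch below
+relies on base R's make.names(), which rewrites a string into a valid
+name; a string that comes back unchanged was already valid. The helper
+variables candidates and is_valid are hypothetical.
+
```r
candidates <- c("periods.between.words", "underscores_between_words",
                "camelCaseToSeparateWords", "_age", "min-length", "2widths")

# make.names() repairs invalid names, so valid names are the ones
# that it leaves untouched.
is_valid <- candidates == make.names(candidates)
names(is_valid) <- candidates
is_valid
```
+
The three naming conventions listed earlier pass the check, while the
+three names above do not.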
+
+
+
+
+
+
Vectorization
+
+
+
One final thing to be aware of is that R is vectorized,
+meaning that variables and functions can have vectors as values. In
+contrast to physics and mathematics, a vector in R describes an ordered
+set of values that all have the same data type. For example
+
+
R
+
+
+1:5
+
+
+
OUTPUT
+
+
[1] 1 2 3 4 5
+
+
+
R
+
+
+2^(1:5)
+
+
+
OUTPUT
+
+
[1] 2 4 8 16 32
+
+
+
R
+
+
+x <- 1:5
+2^x
+
+
+
OUTPUT
+
+
[1] 2 4 8 16 32
+
+
This is incredibly powerful; we will discuss this further in an
+upcoming lesson.
+
Managing your environment
+
+
+
There are a few useful commands you can use to interact with the R
+session.
+
ls will list all of the variables and functions stored
+in the global environment (your working R session):
+
+
R
+
+
+ls()
+
+
+
OUTPUT
+
+
[1] "x" "y"
+
+
+
+
+
+
+
Tip: hidden objects
+
+
+
Like in the shell, ls will hide any variables or
+functions starting with a “.” by default. To list all objects, type
+ls(all.names=TRUE) instead.
+
+
+
+
Note here that we didn’t give any arguments to ls, but
+we still needed to give the parentheses to tell R to call the
+function.
+
If we type ls by itself, R prints a bunch of code
+instead of a listing of objects.
+
+
R
+
+
+ls
+
+
+
OUTPUT
+
+
function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE,
+ pattern, sorted = TRUE)
+{
+ if (!missing(name)) {
+ pos <- tryCatch(name, error = function(e) e)
+ if (inherits(pos, "error")) {
+ name <- substitute(name)
+ if (!is.character(name))
+ name <- deparse(name)
+ warning(gettextf("%s converted to character string",
+ sQuote(name)), domain = NA)
+ pos <- name
+ }
+ }
+ all.names <- .Internal(ls(envir, all.names, sorted))
+ if (!missing(pattern)) {
+ if ((ll <- length(grep("[", pattern, fixed = TRUE))) &&
+ ll != length(grep("]", pattern, fixed = TRUE))) {
+ if (pattern == "[") {
+ pattern <- "\\["
+ warning("replaced regular expression pattern '[' by '\\\\['")
+ }
+ else if (length(grep("[^\\\\]\\[<-", pattern))) {
+ pattern <- sub("\\[<-", "\\\\\\[<-", pattern)
+ warning("replaced '[<-' by '\\\\[<-' in regular expression pattern")
+ }
+ }
+ grep(pattern, all.names, value = TRUE)
+ }
+ else all.names
+}
+<bytecode: 0x557b0600c360>
+<environment: namespace:base>
+
+
What’s going on here?
+
Like everything in R, ls is the name of an object, and
+entering the name of an object by itself prints the contents of the
+object. The object x that we created earlier contains 1, 2,
+3, 4, 5:
+
+
R
+
+
+x
+
+
+
OUTPUT
+
+
[1] 1 2 3 4 5
+
+
The object ls contains the R code that makes the
+ls function work! We’ll talk more about how functions work
+and start writing our own later.
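If you want to confirm this from the console, the following short sketch asks R what kind of object ls is, with an ordinary vector shown for comparison.

```r
is.function(ls)   # TRUE: `ls` is an object that holds a function
class(ls)         # "function"
class(1:5)        # "integer": a data object, for comparison
```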
+
You can use rm to delete objects you no longer need:
+
+
R
+
+
+rm(x)
+
+
If you have lots of things in your environment and want to delete all
+of them, you can pass the results of ls to the
+rm function:
+
+
R
+
+
+rm(list = ls())
+
+
In this case we’ve combined the two. Like the order of operations,
+anything inside the innermost parentheses is evaluated first, and so
+on.
+
In this case we’ve specified that the results of ls
+should be used for the list argument in rm.
+When assigning values to arguments by name, you must use the
+= operator!
+
If instead we use <-, there will be unintended side
+effects, or you may get an error message:
+
+
R
+
+
+rm(list <- ls())
+
+
+
ERROR
+
+
Error in rm(list <- ls()): ... must contain names or character strings
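The side effect is easier to see with a function that does not fail. The sketch below is an illustration only: the function `demo` and the names inside it are invented for this example. It checks which variables exist after calling mean() with = and with <-.

```r
# Hypothetical helper: runs both calls inside its own environment
demo <- function() {
  here <- environment()

  # `=` inside a call names an argument; no variable `x` is created
  m1 <- mean(x = c(1, 2, 3))
  x_created <- exists("x", envir = here, inherits = FALSE)

  # `<-` inside a call first creates the variable `y`, then passes its value
  m2 <- mean(y <- c(4, 5, 6))
  y_created <- exists("y", envir = here, inherits = FALSE)

  list(m1 = m1, m2 = m2, x_created = x_created, y_created = y_created)
}

result <- demo()
result$x_created   # FALSE
result$y_created   # TRUE
```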
+
+
+
+
+
+
+
Tip: Warnings vs. Errors
+
+
+
Pay attention when R does something unexpected! Errors, like above,
+are thrown when R cannot proceed with a calculation. Warnings on the
+other hand usually mean that the function has run, but it probably
+hasn’t worked as expected.
+
In both cases, the message that R prints out usually gives you clues
+how to fix a problem.
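To see the two side by side, here is a small sketch that captures one warning and one error as text. The calls were chosen for illustration and are not part of the lesson's data.

```r
# A warning: as.numeric() still returns a result (NA), but flags a problem
w <- tryCatch(as.numeric("abc"), warning = function(cond) conditionMessage(cond))

# An error: adding a number and a string cannot proceed at all
e <- tryCatch(1 + "a", error = function(cond) conditionMessage(cond))

# With the warning silenced, the call completes and returns NA
res <- suppressWarnings(as.numeric("abc"))
```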
+
+
+
+
R Packages
+
+
+
It is possible to add functions to R by writing a package, or by
+obtaining a package written by someone else. As of this writing, there
+are over 10,000 packages available on CRAN (the Comprehensive R Archive
+Network). R and RStudio have functionality for managing packages:
+
+
You can see what packages are installed by typing
+installed.packages()
+
+
You can install packages by typing
+install.packages("packagename"), where
+packagename is the package name, in quotes.
+
You can update installed packages by typing
+update.packages()
+
+
You can remove a package with
+remove.packages("packagename")
+
+
You can make a package available for use with
+library(packagename)
+
+
+
Packages can also be viewed, loaded, and detached in the Packages tab
+of the lower right panel in RStudio. Clicking on this tab will display
+all of the installed packages with a checkbox next to them. If the box
+next to a package name is checked, the package is loaded and if it is
+empty, the package is not loaded. Click an empty box to load that
+package and click a checked box to detach that package.
+
Packages can be installed and updated from the Packages tab with the
+Install and Update buttons at the top of the tab.
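A common pattern, sketched here under the assumption that you only want to install a package when it is missing, combines these commands. The package "stats" ships with R and is used purely as a stand-in for any package name.

```r
# "stats" is a placeholder; substitute the package you actually need
pkg <- "stats"

# TRUE if the package is installed (this does not attach it)
is_installed <- requireNamespace(pkg, quietly = TRUE)

# Install only when missing, then make the package available
if (!is_installed) {
  install.packages(pkg)
}
library(pkg, character.only = TRUE)
```

The character.only = TRUE argument tells library() that pkg is a variable holding the name, not the name itself.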
+
+
+
+
+
+
Challenge 2
+
+
+
What will be the value of each variable after each statement in the
+following program?
The scientific process is naturally incremental, and many projects
+start life as random notes, some code, then a manuscript, and eventually
+everything is a bit mixed together.
+
+
+Managing your projects in a reproducible fashion doesn’t just make your
+science reproducible, it makes your life easier.
+
Most people tend to organize their projects like this:
+
There are many reasons why we should ALWAYS avoid this:
+
+
It is really hard to tell which version of your data is the original
+and which is the modified;
+
It gets really messy because it mixes files with various extensions
+together;
+
It probably takes you a lot of time to actually find things, and
+relate the correct figures to the exact code that has been used to
+generate them;
+
+
A good project layout will ultimately make your life easier:
+
+
It will help ensure the integrity of your data;
+
It makes it simpler to share your code with someone else (a
+lab-mate, collaborator, or supervisor);
+
It allows you to easily upload your code with your manuscript
+submission;
+
It makes it easier to pick the project back up after a break.
+
A possible solution
+
+
+
Fortunately, there are tools and packages which can help you manage
+your work effectively.
+
One of the most powerful and useful aspects of RStudio is its project
+management functionality. We’ll be using this today to create a
+self-contained, reproducible project.
+
+
+
+
+
+
Challenge 1: Creating a self-contained
+project
+
+
+
We’re going to create a new project in RStudio:
+
+
Click the “File” menu button, then “New Project”.
+
Click “New Directory”.
+
Click “New Project”.
+
Type in the name of the directory to store your project,
+e.g. “my_project”.
+
If available, select the checkbox for “Create a git
+repository.”
+
Click the “Create Project” button.
+
+
+
+
+
The simplest way to open an RStudio project once it has been created
+is to click through your file system to get to the directory where it
+was saved and double click on the .Rproj file. This will
+open RStudio and start your R session in the same directory as the
+.Rproj file. All your data, plots and scripts will now be
+relative to the project directory. RStudio projects have the added
+benefit of allowing you to open multiple projects at the same time, each
+open to its own project directory. This allows you to keep multiple
+projects open without them interfering with each other.
+
+
+
+
+
+
Challenge 2: Opening an RStudio project
+through the file system
+
+
+
+
Exit RStudio.
+
Navigate to the directory where you created a project in Challenge
+1.
+
Double click on the .Rproj file in that directory.
+
+
+
+
+
Best practices for project organization
+
+
+
Although there is no “best” way to lay out a project, there are some
+general principles to adhere to that will make project management
+easier:
+
+
Treat data as read only
+
+
This is probably the most important goal of setting up a project.
+Data is typically time consuming and/or expensive to collect. Working
+with it interactively (e.g., in Excel), where it can be modified,
+means you are never sure of where the data came from, or how it has been
+modified since collection. It is therefore a good idea to treat your
+data as “read-only”.
+
+
+
Data Cleaning
+
+
In many cases your data will be “dirty”: it will need significant
+preprocessing to get into a format R (or any other programming language)
+will find useful. This task is sometimes called “data munging”. Storing
+the scripts that do this cleaning in a separate folder, and creating a second “read-only”
+data folder to hold the “cleaned” data sets can prevent confusion
+between the two sets.
+
+
+
Treat generated output as disposable
+
+
Anything generated by your scripts should be treated as disposable:
+it should all be able to be regenerated from your scripts.
+
There are lots of different ways to manage this output. Having an
+output folder with different sub-directories for each separate analysis
+makes it easier later, since many analyses are exploratory and don’t end
+up being used in the final project, and some of the analyses get shared
+between projects.
+
+
+
+
+
+
Tip: Good Enough Practices for Scientific
+Computing
+
Put each project in its own directory, which is named after the
+project.
+
Put text documents associated with the project in the
+doc directory.
+
Put raw data and metadata in the data directory, and
+files generated during cleanup and analysis in a results
+directory.
+
Put source for the project’s scripts and programs in the
+src directory, and programs brought in from elsewhere or
+compiled locally in the bin directory.
+
Name all files to reflect their content or function.
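The layout described in this tip can be created from R itself. The sketch below builds it inside a temporary directory so that nothing on your machine is changed; the project name "my_project" is a placeholder.

```r
# Build the layout under tempdir(); "my_project" is a hypothetical name
project <- file.path(tempdir(), "my_project")
subdirs <- c("data", "doc", "results", "src", "bin")

for (d in subdirs) {
  dir.create(file.path(project, d), recursive = TRUE, showWarnings = FALSE)
}

# List the sub-directories that now exist
list.dirs(project, full.names = FALSE, recursive = FALSE)
```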
+
+
+
+
+
+
+
Separate function definition and application
+
+
One of the more effective ways to work with R is to start by writing
+the code you want to run directly in a .R script, and then running the
+selected lines (either using the keyboard shortcuts in RStudio or
+clicking the “Run” button) in the interactive R console.
+
When your project is in its early stages, the initial .R script file
+usually contains many lines of directly executed code. As it matures,
+reusable chunks get pulled into their own functions. It’s a good idea to
+split this code into two separate folders: one to store useful
+functions that you’ll reuse across analyses and projects, and one to
+store the analysis scripts.
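One possible way to wire the two together is with source(). In this sketch a temporary file stands in for a functions file such as src/functions.R; the file location and the function kg_to_lb are invented for illustration.

```r
# Stand-in for a functions file (hypothetical path and contents)
functions_file <- tempfile(fileext = ".R")
writeLines(c(
  "# Convert kilograms to pounds",
  "kg_to_lb <- function(kg) {",
  "  kg * 2.20462",
  "}"
), functions_file)

# Stand-in for an analysis script: load the functions, then apply them
source(functions_file)
weights_lb <- kg_to_lb(c(2.1, 5.0, 3.2))
```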
+
+
+
Save the data in the data directory
+
+
Now that we have a good directory structure, we will save the
+data file in the data/ directory.
Download the file (right mouse click on the link above -> “Save
+link as” / “Save file as”, or click on the link and after the page
+loads, press Ctrl+S or choose File -> “Save
+page as”)
+
Make sure it’s saved under the name
+gapminder_data.csv
+
+
Save the file in the data/ folder within your
+project.
+
+
We will load and inspect these data later.
+
+
+
+
+
+
+
+
+
Challenge 4
+
+
+
It is useful to get some general idea about the dataset, directly
+from the command line, before loading it into R. Understanding the
+dataset better will come in handy when making decisions on how to load
+it in R. Use the command-line shell to answer the following
+questions:
+
+
What is the size of the file?
+
How many rows of data does it contain?
+
What kinds of values are stored in this file?
+
+
+
+
+
+
+
+
+
+
By running these commands in the shell:
+
+
SH
+
+
ls -lh data/gapminder_data.csv
+
+
+
OUTPUT
+
+
-rw-r--r-- 1 runner docker 80K Oct 26 09:54 data/gapminder_data.csv
The Terminal tab in the console pane provides a convenient place
+within RStudio to interact directly with the command line.
+
+
+
+
+
+
Working directory
+
+
Knowing R’s current working directory is important because when you
+need to access other files (for example, to import a data file), R will
+look for them relative to the current working directory.
+
Each time you create a new RStudio Project, it will create a new
+directory for that project. When you open an existing
+.Rproj file, it will open that project and set R’s working
+directory to the folder that file is in.
+
+
+
+
+
+
Challenge 5
+
+
+
You can check the current working directory with the
+getwd() command, or by using the menus in RStudio.
+
+
In the console, type getwd() (“wd” is short for
+“working directory”) and hit Enter.
+
In the Files pane, double click on the data folder to
+open it (or navigate to any other folder you wish). To get the Files
+pane back to the current working directory, click “More” and then select
+“Go To Working Directory”.
+
+
You can change the working directory with setwd(), or by
+using RStudio menus.
+
+
In the console, type setwd("data") and hit Enter. Type
+getwd() and hit Enter to see the new working
+directory.
+
In the menus at the top of the RStudio window, click the “Session”
+menu button, and then select “Set Working Directory” and then “Choose
+Directory”. Next, in the file browser window that opens, navigate back to
+the project directory, and click “Open”. Note that a setwd
+command will automatically appear in the console.
+
+
+
+
+
+
+
+
+
+
Tip: File does not exist errors
+
+
+
When you’re attempting to reference a file in your R code and you’re
+getting errors saying the file doesn’t exist, it’s a good idea to check
+your working directory. You need to either provide an absolute path to
+the file, or you need to make sure the file is saved in the working
+directory (or a subfolder of the working directory) and provide a
+relative path.
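A quick way to diagnose this from the console is sketched below. It assumes the relative path used in this lesson; adjust it to the file you are trying to open.

```r
# Relative path as used in this lesson
path <- "data/gapminder_data.csv"

getwd()             # where R starts looking for relative paths
file.exists(path)   # FALSE means the path does not resolve from here

# The full path R would actually try to open
full_path <- file.path(getwd(), path)
```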
To be able to read R help files for functions and special
+operators.
+
To be able to use CRAN task views to identify packages to solve a
+problem.
+
To be able to seek help from your peers.
+
+
+
+
+
+
+
Reading Help Files
+
+
+
R, and every package, provide help files for functions. The general
+syntax to search for help on any function, “function_name”, that is
+in a package loaded into your namespace (your
+interactive R session) is:
+
+
R
+
+
+?function_name
+help(function_name)
+
+
For example, take a look at the help file for
+write.table(); we will be using a similar function in an
+upcoming episode.
+
+
R
+
+
+?write.table()
+
+
This will load up a help page in RStudio (or as plain text in R
+itself).
+
Each help page is broken down into sections:
+
+
Description: An extended description of what the function does.
+
Usage: The arguments of the function and their default values (which
+can be changed).
+
Arguments: An explanation of the data each argument is
+expecting.
+
Details: Any important details to be aware of.
+
Value: The data the function returns.
+
See Also: Any related functions you might find useful.
+
Examples: Some examples for how to use the function.
+
+
Different functions might have different sections, but these are the
+main ones you should be aware of.
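The information in the Usage section can also be queried from the console. This is a small sketch using formals(), which returns a function's arguments together with their defaults.

```r
# Argument names of write.table(), in the order shown under "Usage"
arg_names <- names(formals(write.table))

# The default value of one argument, here the field separator (a single space)
default_sep <- formals(write.table)$sep
```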
+
Notice how related functions might call for the same help file:
+
+
R
+
+
+?write.table()
+?write.csv()
+
+
This is because these functions have very similar applicability and
+often share the same arguments as inputs to the function, so package
+authors often choose to document them together in a single help
+file.
+
+
+
+
+
+
Tip: Running Examples
+
+
+
From within the function help page, you can highlight code in the
+Examples and hit Ctrl+Return to run it in RStudio
+console. This gives you a quick way to get a feel for how a function
+works.
+
+
+
+
+
+
+
+
+
Tip: Reading Help Files
+
+
+
One of the most daunting aspects of R is the large number of
+functions available. It would be prohibitive, if not impossible to
+remember the correct usage for every function you use. Luckily, using
+the help files means you don’t have to remember that!
+
+
+
+
Special Operators
+
+
+
To seek help on special operators, use quotes or backticks:
+
+
R
+
+
+?"<-"
+?`<-`
+
+
Getting Help with Packages
+
+
+
Many packages come with “vignettes”: tutorials and extended example
+documentation. Without any arguments, vignette() will list
+all vignettes for all installed packages;
+vignette(package="package-name") will list all available
+vignettes for package-name, and
+vignette("vignette-name") will open the specified
+vignette.
+
If a package doesn’t have any vignettes, you can usually find help by
+typing help("package-name").
+
RStudio also has a set of excellent cheatsheets for
+many packages.
+
When You Remember Part of the Function Name
+
+
+
If you’re not sure what package a function is in or how it’s
+specifically spelled, you can do a fuzzy search:
+
+
R
+
+
+??function_name
+
+
A fuzzy search is when you search for an approximate string match.
+For example, you may remember that the function to set your working
+directory includes “set” in its name. You can do a fuzzy search to help
+you identify the function:
+
+
R
+
+
+??set
+
+
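As a complement, apropos() searches the names of objects that are currently available rather than the help pages. The sketch below looks for names starting with "set"; the regular expression is an assumption made for this example.

```r
# All available objects whose names start with "set"
set_matches <- apropos("^set")

# The function for setting the working directory is among them
"setwd" %in% set_matches
```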
When You Have No Idea Where to Begin
+
+
+
If you don’t know what function or package you need to use, CRAN Task Views is a
+specially maintained list of packages grouped into fields. This can be a
+good starting point.
+
When Your Code Doesn’t Work: Seeking Help from Your Peers
+
+
+
If you’re having trouble using a function, 9 times out of 10, the
+question you have has already been answered on Stack Overflow. You can search
+using the [r] tag. Please make sure to see their page on how to ask a good
+question.
+
If you can’t find the answer, there are a few useful functions to
+help you ask your peers:
+
+
R
+
+
+?dput
+
+
dput() will dump the data you’re working with into a format that can be
+copied and pasted by others into their own R session.
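For instance, with a small data frame invented for this example, dput() prints R code that rebuilds the object. The last line simulates a colleague pasting that code into their own session.

```r
# A small data frame standing in for "the data you're working with"
example_df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# dput() prints R code that recreates the object
dput(example_df)

# Round trip: evaluate the printed code to rebuild the same data frame
rebuilt <- eval(parse(text = capture.output(dput(example_df))))
```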
+
+
R
+
+
+sessionInfo()
+
+
+
OUTPUT
+
+
R version 4.3.1 (2023-06-16)
+Platform: x86_64-pc-linux-gnu (64-bit)
+Running under: Ubuntu 22.04.3 LTS
+
+Matrix products: default
+BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
+LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
+
+locale:
+ [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
+ [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
+ [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
+[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
+
+time zone: UTC
+tzcode source: system (glibc)
+
+attached base packages:
+[1] stats graphics grDevices utils datasets methods base
+
+loaded via a namespace (and not attached):
+[1] compiler_4.3.1 tools_4.3.1 rstudioapi_0.15.0 yaml_2.3.7
+[5] knitr_1.43 xfun_0.40 renv_1.0.3 evaluate_0.21
+
+
sessionInfo() will print out your current version of R, as well as any packages you
+have loaded. This can be useful for others to help reproduce and debug
+your issue.
+
+
+
+
+
+
Challenge 1
+
+
+
Look at the help page for the c function. What kind of
+vector do you expect will be created if you evaluate the following:
+
+
R
+
+
+c(1, 2, 3)
+c('d', 'e', 'f')
+c(1, 2, 'f')
+
+
+
+
+
+
+
+
+
+
The c() function creates a vector, in which all elements
+are of the same type. In the first case, the elements are numeric, in
+the second, they are characters, and in the third they are also
+characters: the numeric values are “coerced” to be characters.
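You can check this answer yourself with class(), as in this short sketch:

```r
class(c(1, 2, 3))         # "numeric"
class(c('d', 'e', 'f'))   # "character"

mixed <- c(1, 2, 'f')
class(mixed)              # "character"
mixed                     # the numbers are now the strings "1" and "2"
```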
+
+
+
+
+
+
+
+
+
+
Challenge 2
+
+
+
Look at the help for the paste function. You will need
+to use it later. What’s the difference between the sep and
+collapse arguments?
+
+
+
+
+
+
+
+
+
To look at the help for the paste() function, use:
+
+
R
+
+
+help("paste")
+?paste
+
+
The difference between sep and collapse is
+a little tricky. The paste function accepts any number of
+arguments, each of which can be a vector of any length. The
+sep argument specifies the string used between concatenated
+terms — by default, a space. The result is a vector as long as the
+longest argument supplied to paste. In contrast,
+collapse specifies that after concatenation the elements
+are collapsed together using the given separator, the result
+being a single string.
+
It is important to call the arguments explicitly by typing out the
+argument name, e.g. sep = ",", so the function understands to
+use the “,” as a separator and not as a term to concatenate. For example:
+
+
R
+
+
+paste(c("a","b"), "c")
+
+
+
OUTPUT
+
+
[1] "a c" "b c"
+
+
+
R
+
+
+paste(c("a","b"), "c", ",")
+
+
+
OUTPUT
+
+
[1] "a c ," "b c ,"
+
+
+
R
+
+
+paste(c("a","b"), "c", sep = ",")
+
+
+
OUTPUT
+
+
[1] "a,c" "b,c"
+
+
+
R
+
+
+paste(c("a","b"), "c", collapse = "|")
+
+
+
OUTPUT
+
+
[1] "a c|b c"
+
+
+
R
+
+
+paste(c("a","b"), "c", sep = ",", collapse = "|")
+
+
+
OUTPUT
+
+
[1] "a,c|b,c"
+
+
(For more information, scroll to the bottom of the
+?paste help page and look at the examples, or try
+example('paste').)
+
+
+
+
+
+
+
+
+
+
Challenge 3
+
+
+
Use help to find a function (and its associated parameters) that you
+could use to load data from a tabular file in which columns are
+delimited with “\t” (tab) and the decimal point is a “.” (period). This
+check for decimal separator is important, especially if you are working
+with international colleagues, because different countries have
+different conventions for the decimal point (i.e. comma vs period).
+Hint: use ??"read table" to look up functions related to
+reading in tabular data.
+
+
+
+
+
+
+
+
+
The standard R function for reading tab-delimited files with a period
+decimal separator is read.delim(). You can also do this with
+read.table(file, sep="\t") (the period is the
+default decimal separator for read.table()),
+although you may have to change the comment.char argument
+as well if your data file contains hash (#) characters.
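A minimal sketch of both approaches, using a tiny tab-delimited file written to a temporary location. The column names and values are made up for this example.

```r
# Write a small tab-delimited file (hypothetical contents)
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c("name\tvalue", "a\t1.5", "b\t2.25"), tsv_file)

# read.delim() assumes sep = "\t" and dec = "."
from_delim <- read.delim(tsv_file)

# The same file with read.table(), spelling the arguments out
from_table <- read.table(tsv_file, sep = "\t", dec = ".", header = TRUE)
```

Note that read.table() needs header = TRUE here, whereas read.delim() assumes a header row by default.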
To begin exploring data frames, and understand how they are related
+to vectors and lists.
+
To be able to ask questions from R about the type, class, and
+structure of an object.
+
To understand the information of the attributes “names”, “class”,
+and “dim”.
+
+
+
+
+
+
+
One of R’s most powerful features is its ability to deal with tabular
+data - such as you may already have in a spreadsheet or a CSV file.
+Let’s start by making a toy dataset in your data/
+directory, called feline-data.csv:
We can now save cats as a CSV file. It is good practice
+to call the argument names explicitly so the function knows what default
+values you are changing. Here we are setting
+row.names = FALSE. Recall you can use
+?write.csv to pull up the help file to check out the
+argument names and their default values.
The read.table function is used for reading in tabular
+data stored in a text file where the columns of data are separated by
+punctuation characters, such as CSV files (csv = comma-separated values).
+Tabs and commas are the most common punctuation characters used to
+separate or delimit data points in csv files. For convenience R provides
+two other versions of read.table. These are:
+read.csv for files where the data are separated with commas
+and read.delim for files where the data are separated with
+tabs. Of these three functions read.csv is the most
+commonly used. If needed, it is possible to override the default
+delimiting punctuation marks for both read.csv and
+read.delim.
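The relationship between these functions can be sketched as follows. The data frame cats_demo and the temporary files are stand-ins created for this example, not the lesson's own files.

```r
# A toy table modelled on the cats data (hypothetical object name)
cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
                        weight = c(2.1, 5.0, 3.2),
                        likes_string = c(1, 0, 1))
csv_file <- tempfile(fileext = ".csv")
write.csv(cats_demo, file = csv_file, row.names = FALSE)

# read.csv() is read.table() with comma-friendly defaults
via_csv <- read.csv(csv_file)
via_table <- read.table(csv_file, header = TRUE, sep = ",")

# Overriding the delimiter, e.g. for a semicolon-separated file
semi_file <- tempfile(fileext = ".csv")
writeLines(c("coat;weight", "calico;2.1", "black;5.0"), semi_file)
via_semicolon <- read.csv(semi_file, sep = ";")
```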
+
+
+
+
+
+
Check your data for factors
+
+
+
The default way R handles textual data has changed in recent
+versions. Text data used to be interpreted by R automatically as a format
+called “factors”. But there is an easier format that is called
+“character”. We will hear about factors later, and what to use them for.
+For now, remember that in most cases, they are not needed and only
+complicate your life, which is why newer R versions read in text as
+“character”. Check now if your version of R has automatically created
+factors and convert them to “character” format:
+
+
Check the data types of your input by typing
+str(cats)
+
+
In the output, look at the type codes after the colons: If
+you see no factor columns (only codes such as “num”, “int” and “chr”), you can continue with the lesson and skip
+this box. If you find “Factor” (shown as “fct” in some displays), continue to step 3.
+
Prevent R from automatically creating “factor” data. That can be
+done by the following code:
+options(stringsAsFactors = FALSE). Then, re-read the cats
+table for the change to take effect.
+
You must set this option every time you restart R. To not forget
+this, include it in your analysis script before you read in any data,
+for example in one of the first lines.
+
From R version 4.0.0 onward, text data is no longer converted
+to factors by default. So you can install this or a newer version to avoid
+this problem. If you are working on an institute or company computer,
+ask your administrator to do it.
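If you already have a data frame that contains factors, you can also convert the affected columns after the fact. This is a sketch on toy data; the object pets is invented for illustration, and stringsAsFactors = TRUE is used only to reproduce the old behaviour.

```r
# Force the old behaviour to get a factor column (toy data)
pets <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   stringsAsFactors = TRUE)

# Which columns are factors?
factor_cols <- sapply(pets, is.factor)

# Convert only those columns to character
pets[factor_cols] <- lapply(pets[factor_cols], as.character)
str(pets)
```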
+
+
+
+
+
We can begin exploring our dataset right away, pulling out columns by
+specifying them using the $ operator:
+
+
R
+
+
+cats$weight
+
+
+
OUTPUT
+
+
[1] 2.1 5.0 3.2
+
+
+
R
+
+
+cats$coat
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
We can do other operations on the columns:
+
+
R
+
+
+## Say we discovered that the scale weighs two kg light:
+cats$weight + 2
+
+
+
OUTPUT
+
+
[1] 4.1 7.0 5.2
+
+
+
R
+
+
+paste("My cat is", cats$coat)
+
+
+
OUTPUT
+
+
[1] "My cat is calico" "My cat is black" "My cat is tabby"
+
+
But what about
+
+
R
+
+
+cats$weight + cats$coat
+
+
+
ERROR
+
+
Error in cats$weight + cats$coat: non-numeric argument to binary operator
+
+
Understanding what happened here is key to successfully analyzing
+data in R.
+
+
Data Types
+
+
If you guessed that the last command will return an error because
+2.1 plus "black" is nonsense, you’re right -
+and you already have some intuition for an important concept in
+programming called data types. We can ask what type of data
+something is:
+
+
R
+
+
+typeof(cats$weight)
+
+
+
OUTPUT
+
+
[1] "double"
+
+
There are 5 main types: double, integer,
+complex, logical and character.
+For historic reasons, double is also called
+numeric.
+
+
R
+
+
+typeof(3.14)
+
+
+
OUTPUT
+
+
[1] "double"
+
+
+
R
+
+
+typeof(1L) # The L suffix forces the number to be an integer, since by default R uses floating-point numbers
+
+
+
OUTPUT
+
+
[1] "integer"
+
+
+
R
+
+
+typeof(1+1i)
+
+
+
OUTPUT
+
+
[1] "complex"
+
+
+
R
+
+
+typeof(TRUE)
+
+
+
OUTPUT
+
+
[1] "logical"
+
+
+
R
+
+
+typeof('banana')
+
+
+
OUTPUT
+
+
[1] "character"
+
+
No matter how complicated our analyses become, all data in R is
+interpreted as one of these basic data types. This strictness has some
+really important consequences.
+
A user has added details of another cat. This information is in the
+file data/feline-data_v2.csv.
+
+
R
+
+
+file.show("data/feline-data_v2.csv")
+
+
+
R
+
+
coat,weight,likes_string
+calico,2.1,1
+black,5.0,0
+tabby,3.2,1
+tabby,2.3 or 2.4,1
+
+
Load the new cats data like before, and check what type of data we
+find in the weight column:
Oh no, our weights aren’t the double type anymore! If we try to do
+the same math we did on them before, we run into trouble:
+
+
R
+
+
+cats$weight + 2
+
+
+
ERROR
+
+
Error in cats$weight + 2: non-numeric argument to binary operator
+
+
What happened? The cats data we are working with is
+something called a data frame. Data frames are one of the most
+common and versatile types of data structures we will work with
+in R. A given column in a data frame cannot be composed of different
+data types. In this case, R does not read everything in the data frame
+column weight as a double, therefore the entire
+column data type changes to something that is suitable for everything in
+the column.
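The effect can be reproduced without any files by passing the text directly to read.csv(). This sketch uses two small inline tables modelled on the cats data; stringsAsFactors = FALSE is set explicitly so the result does not depend on your R version.

```r
# All weights are numbers, so the column is read as double
clean <- read.csv(text = "coat,weight\ncalico,2.1\nblack,5.0")
typeof(clean$weight)   # "double"

# One entry is not a number, so the whole column becomes character
messy <- read.csv(text = "coat,weight\ncalico,2.1\ntabby,2.3 or 2.4",
                  stringsAsFactors = FALSE)
typeof(messy$weight)   # "character"
```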
+
When R reads a csv file, it reads it in as a data frame.
+Thus, when we loaded the cats csv file, it is stored as a
+data frame. We can recognize data frames by the first row that is
+written by the str() function:
Data frames are composed of rows and columns, where each
+column has the same number of rows. Different columns in a data frame
+can be made up of different data types (this is what makes them so
+versatile), but everything in a given column needs to be the same type
+(e.g., vector, factor, or list).
+
Let’s explore more about different data structures and how they
+behave. For now, let’s remove that extra line from our cats data and
+reload it, while we investigate this behavior further:
To better understand this behavior, let’s meet another of the data
+structures: the vector.
+
+
R
+
+
+my_vector <- vector(length = 3)
+my_vector
+
+
+
OUTPUT
+
+
[1] FALSE FALSE FALSE
+
+
A vector in R is essentially an ordered list of things, with the
+special condition that everything in the vector must be the same
+basic data type. If you don’t choose the datatype, it’ll default to
+logical; or, you can declare an empty vector of whatever
+type you like.
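For example, a character vector of length three could be declared and inspected like this. The name another_vector is chosen here for illustration.

```r
# Declare an empty character vector of length 3 (hypothetical name)
another_vector <- vector(mode = "character", length = 3)
another_vector        # three empty strings

# str() gives a compact summary of the object
str(another_vector)   # chr [1:3] "" "" ""
```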
The somewhat cryptic output from this command indicates the basic
+data type found in this vector - in this case chr,
+character; an indication of the number of things in the vector -
+actually, the indexes of the vector, in this case [1:3];
+and a few examples of what’s actually in the vector - in this case empty
+character strings. If we similarly do
+
+
R
+
+
+str(cats$weight)
+
+
+
OUTPUT
+
+
num [1:3] 2.1 5 3.2
+
+
we see that cats$weight is a vector, too - the
+columns of data we load into R data.frames are all vectors, and
+that’s the root of why R forces everything in a column to be the same
+basic data type.
+
+
+
+
+
+
Discussion 1
+
+
+
Why is R so opinionated about what we put in our columns of data? How
+does this help us?
+
+
+
+
+
+
By keeping everything in a column the same, we allow ourselves to
+make simple assumptions about our data; if you can interpret one entry
+in the column as a number, then you can interpret all of them
+as numbers, so we don’t have to check every time. This consistency is
+what people mean when they talk about clean data; in the long
+run, strict consistency goes a long way to making our lives easier in
+R.
+
+
+
+
+
+
+
+
+
Coercion by combining vectors
+
+
You can also make vectors with explicit contents with the combine
+function:
+
+
R
+
+
+combine_vector <- c(2, 6, 3)
+combine_vector
+
+
+
OUTPUT
+
+
[1] 2 6 3
+
+
Given what we’ve learned so far, what do you think the following will
+produce?
+
+
R
+
+
+quiz_vector <- c(2, 6, '3')
+
+
This is something called type coercion, and it is the source
+of many surprises and the reason why we need to be aware of the basic
+data types and how R will interpret them. When R encounters a mix of
+types (here double and character) to be combined into a single vector,
+it will force them all to be the same type. Consider:
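A short sketch of two such mixes follows; the second vector, another, is an extra example added here for comparison.

```r
# Double and character: everything becomes character
quiz_vector <- c(2, 6, '3')
quiz_vector           # "2" "6" "3"
typeof(quiz_vector)   # "character"

# Double and logical: TRUE becomes 1
another <- c(0, TRUE)
another               # 0 1
typeof(another)       # "double"
```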
The coercion rules go: logical ->
+integer -> double (“numeric”)
+-> complex -> character, where -> can
+be read as are transformed into. For example, combining
+logical and character transforms the result to
+character:
+
+
R
+
+
+c('a', TRUE)
+
+
+
OUTPUT
+
+
[1] "a" "TRUE"
+
+
A quick way to recognize character vectors is by the
+quotes that enclose them when they are printed.
+
You can try to force coercion against this flow using the
+as. functions:
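The following is an illustrative sketch with made-up values: it converts strings to numbers, numbers to logicals, and shows what happens to a string that is not a number.

```r
chars <- c("0", "2", "4")
nums <- as.double(chars)     # 0 2 4
flags <- as.logical(nums)    # FALSE TRUE TRUE (only 0 maps to FALSE)

# A value that cannot be interpreted becomes NA (R also issues a warning)
bad <- suppressWarnings(as.double(c("1", "two")))   # 1 NA
```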
As you can see, some surprising things can happen when R forces one
+basic data type into another! Nitty-gritty of type coercion aside, the
+point is: if your data doesn’t look like what you thought it was going
+to look like, type coercion may well be to blame; make sure everything
+is the same type in your vectors and your columns of data.frames, or you
+will get nasty surprises!
+
But coercion can also be very useful! For example, in our
+cats data likes_string is numeric, but we know
+that the 1s and 0s actually represent TRUE and
+FALSE (a common way of representing them). We should use
+the logical datatype here, which has two states:
+TRUE or FALSE, which is exactly what our data
+represents. We can ‘coerce’ this column to be logical by
+using the as.logical function:
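A minimal sketch of that conversion, using a small stand-in data frame (cats_demo is a name invented for this example) so it runs on its own:

```r
# Stand-in for the cats data (hypothetical object name)
cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
                        likes_string = c(1, 0, 1))

# Replace the numeric column with its logical equivalent
cats_demo$likes_string <- as.logical(cats_demo$likes_string)
cats_demo$likes_string   # TRUE FALSE TRUE
```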
An important part of every data analysis is cleaning the input data.
+If you know that the input data is all of the same format
+(e.g. numbers), your analysis is much easier! Clean the cat data set
+from the chapter about type coercion.
+
+
Copy the code template
+
+
Create a new script in RStudio and copy and paste the following code.
+Then move on to the tasks below, which help you to fill in the gaps
+(______).
+
# Read data
+cats <- read.csv("data/feline-data_v2.csv")
+
+# 1. Print the data
+_____
+
+# 2. Show an overview of the table with all data types
+_____(cats)
+
+# 3. The "weight" column has the incorrect data type __________.
+# The correct data type is: ____________.
+
+# 4. Correct the 4th weight data point with the mean of the two given values
+cats$weight[4] <- 2.35
+# print the data again to see the effect
+cats
+
+# 5. Convert the weight to the right data type
+cats$weight <- ______________(cats$weight)
+
+# Calculate the mean to test yourself
+mean(cats$weight)
+
+# If you see the correct mean value (and not NA), you did the exercise
+# correctly!
+
+
+
Instructions for the tasks
+
+
+
1. Print the data
+
+
Execute the first statement (read.csv(...)). Then print
+the data to the console
+
+
+
+
+
+
+
+
+
+
+
Show the content of any variable by typing its name.
+
+
Solution to Challenge 1.1
+
+
Two correct solutions:
+
cats
+print(cats)
+
+
+
+
+
+
+
+
+
+
+
2. Overview of the data types
+
+
+
The data type of your data is as important as the data itself. Use a
+function we saw earlier to print out the data types of all columns of
+the cats table.
+
+
+
+
+
+
+
+
+
In the chapter “Data types” we saw two functions that can show data
+types. One printed just a single word, the data type name. The other
+printed a short form of the data type, and the first few values. We need
+the second here.
+
+
+
+
+
+
+
+
+
+
Challenge 1 (continued)
+
+
+
+
Solution to Challenge 1.2
+
str(cats)
+
+
+
3. Which data type do we need?
+
+
The shown data type is not the right one for this data (weight of a
+cat). Which data type do we need?
+
+
Why did the read.csv() function not choose the correct
+data type?
+
Fill in the gap in the comment with the correct data type for cat
+weight!
+
+
+
+
+
+
+
+
+
+
+
Scroll up to the section about the type
+hierarchy to review the available data types
+
+
+
+
+
+
+
+
+
+
+
Weight is expressed on a continuous scale (real numbers). The R data
+type for this is “double” (also known as “numeric”).
+
The fourth row has the value “2.3 or 2.4”. That is not a single number but
+two numbers and an English word. Therefore, the “character” data type is
+chosen. The whole column is now text, because all values in the same
+column have to be the same data type.
+
+
+
+
+
+
+
+
+
+
+
4. Correct the problematic value
+
+
+
The code to assign a new weight value to the problematic fourth row
+is given. Think first and then execute it: What will be the data type
+after assigning a number like in this example? You can check the data
+type after executing to see if you were right.
+
+
+
+
+
+
+
+
+
Revisit the hierarchy of data types when two different data types are
+combined.
+
+
+
+
+
+
+
+
+
+
Challenge 1 (continued)
+
+
+
+
Solution to challenge 1.4
+
The data type of the column “weight” is “character”. The assigned
+data type is “double”. Combining two data types yields the data type
+that is higher in the following hierarchy:
+
logical < integer < double < complex < character
+
Therefore, the column is still of type character! We need to manually
+convert it to “double”.
+
+
+
5. Convert the column “weight” to the correct data type
+
+
Cat weights are numbers, but the column does not have this data type
+yet. Coerce the column to floating point numbers.
+
+
+
+
+
+
+
+
+
+
The functions to convert data types start with as.. You
+can look for the function further up in the manuscript or use the
+RStudio auto-complete function: Type “as.” and then press
+the TAB key.
+
+
+
+
+
+
+
+
+
+
Challenge 1 (continued)
+
+
+
+
Solution to Challenge 1.5
+
There are two functions that are synonymous for historic reasons:
To change a single element, use the bracket on the other side of the
+arrow:
+
+
R
+
+
+sequence_example[1]<-30
+sequence_example
+
+
+
OUTPUT
+
+
[1] 30 21 22 23 24 25
+
+
+
+
+
+
+
Challenge 2
+
+
+
Start by making a vector with the numbers 1 through 26. Then,
+multiply the vector by 2.
+
+
+
+
+
+
+
+
+
+
R
+
+
+x<-1:26
+x<-x*2
+
+
+
+
+
+
+
+
Lists
+
+
Another data structure you’ll want in your bag of tricks is the
+list. A list is simpler in some ways than the other types,
+because you can put anything you want in it. Remember everything in
+the vector must be of the same basic data type, but a list can have
+different data types:
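As a minimal sketch, a list like the one inspected in this section could be built as follows. The object name list_example comes from the text and the values are inferred from the str() output shown here; the exact creation call is an assumption:

```r
# A list can mix basic data types; the values are chosen to match the
# str() output shown in this section (number, character, logical, complex)
list_example <- list(1, "a", TRUE, 1+4i)

# Each element keeps its own type instead of being coerced to a common one
sapply(list_example, typeof)
```

A vector built from the same four values with c() would coerce all of them to a single type.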
When printing the object structure with str(), we see
+the data types of all elements:
+
+
R
+
+
+str(list_example)
+
+
+
OUTPUT
+
+
List of 4
+ $ : num 1
+ $ : chr "a"
+ $ : logi TRUE
+ $ : cplx 1+4i
+
+
What is the use of lists? They can organize data of different
+types. For example, you can organize different tables that
+belong together, similar to spreadsheets in Excel. But there are many
+other uses, too.
+
We will see another example that will maybe surprise you in the next
+chapter.
+
To retrieve one of the elements of a list, use the double
+bracket:
+
+
R
+
+
+list_example[[2]]
+
+
+
OUTPUT
+
+
[1] "a"
+
+
The elements of a list can also have names. You give a name by
+placing it before the value, separated by an equals
+sign:
+
+
R
+
+
+another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE)
+another_list
This results in a named list, which gives our object a new
+capability: we can access single elements in an additional
+way!
+
+
R
+
+
+another_list$title
+
+
+
OUTPUT
+
+
[1] "Numbers"
+
+
+
Names
+
+
+
With names, we can give meaning to elements. This is the first time
+that we have not only the data, but also information that
+explains it. It is metadata that can be attached to the object
+like a label. In R, this is called an attribute. Some
+attributes enable us to do more with our object, for example, as here,
+accessing an element by a self-defined name.
+
+
Accessing vectors and lists by name
+
+
We have already seen how to generate a named list. The way to
+generate a named vector is very similar. You have seen this function
+before:
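A sketch of such a named vector, built with c(). The names and the first price are taken from the outputs shown in this section; the other two prices are placeholder values assumed for illustration:

```r
# Named vector: each name is written before its value, separated by '='
# (the prices for pizzafresh and callapizza are assumed placeholder values)
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)
pizza_price
```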
The way to retrieve elements is different, though:
+
+
R
+
+
+pizza_price["pizzasubito"]
+
+
+
OUTPUT
+
+
pizzasubito
+5.64
+
+
The approach used for the list does not work:
+
+
R
+
+
+pizza_price$pizzafresh
+
+
+
ERROR
+
+
Error in pizza_price$pizzafresh: $ operator is invalid for atomic vectors
+
+
It pays to remember this error message, because you will meet it
+in your own analyses. It means that you have just tried accessing an
+element as if it were in a list, but it is actually in a vector.
+
+
+
Accessing and changing names
+
+
If you are only interested in the names, use the names()
+function:
+
+
R
+
+
+names(pizza_price)
+
+
+
OUTPUT
+
+
[1] "pizzasubito" "pizzafresh" "callapizza"
+
+
We have seen how to access and change single elements of a vector.
+The same is possible for names:
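A minimal sketch of reading and changing a single name. The vector is redefined here so the example runs on its own; two of the prices and the new name "call-a-pizza" are assumptions made for illustration:

```r
# Stand-in for the pizza_price vector (two of the prices are assumed values)
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)

# Read a single name by its position
names(pizza_price)[3]

# Change a single name by assigning to that position
names(pizza_price)[3] <- "call-a-pizza"
pizza_price
```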
What is the data type of the names of pizza_price? You
+can find out using the str() or typeof()
+functions.
+
+
+
+
+
+
+
+
+
+
You get the names of an object by wrapping the object name inside
+names(...). Similarly, you get the data type of the names
+by again wrapping the whole code in typeof(...):
+
typeof(names(pizza_price))
+
Alternatively, use a new variable if this is easier for you to
+read:
+
n <- names(pizza_price)
+typeof(n)
+
+
+
+
+
+
+
+
+
+
Challenge 4
+
+
+
Instead of just changing some of the names a vector/list already has,
+you can also set all names of an object by writing code like (replace
+ALL CAPS text):
+
names(OBJECT)<-CHARACTER_VECTOR
+
Create a vector that gives the number for each letter in the
+alphabet!
+
+
Generate a vector called letter_no with the sequence of
+numbers from 1 to 26!
+
R has a built-in object called LETTERS. It is a
+character vector of length 26, from A to Z. Set the names of the number sequence
+to these 26 letters.
+
Test yourself by calling letter_no["B"], which should
+give you the number 2!
+
+
+
+
+
+
+
+
+
+
letter_no <- 1:26 # or seq(1, 26)
+names(letter_no)<-LETTERS
+letter_no["B"]
+
+
+
+
+
+
Data frames
+
+
+
We met data frames at the very beginning of this lesson; they
+represent a table of data. We didn’t go into much detail with
+our example cat data frame:
We can now understand something a bit surprising in our data.frame;
+what happens if we run:
+
+
R
+
+
+typeof(cats)
+
+
+
OUTPUT
+
+
[1] "list"
+
+
We see that data.frames look like lists ‘under the hood’. Think back to
+what we learned about the uses of lists:
+
+
Lists organize data of different types
+
+
Columns of a data frame are vectors of different types, that are
+organized by belonging to the same table.
+
A data.frame is really a list of vectors. It is a special list in
+which all the vectors must have the same length.
+
How is this “special”-ness written into the object, so that R does
+not treat it like any other list, but as a table?
+
+
R
+
+
+class(cats)
+
+
+
OUTPUT
+
+
[1] "data.frame"
+
+
A class, just like names, is an attribute attached
+to the object. It tells us what this object means for humans.
+
You might wonder: Why do we need another
+what-type-of-object-is-this-function? We already have
+typeof()? That function tells us how the object is
+constructed in the computer. The class is
+the meaning of the object for humans. Consequently,
+what typeof() returns is fixed in R (mainly the
+five data types), whereas the output of class() is
+diverse and extendable by R packages.
+
In our cats example, we have a character, a double and a
+logical variable. As we have seen already, each column of a data.frame is
+a vector.
+
+
R
+
+
+cats$coat
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
+
R
+
+
+cats[,1]
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
+
R
+
+
+typeof(cats[,1])
+
+
+
OUTPUT
+
+
[1] "character"
+
+
+
R
+
+
+str(cats[,1])
+
+
+
OUTPUT
+
+
chr [1:3] "calico" "black" "tabby"
+
+
Each row is an observation of different variables, itself a
+data.frame, and thus can be composed of elements of different types.
There are several subtly different ways to call variables,
+observations and elements from data.frames:
+
+
cats[1]
+
cats[[1]]
+
cats$coat
+
cats["coat"]
+
cats[1, 1]
+
cats[, 1]
+
cats[1, ]
+
+
Try out these examples and explain what is returned by each one.
+
Hint: Use the function typeof() to examine what
+is returned in each case.
+
+
+
+
+
+
+
+
+
+
R
+
+
+cats[1]
+
+
+
OUTPUT
+
+
coat
+1 calico
+2 black
+3 tabby
+
+
We can think of a data frame as a list of vectors. The single bracket
+[1] returns the first slice of the list, as another list.
+In this case it is the first column of the data frame.
+
+
R
+
+
+cats[[1]]
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
The double bracket [[1]] returns the contents of the list
+item. In this case it is the contents of the first column, a
+vector of type character.
+
+
R
+
+
+cats$coat
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
This example uses the $ character to address items by
+name. coat is the first column of the data frame, again a
+vector of type character.
+
+
R
+
+
+cats["coat"]
+
+
+
OUTPUT
+
+
coat
+1 calico
+2 black
+3 tabby
+
+
Here we are using a single bracket ["coat"], replacing the
+index number with the column name. Like example 1, the returned object
+is a list.
+
+
R
+
+
+cats[1, 1]
+
+
+
OUTPUT
+
+
[1] "calico"
+
+
This example uses a single bracket, but this time we provide row and
+column coordinates. The returned object is the value in row 1, column 1.
+The object is a vector of type character.
+
+
R
+
+
+cats[, 1]
+
+
+
OUTPUT
+
+
[1] "calico" "black" "tabby"
+
+
Like the previous example, we use single brackets and provide row and
+column coordinates. The row coordinate is not specified, so R interprets
+this missing value as all the elements in this column and
+returns them as a vector.
+
+
R
+
+
+cats[1, ]
+
+
+
OUTPUT
+
+
coat weight likes_string
+1 calico 2.1 TRUE
+
+
Again we use the single bracket with row and column coordinates. The
+column coordinate is not specified. The return value is a list
+containing all the values in the first row.
+
+
+
+
+
+
+
+
+
+
Tip: Renaming data frame columns
+
+
+
Data frames have column names, which can be accessed with the
+names() function.
+
+
R
+
+
+names(cats)
+
+
+
OUTPUT
+
+
[1] "coat" "weight" "likes_string"
+
+
If you want to rename the second column of cats, you can
+assign a new name to the second element of names(cats).
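A minimal sketch of such a rename. The small data frame stands in for cats, using values from the outputs in this section; the new name "weight_kg" is only an illustrative choice:

```r
# Stand-in for the cats data frame (values taken from the outputs shown here)
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(TRUE, FALSE, TRUE),
                   stringsAsFactors = FALSE)

# Assign a new name to the second element of names(cats)
names(cats)[2] <- "weight_kg"
names(cats)
```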
Because a matrix is a vector with added dimension attributes,
+length gives you the total number of elements in the
+matrix.
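To illustrate, here is a small sketch with a matrix whose dimensions are chosen arbitrarily (the name matrix_example and the 3 x 6 shape are assumptions for this example):

```r
# A 3 x 6 matrix of zeros
matrix_example <- matrix(0, ncol = 6, nrow = 3)

# dim() reports rows and columns; length() counts every element
dim(matrix_example)     # 3 6
length(matrix_example)  # 18, i.e. 3 * 6
```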
+
+
+
+
+
+
+
+
+
+
Challenge 7
+
+
+
Make another matrix, this time containing the numbers 1:50, with 5
+columns and 10 rows. Did the matrix function fill your
+matrix by column, or by row, as its default behaviour? See if you can
+figure out how to change this. (hint: read the documentation for
+matrix!)
+
+
+
+
+
+
+
+
+
+
+
R
+
+
+x <- matrix(1:50, ncol = 5, nrow = 10)
+x <- matrix(1:50, ncol = 5, nrow = 10, byrow = TRUE) # to fill by row
+
+
+
+
+
+
+
+
+
+
+
Challenge 8
+
+
+
Create a list of length two containing a character vector for each of
+the sections in this part of the workshop:
+
+
Data types
+
Data structures
+
+
Populate each character vector with the names of the data types and
+data structures we’ve seen so far.
Note: it’s nice to make a list in big writing on the board or taped
+to the wall listing all of these types and structures - leave it up for
+the rest of the workshop to remind people of the importance of these
+basics.
+
+
+
+
+
+
+
+
+
+
Challenge 9
+
+
+
Consider the R output of the matrix below:
+
+
OUTPUT
+
+
[,1] [,2]
+[1,] 4 1
+[2,] 9 5
+[3,] 10 7
+
+
What was the correct command used to write this matrix? Examine each
+command and try to figure out the correct one before typing them. Think
+about what matrices the other commands will produce.
Display basic properties of data frames including size and class of
+the columns, names, and first few rows.
+
+
+
+
+
+
+
At this point, you’ve seen it all: in the last lesson, we toured all
+the basic data types and data structures in R. Everything you do will be
+a manipulation of those tools. But most of the time, the star of the
+show is the data frame—the table that we created by loading information
+from a csv file. In this lesson, we’ll learn a few more things about
+working with data frames.
+
Adding columns and rows in data frames
+
+
+
We already learned that the columns of a data frame are vectors, so
+that our data are consistent in type throughout the columns. As such, if
+we want to add a new column, we can start by making a new vector:
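A minimal sketch of this step, assuming a three-row cats data frame like the one printed in this section (the data frame is rebuilt here so the example is self-contained):

```r
# Stand-in for the cats data frame
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1),
                   stringsAsFactors = FALSE)

# The new column is a vector with one value per row
age <- c(2, 3, 5)

# cbind() attaches it as a new column named after the vector
cats <- cbind(cats, age)
cats
```

The new vector must have as many elements as the data frame has rows; otherwise cbind() stops with an error.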
coat weight likes_string age
+1 calico 2.1 1 2
+2 black 5.0 0 3
+3 tabby 3.2 1 5
+
+
Notice the comma with nothing after it to indicate that we want to
+drop the entire fourth row.
+
Note: we could also remove several rows at once by putting the row
+numbers inside of a vector, for example:
+cats[c(-3,-4), ]
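A short sketch of both forms. The fourth row of this stand-in data frame is a hypothetical entry added so that there is something to drop:

```r
# Stand-in for cats with a hypothetical fourth row
cats <- data.frame(coat = c("calico", "black", "tabby", "tortoiseshell"),
                   weight = c(2.1, 5.0, 3.2, 3.3),
                   likes_string = c(1, 0, 1, 1),
                   age = c(2, 3, 5, 9),
                   stringsAsFactors = FALSE)

# Drop the fourth row: negative row index, nothing after the comma
cats[-4, ]

# Drop rows 3 and 4 at once
cats[c(-3, -4), ]
```

Neither call changes cats itself; assign the result back to a variable to keep it.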
+
Removing columns
+
+
+
We can also remove columns in our data frame. What if we want to
+remove the column “age”? We can remove it in two ways: by column
+number or by column name.
Notice the comma with nothing before it, indicating we want to keep
+all of the rows.
+
Alternatively, we can drop the column by using its name and the
+%in% operator. The %in% operator goes through
+each element of its left argument, in this case the names of
+cats, and asks, “Does this element occur in the second
+argument?”
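Both approaches can be sketched as follows, using a stand-in cats data frame with an age column (the helper variable name drop is an assumption):

```r
# Stand-in for cats including the age column
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1),
                   age = c(2, 3, 5),
                   stringsAsFactors = FALSE)

# By position: nothing before the comma keeps all rows, -4 drops column 4
cats[, -4]

# By name: build a logical vector marking the columns to drop ...
drop <- names(cats) %in% c("age")
drop

# ... and keep every column that is NOT marked
cats[, !drop]
```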
The key to remember when adding data to a data frame is that
+columns are vectors and rows are lists. We can also glue two
+data frames together with rbind:
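A minimal sketch of both uses of rbind. The new row is a hypothetical entry, and the name cats_doubled is chosen only for this example:

```r
# Stand-in for the cats data frame
cats <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   likes_string = c(1, 0, 1),
                   stringsAsFactors = FALSE)

# A row is a list: one value per column, possibly of different types
new_row <- list("tortoiseshell", 3.3, 1)
cats <- rbind(cats, new_row)

# Two data frames with the same columns can be stacked in the same way
cats_doubled <- rbind(cats, cats)
nrow(cats_doubled)
```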
You can create a new data frame right from within R with the
+following syntax:
+
+
R
+
+
+df<-data.frame(id =c("a", "b", "c"),
+ x =1:3,
+ y =c(TRUE, TRUE, FALSE))
+
+
Make a data frame that holds the following information for
+yourself:
+
+
first name
+
last name
+
lucky number
+
+
Then use rbind to add an entry for the people sitting
+beside you. Finally, use cbind to add a column with each
+person’s answer to the question, “Is it time for coffee break?”
So far, you have seen the basics of manipulating data frames with our
+cat data; now let’s use those skills to digest a more realistic dataset.
+Let’s read in the gapminder dataset that we downloaded
+previously:
+
+
R
+
+
+gapminder<-read.csv("data/gapminder_data.csv")
+
+
+
+
+
+
+
Miscellaneous Tips
+
+
+
+
Another type of file you might encounter is the tab-separated value
+file (.tsv). To specify a tab as a separator, use sep = "\t" or
+read.delim().
+
Files can also be downloaded directly from the Internet into a
+local folder of your choice onto your computer using the
+download.file function. The read.csv function
+can then be executed to read the downloaded file from the download
+location, for example,
Alternatively, you can also read in files directly into R from the
+Internet by replacing the file paths with a web address in
+read.csv. One should note that in doing this no local copy
+of the csv file is first saved onto your computer. For example,
You can read directly from Excel spreadsheets without converting
+them to plain text first by using the readxl
+package.
+
The argument “stringsAsFactors” tells R whether to
+read strings as factors or as character strings. In R 4.0.0
+and later, strings are read in as characters by default, but in
+earlier versions of R, strings are read in as factors by default. For
+more information, see the call-out in the
+previous episode.
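As a small sketch of the tab-separated tip above, the following writes a tiny file to a temporary location and reads it back in two ways. The file contents are illustrative values, not part of the lesson data:

```r
# Write a tiny tab-separated file to a temporary location (illustrative values)
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c("country\tyear\tpop", "Senegal\t2007\t12267493"), tsv_file)

# Read it with read.csv() and an explicit tab separator ...
via_csv <- read.csv(tsv_file, sep = "\t")

# ... or with read.delim(), which uses a tab separator by default
via_delim <- read.delim(tsv_file)

identical(via_csv, via_delim)
```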
+
+
+
+
+
Let’s investigate gapminder a bit; the first thing we should always
+do is check out what the data looks like with str:
+
+
R
+
+
+str(gapminder)
+
+
+
OUTPUT
+
+
'data.frame': 1704 obs. of 6 variables:
+ $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp : num 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num 779 821 853 836 740 ...
+
+
An additional method for examining the structure of gapminder is to
+use the summary function. This function can be used on
+various objects in R. For data frames, summary yields a
+numeric, tabular, or descriptive summary of each column. Numeric or
+integer columns are described by descriptive statistics (quartiles
+and mean), and character columns by their length, class, and mode.
+
+
R
+
+
+summary(gapminder)
+
+
+
OUTPUT
+
+
country year pop continent
+ Length:1704 Min. :1952 Min. :6.001e+04 Length:1704
+ Class :character 1st Qu.:1966 1st Qu.:2.794e+06 Class :character
+ Mode :character Median :1980 Median :7.024e+06 Mode :character
+ Mean :1980 Mean :2.960e+07
+ 3rd Qu.:1993 3rd Qu.:1.959e+07
+ Max. :2007 Max. :1.319e+09
+ lifeExp gdpPercap
+ Min. :23.60 Min. : 241.2
+ 1st Qu.:48.20 1st Qu.: 1202.1
+ Median :60.71 Median : 3531.8
+ Mean :59.47 Mean : 7215.3
+ 3rd Qu.:70.85 3rd Qu.: 9325.5
+ Max. :82.60 Max. :113523.1
+
+
Along with the str and summary functions,
+we can examine individual columns of the data frame with our
+typeof function:
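A minimal sketch, using a two-row stand-in for gapminder so that it runs without the data file (the name gapminder_mini is an assumption; the values are copied from the head() output in this lesson):

```r
# Two-row stand-in with the same column types that str(gapminder) reports
gapminder_mini <- data.frame(country = c("Afghanistan", "Afghanistan"),
                             year = c(1952L, 1957L),
                             lifeExp = c(28.801, 30.332),
                             stringsAsFactors = FALSE)

# Examine individual columns
typeof(gapminder_mini$year)     # "integer"
typeof(gapminder_mini$lifeExp)  # "double"
typeof(gapminder_mini$country)  # "character"
str(gapminder_mini$country)
```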
We can also interrogate the data frame for information about its
+dimensions; remembering that str(gapminder) said there were
+1704 observations of 6 variables in gapminder, what do you think the
+following will produce, and why?
+
+
R
+
+
+length(gapminder)
+
+
+
OUTPUT
+
+
[1] 6
+
+
A fair guess would have been to say that the length of a data frame
+would be the number of rows it has (1704), but this is not the case;
+remember, a data frame is a list of vectors and factors:
+
+
R
+
+
+typeof(gapminder)
+
+
+
OUTPUT
+
+
[1] "list"
+
+
When length gave us 6, it’s because gapminder is built
+out of a list of 6 columns. To get the number of rows and columns in our
+dataset, try:
+
+
R
+
+
+nrow(gapminder)
+
+
+
OUTPUT
+
+
[1] 1704
+
+
+
R
+
+
+ncol(gapminder)
+
+
+
OUTPUT
+
+
[1] 6
+
+
Or, both at once:
+
+
R
+
+
+dim(gapminder)
+
+
+
OUTPUT
+
+
[1] 1704 6
+
+
We’ll also likely want to know what the titles of all the columns
+are, so we can ask for them later:
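A sketch of asking for the column titles, again on a one-row stand-in (gapminder_mini is a hypothetical name; with the real data you would pass gapminder instead):

```r
# One-row stand-in with the same columns as gapminder
gapminder_mini <- data.frame(country = "Afghanistan", year = 1952L,
                             pop = 8425333, continent = "Asia",
                             lifeExp = 28.801, gdpPercap = 779.4453,
                             stringsAsFactors = FALSE)

# colnames() returns the column titles as a character vector
colnames(gapminder_mini)
```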
At this stage, it’s important to ask ourselves if the structure R is
+reporting matches our intuition or expectations; do the basic data types
+reported for each column make sense? If not, we need to sort any
+problems out now before they turn into bad surprises down the road,
+using what we’ve learned about how R interprets data, and the importance
+of strict consistency in how we record our data.
+
Once we’re happy that the data types and structures seem reasonable,
+it’s time to start digging into our data proper. Check out the first few
+lines:
+
+
R
+
+
+head(gapminder)
+
+
+
OUTPUT
+
+
country year pop continent lifeExp gdpPercap
+1 Afghanistan 1952 8425333 Asia 28.801 779.4453
+2 Afghanistan 1957 9240934 Asia 30.332 820.8530
+3 Afghanistan 1962 10267083 Asia 31.997 853.1007
+4 Afghanistan 1967 11537966 Asia 34.020 836.1971
+5 Afghanistan 1972 13079460 Asia 36.088 739.9811
+6 Afghanistan 1977 14880372 Asia 38.438 786.1134
+
+
+
+
+
+
+
Challenge 2
+
+
+
It’s good practice to also check the last few lines of your data and
+some in the middle. How would you do this?
+
Searching for ones specifically in the middle isn’t too hard, but we
+could ask for a few lines at random. How would you code this?
+
+
+
+
+
+
+
+
+
Checking the last few lines is relatively simple, as R already has a
+function for this:
+
+
R
+
+
+tail(gapminder)
+tail(gapminder, n =15)
+
+
What about a few arbitrary rows just in case something is odd in the
+middle?
+
+
Tip: There are several ways to achieve this.
+
+
The solution here presents one form of using nested functions, i.e. a
+function passed as an argument to another function. This might sound
+like a new concept, but you are already using it! Remember
+my_dataframe[rows, cols] will print to screen your data frame with the
+number of rows and columns you asked for (although you might have asked
+for a range or named columns for example). How would you get the last
+row if you don’t know how many rows your data frame has? R has a
+function for this. What about getting a (pseudorandom) sample? R also
+has a function for this.
+
+
R
+
+
+gapminder[sample(nrow(gapminder), 5), ]
+
+
+
+
+
+
+
To make sure our analysis is reproducible, we should put the code
+into a script file so we can come back to it later.
+
+
+
+
+
+
Challenge 3
+
+
+
Go to File -> New File -> R Script, and write an R script to
+load in the gapminder dataset. Put it in the scripts/
+directory and add it to version control.
+
Run the script using the source function, using the file
+path as its argument (or by pressing the “source” button in
+RStudio).
+
+
+
+
+
+
+
+
+
The source function can be used to run one script from within
+another. Suppose you load the same type of file over and
+over again, each time specifying the same arguments to fit the
+needs of your file. Instead of writing the necessary arguments again and
+again, you could write them once and save them as a script. Then, you
+can use source("Your_Script_containing_the_load_function")
+in a new script to reuse the code of that script without writing
+everything again. Check out ?source to find out more.
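The idea can be sketched with a throwaway script written to a temporary file. The file and the variable answer are stand-ins invented for this example; in practice the script would live in your scripts/ directory:

```r
# Write a one-line script to a temporary file
script_path <- tempfile(fileext = ".R")
writeLines("answer <- 6 * 7", script_path)

# source() runs the script; objects it creates appear in the workspace
source(script_path)
answer
```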
To run the script and load the data into the gapminder
+variable:
+
+
R
+
+
+source(file ="scripts/load-gapminder.R")
+
+
+
+
+
+
+
+
+
+
+
Challenge 4
+
+
+
Read the output of str(gapminder) again; this time, use
+what you’ve learned about lists and vectors, as well as the output of
+functions like colnames and dim to explain
+what everything that str prints out for gapminder means. If
+there are any parts you can’t interpret, discuss with your
+neighbors!
+
+
+
+
+
+
+
+
+
The object gapminder is a data frame with the following columns:
+
+
+country and continent are character
+strings.
+
+year is an integer vector.
+
+pop, lifeExp, and gdpPercap
+are numeric vectors.
+
+
+
+
+
+
+
+
+
+
+
Keypoints
+
+
+
+
Use cbind() to add a new column to a data frame.
+
Use rbind() to add a new row to a data frame.
+
Remove rows from a data frame.
+
Use str(), summary(), nrow(),
+ncol(), dim(), colnames(),
+rownames(), head(), and typeof()
+to understand the structure of a data frame.
+
Read in a csv file using read.csv().
+
Understand what length() of a data frame
+represents.
In R, simple vectors containing character strings, numbers, or
+logical values are called atomic vectors because they can’t be
+further simplified.
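A sketch of a small named atomic vector to experiment with. The values and names match the outputs shown in the examples that follow; the exact way the vector was originally created is an assumption:

```r
# A numeric vector with five elements
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)

# Attach a name to each element
names(x) <- c("a", "b", "c", "d", "e")

is.atomic(x)
x
```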
+
+
+
+
So now that we’ve created a dummy vector to play with, how do we get
+at its contents?
+
Accessing elements using their indices
+
+
+
To extract elements of a vector we can give their corresponding
+index, starting from one:
+
+
R
+
+
+x[1]
+
+
+
OUTPUT
+
+
a
+5.4
+
+
+
R
+
+
+x[4]
+
+
+
OUTPUT
+
+
d
+4.8
+
+
It may look different, but the square brackets operator is a
+function. For vectors (and matrices), it means “get me the nth
+element”.
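To see that the square bracket really is a function, it can be called by its name in backticks. This is only a sketch to illustrate the point, not something you would normally write:

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# The usual operator form
x[4]

# The same call written as an ordinary function call
`[`(x, 4)
```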
+
We can ask for multiple elements at once:
+
+
R
+
+
+x[c(1, 3)]
+
+
+
OUTPUT
+
+
a c
+5.4 7.1
+
+
Or slices of the vector:
+
+
R
+
+
+x[1:4]
+
+
+
OUTPUT
+
+
a b c d
+5.4 6.2 7.1 4.8
+
+
The : operator creates a sequence of numbers from the
+left element to the right.
+
+
R
+
+
+1:4
+
+
+
OUTPUT
+
+
[1] 1 2 3 4
+
+
+
R
+
+
+c(1, 2, 3, 4)
+
+
+
OUTPUT
+
+
[1] 1 2 3 4
+
+
We can ask for the same element multiple times:
+
+
R
+
+
+x[c(1,1,3)]
+
+
+
OUTPUT
+
+
a a c
+5.4 5.4 7.1
+
+
If we ask for an index beyond the length of the vector, R will return
+a missing value:
+
+
R
+
+
+x[6]
+
+
+
OUTPUT
+
+
<NA>
+ NA
+
+
This is a vector of length one containing an NA, whose
+name is also NA.
+
If we ask for the 0th element, we get an empty vector:
+
+
R
+
+
+x[0]
+
+
+
OUTPUT
+
+
named numeric(0)
+
+
+
+
+
+
+
Vector numbering in R starts at 1
+
+
+
In many programming languages (C and Python, for example), the first
+element of a vector has an index of 0. In R, the first element is 1.
+
+
+
+
Skipping and removing elements
+
+
+
If we use a negative number as the index of a vector, R will return
+every element except for the one specified:
+
+
R
+
+
+x[-2]
+
+
+
OUTPUT
+
+
a c d e
+5.4 7.1 4.8 7.5
+
+
We can skip multiple elements:
+
+
R
+
+
+x[c(-1, -5)] # or x[-c(1, 5)]
+
+
+
OUTPUT
+
+
b c d
+6.2 7.1 4.8
+
+
+
+
+
+
+
Tip: Order of operations
+
+
+
A common stumbling block for novices arises when trying to skip slices of a
+vector. It’s natural to try to negate a sequence like so:
+
+
R
+
+
+x[-1:3]
+
+
This gives a somewhat cryptic error:
+
+
ERROR
+
+
Error in x[-1:3]: only 0's may be mixed with negative subscripts
+
+
But remember the order of operations. : is really a
+function. It takes its first argument as -1, and its second as 3, so
+generates the sequence of numbers: c(-1, 0, 1, 2, 3).
+
The correct solution is to wrap that function call in brackets, so
+that the - operator applies to the result:
+
+
R
+
+
+x[-(1:3)]
+
+
+
OUTPUT
+
+
d e
+4.8 7.5
+
+
+
+
+
To remove elements from a vector, we need to assign the result back
+into the variable:
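A minimal sketch, assuming the named vector x used throughout this section:

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# This only prints a copy without the fourth element; x is unchanged
x[-4]

# Assigning the result back actually removes the element from x
x <- x[-4]
x
```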
Come up with at least 2 different commands that will produce the
+following output:
+
+
OUTPUT
+
+
b c d
+6.2 7.1 4.8
+
+
After you find 2 different commands, compare notes with your
+neighbour. Did you have different strategies?
+
+
+
+
+
+
+
+
+
+
R
+
+
+x[2:4]
+
+
+
OUTPUT
+
+
b c d
+6.2 7.1 4.8
+
+
+
R
+
+
+x[-c(1,5)]
+
+
+
OUTPUT
+
+
b c d
+6.2 7.1 4.8
+
+
+
R
+
+
+x[c(2,3,4)]
+
+
+
OUTPUT
+
+
b c d
+6.2 7.1 4.8
+
+
+
+
+
+
Subsetting by name
+
+
+
We can extract elements by using their name, instead of extracting by
+index:
+
+
R
+
+
+x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5) # we can name a vector 'on the fly'
+x[c("a", "c")]
+
+
+
OUTPUT
+
+
a c
+5.4 7.1
+
+
This is usually a much more reliable way to subset objects: the
+position of various elements can often change when chaining together
+subsetting operations, but the names will always remain the same!
+
Subsetting through other logical operations
+
+
+
We can also use any logical vector to subset:
+
+
R
+
+
+x[c(FALSE, FALSE, TRUE, FALSE, TRUE)]
+
+
+
OUTPUT
+
+
c e
+7.1 7.5
+
+
Since comparison operators (e.g. >,
+<, ==) evaluate to logical vectors, we can
+also use them to succinctly subset vectors: the following statement
+gives the same result as the previous one.
+
+
R
+
+
+x[x>7]
+
+
+
OUTPUT
+
+
c e
+7.1 7.5
+
+
Breaking it down, this statement first evaluates x>7,
+generating a logical vector
+c(FALSE, FALSE, TRUE, FALSE, TRUE), and then selects the
+elements of x corresponding to the TRUE
+values.
+
We can use == to mimic the previous method of indexing
+by name (remember you have to use == rather than
+= for comparisons):
+
+
R
+
+
+x[names(x)=="a"]
+
+
+
OUTPUT
+
+
a
+5.4
+
+
+
+
+
+
+
Tip: Combining logical conditions
+
+
+
We often want to combine multiple logical criteria. For example, we
+might want to find all the countries that are located in Asia
+or Europe and have life expectancies
+within a certain range. Several operations for combining logical vectors
+exist in R:
+
+
+&, the “logical AND” operator: returns
+TRUE if both the left and right are TRUE.
+
+|, the “logical OR” operator: returns
+TRUE, if either the left or right (or both) are
+TRUE.
+
+
You may sometimes see && and ||
+instead of & and |. These two-character
+operators work on single values rather than whole vectors: older versions of R
+looked only at the first element of each vector and ignored the
+remaining elements, and since R 4.3.0 passing a longer vector is an error.
+In general you should not use the two-character
+operators in data analysis; save them for programming, i.e. deciding
+whether to execute a statement.
+
+
+!, the “logical NOT” operator: converts
+TRUE to FALSE and FALSE to
+TRUE. It can negate a single logical condition (e.g.
+!TRUE becomes FALSE), or a whole vector of
+conditions (e.g. !c(TRUE, FALSE) becomes
+c(FALSE, TRUE)).
+
+
Additionally, you can compare the elements within a single vector
+using the all function (which returns TRUE if
+every element of the vector is TRUE) and the
+any function (which returns TRUE if one or
+more elements of the vector are TRUE).
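The operators described above can be sketched on the named vector x from this section (the thresholds are arbitrary values picked for illustration):

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# AND: both conditions must hold (keeps b and c)
x[x > 6 & x < 7.4]

# OR: at least one condition must hold (keeps d and e)
x[x < 5 | x > 7.2]

# NOT: flips every TRUE/FALSE (keeps a, b and d)
x[!(x > 7)]

# all() and any() collapse a logical vector to a single value
all(x > 4)  # TRUE: every element is above 4
any(x > 8)  # FALSE: no element is above 8
```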
Write a subsetting command to return the values in x that are greater
+than 4 and less than 7.
+
+
+
+
+
+
+
+
+
+
R
+
+
+x_subset <- x[x < 7 & x > 4]
+print(x_subset)
+
+
+
OUTPUT
+
+
a b d
+5.4 6.2 4.8
+
+
+
+
+
+
+
+
+
+
+
Tip: Non-unique names
+
+
+
You should be aware that it is possible for multiple elements in a
+vector to have the same name. (For a data frame, columns can have the
+same name — although R tries to avoid this — but row names must be
+unique.) Consider these examples:
+
+
R
+
+
+x<-1:3
+x
+
+
+
OUTPUT
+
+
[1] 1 2 3
+
+
+
R
+
+
+names(x)<-c('a', 'a', 'a')
+x
+
+
+
OUTPUT
+
+
a a a
+1 2 3
+
+
+
R
+
+
+x['a'] # only returns first value
+
+
+
OUTPUT
+
+
a
+1
+
+
+
R
+
+
+x[names(x) == 'a'] # returns all three values
+
+
+
OUTPUT
+
+
a a a
+1 2 3
+
+
+
+
+
+
+
+
+
+
Tip: Getting help for operators
+
+
+
Remember you can search for help on operators by wrapping them in
+quotes: help("%in%") or ?"%in%".
+
+
+
+
Skipping named elements
+
+
+
Skipping or removing named elements is a little harder. If we try to
+skip one named element by negating the string, R complains (slightly
+obscurely) that it doesn’t know how to take the negative of a
+string:
+
+
R
+
+
+x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5) # we start again by naming a vector 'on the fly'
+x[-"a"]
+
+
+
ERROR
+
+
Error in -"a": invalid argument to unary operator
+
+
However, we can use the != (not-equals) operator to
+construct a logical vector that will do what we want:
+
+
R
+
+
+x[names(x)!="a"]
+
+
+
OUTPUT
+
+
b c d e
+6.2 7.1 4.8 7.5
+
+
Skipping multiple named indices is a little bit harder still. Suppose
+we want to drop the "a" and "c" elements, so
+we try this:
+
+
R
+
+
+x[names(x)!=c("a","c")]
+
+
+
WARNING
+
+
Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length
+
+
+
OUTPUT
+
+
b c d e
+6.2 7.1 4.8 7.5
+
+
R did something, but it gave us a warning that we ought to
+pay attention to - and it apparently gave us the wrong answer
+(the "c" element is still included in the vector)!
+
So what does != actually do in this case? That’s an
+excellent question.
+
+
Recycling
+
+
Let’s take a look at the comparison component of this code:
+
+
R
+
+
+names(x)!=c("a", "c")
+
+
+
WARNING
+
+
Warning in names(x) != c("a", "c"): longer object length is not a multiple of
+shorter object length
+
+
+
OUTPUT
+
+
[1] FALSE TRUE TRUE TRUE TRUE
+
+
Why does R give TRUE as the third element of this
+vector, when names(x)[3] != "c" is obviously false? When
+you use !=, R tries to compare each element of the left
+argument with the corresponding element of its right argument. What
+happens when you compare vectors of different lengths?
+
When one vector is shorter than the other, it gets
+recycled:
+
In this case R repeats c("a", "c") as
+many times as necessary to match names(x), i.e. we get
+c("a","c","a","c","a"). Since the recycled "a"
+doesn’t match the third element of names(x), the value of
+!= is TRUE. Because in this case the longer
+vector length (5) isn’t a multiple of the shorter vector length (2), R
+printed a warning message. If we had been unlucky and
+names(x) had contained six elements, R would
+silently have done the wrong thing (i.e., not what we intended
+it to do). This recycling rule can introduce hard-to-find and subtle
+bugs!
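The recycling described above can be made visible with rep(). This is a sketch of what R does implicitly; the variables recycled and y are names invented for this example:

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# Spell out the recycled right-hand side explicitly
recycled <- rep(c("a", "c"), length.out = length(x))
recycled  # "a" "c" "a" "c" "a"

# Element-by-element comparison against the recycled vector
names(x) != recycled  # FALSE TRUE TRUE TRUE TRUE

# With six elements the lengths divide evenly, so R recycles without a warning
y <- c(a = 1, b = 2, c = 3, d = 4, e = 5, f = 6)
names(y) != c("a", "c")
```

In the six-element case the "c" element is still kept, and nothing alerts you to it.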
+
The way to get R to do what we really want (match each
+element of the left argument with all of the elements of the
+right argument) is to use the %in% operator. The
+%in% operator goes through each element of its left
+argument, in this case the names of x, and asks, “Does this
+element occur in the second argument?”. Here, since we want to
+exclude values, we also need a ! operator to
+change “in” to “not in”:
+
+
R
+
+
+x[!names(x)%in%c("a","c")]
+
+
+
OUTPUT
+
+
b d e
+6.2 4.8 7.5
+
+
+
+
+
+
+
Challenge 3
+
+
+
Selecting elements of a vector that match any of a list of components
+is a very common data analysis task. For example, the gapminder data set
+contains country and continent variables, but
+no information between these two scales. Suppose we want to pull out
+information from southeast Asia: how do we set up an operation to
+produce a logical vector that is TRUE for all of the
+countries in southeast Asia and FALSE otherwise?
+
Suppose you have these data:
+
+
R
+
+
+seAsia <- c("Myanmar", "Thailand", "Cambodia", "Vietnam", "Laos")
+## read in the gapminder data that we downloaded in episode 2
+gapminder <- read.csv("data/gapminder_data.csv", header = TRUE)
+## extract the `country` column from a data frame (we'll see this later);
+## convert from a factor to a character;
+## and get just the non-repeated elements
+countries <- unique(as.character(gapminder$country))
+
+
There’s a wrong way (using only ==), which will give you
+a warning; a clunky way (using the logical operators == and
+|); and an elegant way (using %in%). See
+whether you can come up with all three and explain how they (don’t)
+work.
+
+
+
+
+
+
+
+
+
+
The wrong way to do this problem is
+countries==seAsia. This gives a warning
+("In countries == seAsia : longer object length is not a multiple of shorter object length")
+and the wrong answer (a vector of all FALSE values),
+because none of the recycled values of seAsia happen to
+line up correctly with matching values in countries.
+
The clunky (but technically correct) way to do this
+problem is to compare countries against each name in turn and
+combine the results with |
+(countries==seAsia[1] | countries==seAsia[2] | ...).
+This gives the correct values, but hopefully you can see how awkward it
+is (what if we wanted to select countries from a much longer list?).
+
+
The best way to do this problem is
+countries %in% seAsia, which is both correct and easy to
+type (and read).
+
+
+
+
+
+
+
Handling special values
+
+
+
At some point you will encounter functions in R that cannot handle
+missing, infinite, or undefined data.
+
There are a number of special functions you can use to filter out
+this data:
+
+
+is.na will return all positions in a vector, matrix, or
+data.frame containing NA (or NaN)
+
Likewise, is.nan and is.infinite will do
+the same for NaN and Inf.
+
+is.finite will return all positions in a vector,
+matrix, or data.frame that do not contain NA,
+NaN or Inf.
+
+na.omit will filter out all missing values from a
+vector
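The functions listed above can be tried on a small vector. This is only a sketch: the vector v and the result names finite_only and no_missing are made up for illustration.

```r
# Made-up vector containing each kind of special value
v <- c(1.5, NA, Inf, NaN, -2)

is.na(v)        # TRUE for NA and also for NaN
is.nan(v)       # TRUE only for NaN
is.infinite(v)  # TRUE only for Inf (or -Inf)
is.finite(v)    # TRUE only for ordinary numbers

# Use the logical vectors to filter
finite_only <- v[is.finite(v)]  # keeps 1.5 and -2
no_missing  <- na.omit(v)       # drops NA and NaN, but keeps Inf
```

Note that na.omit() attaches extra attributes recording which positions were removed; wrapping the result in as.numeric() gives back a plain vector.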
+
Factor subsetting
+
+
+
Now that we’ve explored the different ways to subset vectors, how do
+we subset the other data structures?
+
Factor subsetting works the same way as vector subsetting.
Unlike vectors, if we try to access a row or column outside of the
+matrix, R will throw an error:
+
+
R
+
+
+m[, c(3,6)]
+
+
+
ERROR
+
+
Error in m[, c(3, 6)]: subscript out of bounds
+
+
+
+
+
+
+
Tip: Higher dimensional arrays
+
+
+
When dealing with multi-dimensional arrays, each argument to
+[ corresponds to a dimension. For example, for a 3D array, the
+three arguments correspond to the rows, columns, and depth
+dimension.
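As a small illustration of the tip above, here is a sketch using a made-up 2 x 3 x 4 array (the name arr and its contents are assumptions, not part of the lesson data):

```r
# Made-up 3D array: 2 rows, 3 columns, 4 slices, filled column by column
arr <- array(1:24, dim = c(2, 3, 4))

arr[2, 3, 4]   # one element: row 2, column 3, slice 4
arr[, , 1]     # leaving arguments empty keeps the whole first slice (a 2 x 3 matrix)
```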
+
+
+
+
Because matrices are vectors, we can also subset using only one
+argument:
+
+
R
+
+
+m[5]
+
+
+
OUTPUT
+
+
[1] 0.3295078
+
+
This usually isn’t useful, and is often confusing to read. However, it is
+useful to note that matrices are laid out in column-major
+format by default. That is, the elements of the vector are arranged
+column-wise:
+
+
R
+
+
+matrix(1:6, nrow=2, ncol=3)
+
+
+
OUTPUT
+
+
     [,1] [,2] [,3]
+[1,]    1    3    5
+[2,]    2    4    6
+
+
If you wish to populate the matrix by row, use
+byrow=TRUE:
+
+
R
+
+
+matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
+
+
+
OUTPUT
+
+
     [,1] [,2] [,3]
+[1,]    1    2    3
+[2,]    4    5    6
+
+
Matrices can also be subsetted using their rownames and column names
+instead of their row and column indices.
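A minimal sketch of name-based subsetting, assuming a hypothetical matrix called scores with made-up row and column names (this is not the matrix m used elsewhere in this lesson):

```r
# Hypothetical named matrix, filled column by column with 1..6
scores <- matrix(1:6, nrow = 2, ncol = 3,
                 dimnames = list(c("r1", "r2"), c("a", "b", "c")))

scores["r2", "c"]       # a single element, the same as scores[2, 3]
scores[, c("a", "c")]   # columns "a" and "c" for every row
```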
Which of the following commands will extract the values 11 and
+14?
+
+
A. m[2,4,2,5]
+
B. m[2:5]
+
C. m[4:5,2]
+
D. m[2,c(4,5)]
+
+
+
+
+
+
+
+
+
D
+
+
+
+
+
List subsetting
+
+
+
Now we’ll introduce some new subsetting operators. There are three
+functions used to subset lists. We’ve already seen these when learning
+about atomic vectors and matrices: [, [[, and
+$.
+
Using [ will always return a list. If you want to
+subset a list, but not extract an element, then you
+will likely use [.
+
+
R
+
+
+xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))
+xlist[1]
+
+
+
OUTPUT
+
+
$a
+[1] "Software Carpentry"
+
+
This returns a list with one element.
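To contrast the three operators named above, here is a short sketch using the same xlist; the result names sub_list, element and by_name are made up for illustration.

```r
xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))

sub_list <- xlist[2]     # [ keeps the list wrapper: a list of length 1
element  <- xlist[[2]]   # [[ extracts the element itself: an integer vector
by_name  <- xlist$b      # $ is shorthand for extracting by name, like xlist[["b"]]

class(sub_list)
class(element)
```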
+
We can subset elements of a list exactly the same way as atomic
+vectors using [. Comparison operations, however, won’t work
+because they’re not recursive: they will try to condition on the data
+structures in each element of the list, not the individual elements
+within those data structures.
+xlist<-list(a ="Software Carpentry", b =1:10, data =head(mtcars))
+
+
Using your knowledge of both list and vector subsetting, extract the
+number 2 from xlist. Hint: the number 2 is contained within the “b” item
+in the list.
+
+
+
+
+
+
+
+
+
+
R
+
+
+xlist$b[2]
+
+
+
OUTPUT
+
+
[1] 2
+
+
+
R
+
+
+xlist[[2]][2]
+
+
+
OUTPUT
+
+
[1] 2
+
+
+
R
+
+
+xlist[["b"]][2]
+
+
+
OUTPUT
+
+
[1] 2
+
+
+
+
+
+
+
+
+
+
+
Challenge 6
+
+
+
Given a linear model:
+
+
R
+
+
+mod <- aov(pop ~ lifeExp, data = gapminder)
+
+
Extract the residual degrees of freedom (hint:
+attributes() will help you)
+
+
+
+
+
+
+
+
+
+
R
+
+
+attributes(mod) ## `df.residual` is one of the names of `mod`
+
+
+
R
+
+
+mod$df.residual
+
+
+
+
+
+
Data frames
+
+
+
Remember that data frames are lists under the hood, so similar
+rules apply. However, they are also two-dimensional objects:
+
[ with one argument will act the same way as for lists,
+where each list element corresponds to a column. The resulting object
+will be a data frame:
Similarly, [[ will act to extract a single
+column:
+
+
R
+
+
+head(gapminder[["lifeExp"]])
+
+
+
OUTPUT
+
+
[1] 28.801 30.332 31.997 34.020 36.088 38.438
+
+
And $ provides a convenient shorthand to extract columns
+by name:
+
+
R
+
+
+head(gapminder$year)
+
+
+
OUTPUT
+
+
[1] 1952 1957 1962 1967 1972 1977
+
+
With two arguments, [ behaves the same way as for
+matrices:
+
+
R
+
+
+gapminder[1:3,]
+
+
+
OUTPUT
+
+
      country year      pop continent lifeExp gdpPercap
+1 Afghanistan 1952  8425333      Asia  28.801  779.4453
+2 Afghanistan 1957  9240934      Asia  30.332  820.8530
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+
+
If we subset a single row, the result will be a data frame (because
+the elements are mixed types):
+
+
R
+
+
+gapminder[3,]
+
+
+
OUTPUT
+
+
      country year      pop continent lifeExp gdpPercap
+3 Afghanistan 1962 10267083      Asia  31.997  853.1007
+
+
But for a single column the result will be a vector (this can be
+changed with the third argument, drop = FALSE).
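The difference between these forms can be checked on a tiny example. This is a sketch only: the data frame df is a made-up stand-in for gapminder, and the result names are hypothetical.

```r
# Small made-up stand-in for the gapminder data frame
df <- data.frame(country = c("A", "B", "C"),
                 year    = c(1952, 1957, 1962),
                 lifeExp = c(28.8, 30.3, 32.0))

one_col_df  <- df["lifeExp"]                  # one argument: still a data frame
one_col_vec <- df[, "lifeExp"]                # two arguments: simplified to a vector
kept_df     <- df[, "lifeExp", drop = FALSE]  # drop = FALSE keeps the data frame

class(one_col_df)
class(one_col_vec)
class(kept_df)
```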
+
+
+
+
+
+
Challenge 7
+
+
+
Fix each of the following common data frame subsetting errors:
+
+
Extract observations collected for the year 1957
+
+
+
R
+
+
gapminder[gapminder$year =1957,]
+
+
+
Extract all columns except 1 through to 4
+
+
+
R
+
+
+gapminder[,-1:4]
+
+
+
Extract the rows where the life expectancy is longer than 80
+years
+
+
+
R
+
+
+gapminder[gapminder$lifeExp>80]
+
+
+
Extract the first row, and the fourth and fifth columns
+(continent and lifeExp).
+
+
+
R
+
+
+gapminder[1, 4, 5]
+
+
+
Advanced: extract rows that contain information for the years 2002
+and 2007
+
+
+
R
+
+
+gapminder[gapminder$year==2002|2007,]
+
+
+
+
+
+
+
+
+
+
Fix each of the following common data frame subsetting errors:
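One possible set of fixes is sketched here on a small made-up data frame, gap_demo, that only mimics the column order of gapminder (its values are illustrative assumptions); the same expressions should apply to gapminder itself.

```r
# Made-up stand-in with the same column order as gapminder
gap_demo <- data.frame(
  country   = c("A", "A", "B", "B"),
  year      = c(1957, 2002, 2002, 2007),
  pop       = c(1e6, 2e6, 3e6, 4e6),
  continent = c("Asia", "Asia", "Europe", "Europe"),
  lifeExp   = c(45.2, 81.3, 79.9, 82.6),
  gdpPercap = c(800, 30000, 28000, 33000)
)

gap_demo[gap_demo$year == 1957, ]             # 1. compare with ==, not =
gap_demo[, -(1:4)]                            # 2. negate the whole range 1:4
gap_demo[gap_demo$lifeExp > 80, ]             # 3. add the comma to select rows
gap_demo[1, c(4, 5)]                          # 4. combine column indices with c()
gap_demo[gap_demo$year %in% c(2002, 2007), ]  # 5. test each year against both values
```

For the last item, gap_demo$year == 2002 | gap_demo$year == 2007 is an equivalent, if longer, condition.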
Write conditional statements with if...else statements
+and ifelse().
+
Write and understand for() loops.
+
+
+
+
+
+
+
Often when we’re coding we want to control the flow of our actions.
+This can be done by setting actions to occur only if a condition or a
+set of conditions is met. Alternatively, we can also set an action to
+occur a particular number of times.
+
There are several ways you can control flow in R. For conditional
+statements, the most commonly used approaches are the constructs:
+
+
R
+
+
# if
+if (condition is true) {
+ perform action
+}
+
+# if ... else
+if (condition is true) {
+ perform action
+} else { # that is, if the condition is false,
+ perform alternative action
+}
+
+
Say, for example, that we want R to print a message if a variable
+x has a particular value:
+
+
R
+
+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+}
+
+x
+
+
+
OUTPUT
+
+
[1] 8
+
+
The print statement does not appear in the console because x is not
+greater than or equal to 10. To print a different message for numbers less than 10,
+we can add an else statement.
+
+
R
+
+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else {
+  print("x is less than 10")
+}
+
+
+
OUTPUT
+
+
[1] "x is less than 10"
+
+
You can also test multiple conditions by using
+else if.
+
+
R
+
+
+x <- 8
+
+if (x >= 10) {
+  print("x is greater than or equal to 10")
+} else if (x > 5) {
+  print("x is greater than 5, but less than 10")
+} else {
+  print("x is less than or equal to 5")
+}
+
+
+
OUTPUT
+
+
[1] "x is greater than 5, but less than 10"
+
+
Important: when R evaluates the condition inside
+if() statements, it is looking for a logical element, i.e.,
+TRUE or FALSE. This can cause some headaches
+for beginners. For example:
+
+
R
+
+
+x <- 4 == 3
+if (x) {
+  "4 equals 3"
+} else {
+  "4 does not equal 3"
+}
+
+
+
OUTPUT
+
+
[1] "4 does not equal 3"
+
+
As we can see, the “not equal” message was printed because the vector x
+is FALSE:
+
+
R
+
+
+x<-4==3
+x
+
+
+
OUTPUT
+
+
[1] FALSE
+
+
+
+
+
+
+
Challenge 1
+
+
+
Use an if() statement to print a suitable message
+reporting whether there are any records from 2002 in the
+gapminder dataset. Now do the same for 2012.
+
+
+
+
+
+
+
+
+
We will first see a solution to Challenge 1 which does not use the
+any() function. We first use a logical vector, describing
+which elements of gapminder$year are equal to
+2002, to select the matching rows:
+
+
R
+
+
+gapminder[(gapminder$year==2002),]
+
+
Then, we count the number of rows of the data.frame
+gapminder that correspond to the year 2002, storing the result of
+nrow(gapminder[(gapminder$year==2002),]) in rows2002_number.
+
The presence of any record for the year 2002 is equivalent to the
+request that rows2002_number is one or more:
+
+
R
+
+
+rows2002_number>=1
+
+
Putting all together, we obtain:
+
+
R
+
+
+if (nrow(gapminder[(gapminder$year == 2002), ]) >= 1) {
+  print("Record(s) for the year 2002 found.")
+}
+
+
All this can be done more quickly with any(). The
+logical condition can be expressed as:
+
+
R
+
+
+if (any(gapminder$year == 2002)) {
+  print("Record(s) for the year 2002 found.")
+}
+
+
+
+
+
+
Did anyone get an error message like this?
+
+
ERROR
+
+
Error in if (gapminder$year == 2012) {: the condition has length > 1
+
+
The if() function only accepts singular (of length 1)
+inputs. In R 4.2.0 and later it returns an error when you use it with a
+longer vector; older versions of R issued a warning instead and evaluated
+only the first element of the vector. Therefore, to use the
+if() function, you need to make sure your input is singular
+(of length 1).
+
+
+
+
+
+
Tip: Built in ifelse()
+function
+
+
+
R accepts both if() and
+else if() statements structured as outlined above, but also
+statements using R’s built-in ifelse()
+function. This function accepts both singular and vector inputs and is
+structured as follows:
+
+
R
+
+
# ifelse function
+ifelse(condition is true, perform action, perform alternative action)
+
+
where the first argument is the condition or a set of conditions to
+be met, the second argument is the value returned when the
+condition is TRUE, and the third argument is the value
+returned when the condition is FALSE.
+
+
R
+
+
+y <- -3
+ifelse(y < 0, "y is a negative number", "y is either positive or zero")
+
+
+
OUTPUT
+
+
[1] "y is a negative number"
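Because ifelse() also accepts vector inputs, the same call can classify several values at once. A minimal sketch, with made-up values and hypothetical names y_values and signs:

```r
# Made-up input vector
y_values <- c(-3, 0, 4)

# The condition is tested element by element; one result is returned per element
signs <- ifelse(y_values < 0, "negative", "positive or zero")
signs
```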
+
+
+
+
+
+
+
+
+
+
Tip: any() and
+all()
+
+
+
The any() function will return TRUE if at
+least one TRUE value is found within a vector, otherwise it
+will return FALSE. This can be used in a similar way to the
+%in% operator. The function all(), as the name
+suggests, will only return TRUE if all values in the vector
+are TRUE.
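A short sketch of both functions on a made-up vector of years (the vector is an assumption for illustration, not the gapminder column):

```r
# Made-up years
years <- c(1952, 1957, 2002, 2007)

any(years == 2002)   # TRUE: at least one element matches
any(years == 2012)   # FALSE: no element matches
all(years > 1950)    # TRUE: every element passes the test
all(years > 2000)    # FALSE: some elements fail the test
```

Both functions return a single logical value, which makes them suitable as the condition of an if() statement.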
+
+
+
+
Repeating operations
+
+
+
If you want to iterate over a set of values, when the order of
+iteration is important, and perform the same operation on each, a
+for() loop will do the job. We saw for() loops
+in the shell
+lessons earlier. This is the most flexible of looping operations,
+but therefore also the hardest to use correctly. In general, the advice
+of many R users would be to learn about for()
+loops, but to avoid using for() loops unless the order of
+iteration is important: i.e. the calculation at each iteration depends
+on the results of previous iterations. If the order of iteration is not
+important, then you should learn about vectorized alternatives, such as
+the purrr package, as they pay off in computational
+efficiency.
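As an illustration, here is a sketch of two nested for() loops; the index names i and j and the values they run over are chosen only for this example.

```r
# Outer loop over numbers, inner loop over letters
for (i in 1:3) {
  for (j in c("a", "b")) {
    print(paste(i, j))   # prints "1 a", "1 b", "2 a", ...
  }
}
```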
We notice in the output that when the first index (i) is
+set to 1, the second index (j) iterates through its full
+set of indices. Once the indices of j have been iterated
+through, then i is incremented. This process continues
+until the last index has been used for each for() loop.
+
Rather than printing the results, we could write the loop output to a
+new object.
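A sketch of that idea, assuming the object names output_vector and temp_output and a 5 x 5 set of index values (a plausible reconstruction for illustration, not necessarily the lesson's exact code):

```r
output_vector <- c()   # start with an empty object
for (i in 1:5) {
  for (j in c("a", "b", "c", "d", "e")) {
    temp_output <- paste(i, j)
    # append one element; the whole vector is copied on every iteration
    output_vector <- c(output_vector, temp_output)
  }
}
output_vector
```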
This approach can be useful, but ‘growing your results’ (building the
+result object incrementally) is computationally inefficient, so avoid it
+when you are iterating through a lot of values.
+
+
+
+
+
+
Tip: don’t grow your results
+
+
+
One of the biggest things that trips up novices and experienced R
+users alike is building a results object (vector, list, matrix, data
+frame) as your for loop progresses. Computers are very bad at handling
+this, so your calculations can very quickly slow to a crawl. It’s much
+better to define an empty results object of appropriate
+dimensions beforehand, rather than initializing an empty object without dimensions.
+So if you know the end result will be stored in a matrix like above,
+create an empty matrix with 5 rows and 5 columns, then at each iteration
+store the results in the appropriate location.
+
+
+
+
A better way is to define your (empty) output object before filling
+in the values. For this example, it looks more involved, but is still
+more efficient.
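A sketch of the pre-allocated version, assuming the names output_matrix, temp_output and output_vector2; the helper j_vector is a hypothetical name introduced only for this example.

```r
output_matrix <- matrix(nrow = 5, ncol = 5)   # empty 5 x 5 matrix, allocated once
j_vector <- c("a", "b", "c", "d", "e")        # hypothetical helper holding the letters

for (i in 1:5) {
  for (j in 1:5) {
    temp_output <- paste(i, j_vector[j])
    output_matrix[i, j] <- temp_output        # fill the slot for row i, column j
  }
}

# Flatten the matrix into a vector (this reads the matrix column by column)
output_vector2 <- as.vector(output_matrix)
output_vector2
```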
Sometimes you will find yourself needing to repeat an operation as
+long as a certain condition is met. You can do this with a
+while() loop.
+
+
R
+
+
while(this condition is true){
+ do a thing
+}
+
+
R will interpret a condition being met as “TRUE”.
+
As an example, here’s a while loop that generates random numbers from
+a uniform distribution (the runif() function) between 0 and
+1 until it gets one that’s less than 0.1.
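A minimal sketch of such a loop; the variable name z and the call to set.seed() (added only so that repeated runs give the same draws) are assumptions of this example.

```r
set.seed(42)        # arbitrary seed, only to make the run repeatable
z <- 1              # start above the cutoff so the loop body runs at least once
while (z > 0.1) {
  z <- runif(1)     # one draw from the uniform distribution on (0, 1)
  cat(z, "\n")      # print each draw
}
```

The loop ends as soon as a draw is no longer greater than 0.1, so z holds that final value afterwards.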
while() loops will not always be appropriate. You have
+to be particularly careful that you don’t end up stuck in an infinite
+loop because your condition is always met and hence the while statement
+never terminates.
+
+
+
+
+
+
+
+
+
Challenge 2
+
+
+
Compare the objects output_vector and
+output_vector2. Are they the same? If not, why not? How
+would you change the last block of code to make
+output_vector2 the same as output_vector?
+
+
+
+
+
+
+
+
+
We can check whether the two vectors are identical using the
+all() function:
+
+
R
+
+
+all(output_vector==output_vector2)
+
+
However, all the elements of output_vector can be found
+in output_vector2:
+
+
R
+
+
+all(output_vector%in%output_vector2)
+
+
and vice versa:
+
+
R
+
+
+all(output_vector2%in%output_vector)
+
+
Therefore, the elements in output_vector and
+output_vector2 are just sorted in a different order. This
+is because as.vector() outputs the elements of an input
+matrix going down its columns. Taking a look at
+output_matrix, we can notice that we want its elements by
+rows. The solution is to transpose the output_matrix. We
+can do it either by calling the transpose function t() or
+by inputting the elements in the right order. The first solution
+requires changing the original
+
+
R
+
+
+output_vector2<-as.vector(output_matrix)
+
+
into
+
+
R
+
+
+output_vector2<-as.vector(t(output_matrix))
+
+
The second solution requires changing
+
+
R
+
+
+output_matrix[i, j]<-temp_output
+
+
into
+
+
R
+
+
+output_matrix[j, i]<-temp_output
+
+
+
+
+
+
+
+
+
+
+
Challenge 3
+
+
+
Write a script that loops through the gapminder data by
+continent and prints out whether the mean life expectancy is smaller or
+larger than 50 years.
+
+
+
+
+
+
+
+
+
Step 1: We want to make sure we can extract all the
+unique values of the continent vector
Step 2: We also need to loop over each of these
+continents and calculate the average life expectancy for each
+subset of data. We can do that as follows:
+
+
1. Loop over each of the unique values of ‘continent’
+
2. For each value of continent, create a temporary variable storing
+that subset
+
3. Return the calculated life expectancy to the user by printing the
+output:
Step 3: The exercise only wants the output printed
+if the average life expectancy is less than 50 or greater than 50. So we
+need to add an if() condition before printing, which
+evaluates whether the calculated average life expectancy is above or
+below a threshold, and prints an output conditional on the result. We
+need to amend (3) from above:
+
3a. If the calculated life expectancy is less than some threshold (50
+years), return the continent and a statement that life expectancy is
+less than threshold, otherwise return the continent and a statement that
+life expectancy is greater than threshold:
+
+
R
+
+
+thresholdValue <- 50
+
+for (iContinent in unique(gapminder$continent)) {
+  tmp <- mean(gapminder[gapminder$continent == iContinent, "lifeExp"])
+
+  if (tmp < thresholdValue) {
+    cat("Average Life Expectancy in", iContinent, "is less than", thresholdValue, "\n")
+  } else {
+    cat("Average Life Expectancy in", iContinent, "is greater than", thresholdValue, "\n")
+  } # end if else condition
+  rm(tmp)
+} # end for loop
+
+
+
+
+
+
+
+
+
+
+
Challenge 4
+
+
+
Modify the script from Challenge 3 to loop over each country. This
+time print out whether the life expectancy is smaller than 50, between
+50 and 70, or greater than 70.
+
+
+
+
+
+
+
+
+
We modify our solution to Challenge 3 by now adding two thresholds,
+lowerThreshold and upperThreshold and
+extending our if-else statements:
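A sketch of the extended loop, run here on a small made-up data frame gap_demo that stands in for gapminder (its countries and values are illustrative assumptions); replacing gap_demo with gapminder should give the per-country report.

```r
# Made-up stand-in for gapminder: three countries, two records each
gap_demo <- data.frame(
  country = rep(c("A", "B", "C"), each = 2),
  lifeExp = c(40, 44, 55, 65, 72, 78)
)

lowerThreshold <- 50
upperThreshold <- 70

for (iCountry in unique(gap_demo$country)) {
  tmp <- mean(gap_demo[gap_demo$country == iCountry, "lifeExp"])

  if (tmp < lowerThreshold) {
    cat("Average Life Expectancy in", iCountry, "is less than", lowerThreshold, "\n")
  } else if (tmp <= upperThreshold) {
    cat("Average Life Expectancy in", iCountry, "is between", lowerThreshold, "and", upperThreshold, "\n")
  } else {
    cat("Average Life Expectancy in", iCountry, "is greater than", upperThreshold, "\n")
  }
}
```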
Write a script that loops over each country in the
+gapminder dataset, tests whether the country starts with a
+‘B’, and graphs life expectancy against time as a line graph if the mean
+life expectancy is under 50 years.
+
+
+
+
+
+
+
+
+
We will use the grep() command that was introduced in
+the Unix
+Shell lesson to find countries that start with “B.” Let’s understand
+how to do this first. Following from the Unix shell section we may be
+tempted to try the following:
+
+
R
+
+
+grep("^B", unique(gapminder$country))
+
+
But when we evaluate this command it returns the indices of the
+factor variable country that start with “B.” To get the
+values, we must add the value=TRUE option to the
+grep() command:
+
+
R
+
+
+grep("^B", unique(gapminder$country), value =TRUE)
+
+
We will now store these countries in a variable called
+candidateCountries, and then loop over each entry in the variable.
+Inside the loop, we evaluate the average life expectancy for each
+country, and if the average life expectancy is less than 50 we use
+base-plot to plot the evolution of average life expectancy using
+with() and subset():
+
+
R
+
+
+thresholdValue <- 50
+candidateCountries <- grep("^B", unique(gapminder$country), value = TRUE)
+
+for (iCountry in candidateCountries) {
+  tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"])
+
+  if (tmp < thresholdValue) {
+    cat("Average Life Expectancy in", iCountry, "is less than", thresholdValue, "plotting life expectancy graph... \n")
+
+    with(subset(gapminder, country == iCountry),
+      plot(year, lifeExp,
+        type = "o",
+        main = paste("Life Expectancy in", iCountry, "over time"),
+        ylab = "Life Expectancy",
+        xlab = "Year"
+      ) # end plot
+    ) # end with
+  } # end if
+  rm(tmp)
+} # end for loop
Today we’ll be learning about the ggplot2 package, because it is the
+most effective for creating publication-quality graphics.
+
ggplot2 is built on the grammar of graphics, the idea that any plot
+can be built from the same set of components: a data
+set, mapping aesthetics, and graphical
+layers:
+
+
Data sets are the data that you, the user,
+provide.
+
Mapping aesthetics are what connect the data to
+the graphics. They tell ggplot2 how to use your data to affect how the
+graph looks, such as changing what is plotted on the X or Y axis, or the
+size or color of different data points.
+
Layers are the actual graphical output from
+ggplot2. Layers determine what kinds of plot are shown (scatterplot,
+histogram, etc.), the coordinate system used (rectangular, polar,
+others), and other important aspects of the plot. The idea of layers of
+graphics may be familiar to you if you have used image editing programs
+like Photoshop, Illustrator, or Inkscape.
+
+
Let’s start off building an example using the gapminder data from
+earlier. The most basic function is ggplot, which lets R
+know that we’re creating a new plot. Any of the arguments we give the
+ggplot function are the global options for the
+plot: they apply to all layers on the plot.
+
+
R
+
+
+library("ggplot2")
+ggplot(data =gapminder)
+
+
Here we called ggplot and told it what data we want to
+show on our figure. This is not enough information for
+ggplot to actually draw anything. It only creates a blank
+slate for other elements to be added to.
+
Now we’re going to add in the mapping aesthetics
+using the aes function. aes tells
+ggplot how variables in the data map to
+aesthetic properties of the figure, such as which columns of
+the data should be used for the x and
+y locations.
+
+
R
+
+
+ggplot(data =gapminder, mapping =aes(x =gdpPercap, y =lifeExp))
+
+
Here we told ggplot we want to plot the “gdpPercap”
+column of the gapminder data frame on the x-axis, and the “lifeExp”
+column on the y-axis. Notice that we didn’t need to explicitly pass
+aes these columns
+(e.g. x = gapminder[, "gdpPercap"]). This is because
+ggplot is smart enough to know to look in the
+data for that column!
+
The final part of making our plot is to tell ggplot how
+we want to visually represent the data. We do this by adding a new
+layer to the plot using one of the
+geom functions.
+
+
R
+
+
+ggplot(data =gapminder, mapping =aes(x =gdpPercap, y =lifeExp))+
+geom_point()
+
+
Here we used geom_point, which tells ggplot
+we want to visually represent the relationship between
+x and y as a scatterplot of
+points.
+
+
+
+
+
+
Challenge 1
+
+
+
Modify the example so that the figure shows how life expectancy has
+changed over time:
+
+
R
+
+
+ggplot(data =gapminder, mapping =aes(x =gdpPercap, y =lifeExp))+geom_point()
+
+
Hint: the gapminder dataset has a column called “year”, which should
+appear on the x-axis.
+
+
+
+
+
+
+
+
+
Here is one possible solution:
+
+
R
+
+
+ggplot(data =gapminder, mapping =aes(x =year, y =lifeExp))+geom_point()
+
+
+
+
+
+
+
+
+
+
+
+
Challenge 2
+
+
+
In the previous examples and challenge we’ve used the
+aes function to tell the scatterplot geom
+about the x and y locations of each
+point. Another aesthetic property we can modify is the point
+color. Modify the code from the previous challenge to
+color the points by the “continent” column. What trends
+do you see in the data? Are they what you expected?
+
+
+
+
+
+
+
+
+
The solution presented below adds color=continent to the
+call of the aes function. The general trend seems to
+indicate an increased life expectancy over the years. On continents with
+stronger economies we find a longer life expectancy.
+
+
R
+
+
+ggplot(data =gapminder, mapping =aes(x =year, y =lifeExp, color=continent))+
+geom_point()
+
+
+
+
+
+
+
Layers
+
+
+
Using a scatterplot probably isn’t the best for visualizing change
+over time. Instead, let’s tell ggplot to visualize the data
+as a line plot:
Instead of adding a geom_point layer, we’ve added a
+geom_line layer.
+
However, the result doesn’t look quite as we might have expected: it
+seems to be jumping around a lot in each continent. Let’s try to
+separate the data by country, plotting one line for each country:
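A minimal sketch of a per-country line plot, assuming the ggplot2 package is installed and using a small made-up data frame gap_demo in place of gapminder (its values are illustrative only). The group aesthetic asks ggplot2 to draw one line per country.

```r
library(ggplot2)

# Made-up stand-in for gapminder: two countries per continent, three years each
gap_demo <- data.frame(
  country   = rep(c("A", "B", "C", "D"), each = 3),
  continent = rep(c("Asia", "Europe"), each = 6),
  year      = rep(c(1952, 1957, 1962), times = 4),
  lifeExp   = c(40, 42, 45, 50, 53, 55, 65, 67, 69, 70, 71, 73)
)

# group = country draws a separate line for each country;
# color = continent still colors the lines by continent
p <- ggplot(data = gap_demo,
            mapping = aes(x = year, y = lifeExp, group = country, color = continent)) +
  geom_line()
p
```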
It’s important to note that each layer is drawn on top of the
+previous layer. In this example, the points have been drawn on top
+of the lines. Here’s a demonstration:
In this example, the aesthetic mapping of
+color has been moved from the global plot options in
+ggplot to the geom_line layer so it no longer
+applies to the points. Now we can clearly see that the points are drawn
+on top of the lines.
+
+
+
+
+
+
Tip: Setting an aesthetic to a value instead
+of a mapping
+
+
+
So far, we’ve seen how to use an aesthetic (such as
+color) as a mapping to a variable in the data.
+For example, when we use
+geom_line(mapping = aes(color=continent)), ggplot will give
+a different color to each continent. But what if we want to change the
+color of all lines to blue? You may think that
+geom_line(mapping = aes(color="blue")) should work, but it
+doesn’t. Since we don’t want to create a mapping to a specific variable,
+we can move the color specification outside of the aes()
+function, like this: geom_line(color="blue").
+
+
+
+
+
+
+
+
+
Challenge 3
+
+
+
Switch the order of the point and line layers from the previous
+example. What happened?
ggplot2 also makes it easy to overlay statistical models over the
+data. To demonstrate we’ll go back to our first example:
+
+
R
+
+
+ggplot(data =gapminder, mapping =aes(x =gdpPercap, y =lifeExp))+
+geom_point()
+
+
Currently it’s hard to see the relationship between the points due to
+some strong outliers in GDP per capita. We can change the scale of units
+on the x axis using the scale functions. These control the
+mapping between the data values and visual values of an aesthetic. We
+can also modify the transparency of the points, using the alpha
+argument, which is especially helpful when you have a large amount of
+data which is very clustered.
+
+
R
+
+
+ggplot(data =gapminder, mapping =aes(x =gdpPercap, y =lifeExp))+
+geom_point(alpha =0.5)+scale_x_log10()
+
+
The scale_x_log10 function applied a transformation to
+the coordinate system of the plot, so that each multiple of 10 is evenly
+spaced from left to right. For example, a GDP per capita of 1,000 is the
+same horizontal distance away from a value of 10,000 as the 10,000 value
+is from 100,000. This helps to visualize the spread of the data along
+the x-axis.
+
+
+
+
+
+
Tip Reminder: Setting an aesthetic to a value
+instead of a mapping
+
+
+
Notice that we used geom_point(alpha = 0.5). As the
+previous tip mentioned, using a setting outside of the
+aes() function will cause this value to be used for all
+points, which is what we want in this case. But just like any other
+aesthetic setting, alpha can also be mapped to a variable in
+the data. For example, we can give a different transparency to each
+continent with
+geom_point(mapping = aes(alpha = continent)).
+
+
+
+
We can fit a simple relationship to the data by adding another layer,
+geom_smooth:
+
+
R
+
+
+ggplot(data =gapminder, mapping =aes(x =gdpPercap, y =lifeExp))+
+geom_point(alpha =0.5)+scale_x_log10()+geom_smooth(method="lm")
+
+
+
OUTPUT
+
+
`geom_smooth()` using formula = 'y ~ x'
+
+
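We can make the line thicker by setting the
+size aesthetic in the geom_smooth
+layer. Since ggplot2 3.4.0 the preferred name for line thickness is
+linewidth, which is why this code triggers a deprecation warning: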
+
+
R
+
+
+ggplot(data =gapminder, mapping =aes(x =gdpPercap, y =lifeExp))+
+geom_point(alpha =0.5)+scale_x_log10()+geom_smooth(method="lm", size=1.5)
+
+
+
WARNING
+
+
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
+ℹ Please use `linewidth` instead.
+This warning is displayed once every 8 hours.
+Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
+generated.
+
+
+
OUTPUT
+
+
`geom_smooth()` using formula = 'y ~ x'
+
+
There are two ways an aesthetic can be specified. Here we
+set the size aesthetic by passing it as an
+argument to geom_smooth. Previously in the lesson we’ve
+used the aes function to define a mapping between
+data variables and their visual representation.
+
+
+
+
+
+
Challenge 4a
+
+
+
Modify the color and size of the points on the point layer in the
+previous example.
+
Hint: do not use the aes function.
+
+
+
+
+
+
+
+
+
Here is a possible solution: Notice that the color argument
+is supplied outside of the aes() function. This means that
+it applies to all data points on the graph and is not related to a
+specific variable.
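One way this could look is sketched here, assuming ggplot2 3.4.0 or later (for the linewidth argument) and a small made-up data frame gap_demo in place of gapminder. The color "orange" and the size 3 are arbitrary choices for illustration.

```r
library(ggplot2)

# Made-up stand-in for the gdpPercap and lifeExp columns of gapminder
gap_demo <- data.frame(
  gdpPercap = c(500, 1200, 4000, 9000, 25000, 40000),
  lifeExp   = c(42, 50, 61, 68, 76, 80)
)

# size and color are set outside aes(), so they apply to every point
p <- ggplot(data = gap_demo, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(size = 3, color = "orange") +
  scale_x_log10() +
  geom_smooth(method = "lm", linewidth = 1.5)  # linewidth needs ggplot2 >= 3.4.0
p
```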
Modify your solution to Challenge 4a so that the points are now a
+different shape and are colored by continent with new trendlines. Hint:
+The color argument can be used inside the aesthetic.
+
+
+
+
+
+
+
+
+
Here is a possible solution: Notice that supplying the
+color argument inside the aes() functions
+enables you to connect it to a certain variable. The shape
+argument, as you can see, modifies all data points the same way (it is
+outside the aes() call) while the color
+argument which is placed inside the aes() call modifies a
+point’s color based on its continent value.
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
+  geom_point(size = 3, shape = 17) + scale_x_log10() +
+  geom_smooth(method = "lm", size = 1.5)
+
+
+
OUTPUT
+
+
`geom_smooth()` using formula = 'y ~ x'
+
+
+
+
+
+
+
Multi-panel figures
+
+
+
Earlier we visualized the change in life expectancy over time across
+all countries in one plot. Alternatively, we can split this out over
+multiple panels by adding a layer of facet panels.
+
+
+
+
+
+
Tip
+
+
+
We start by making a subset of data including only countries located
+in the Americas. This includes 25 countries, which will begin to clutter
+the figure. Note that we apply a “theme” definition to rotate the x-axis
+labels to maintain readability. Nearly everything in ggplot2 is
+customizable.
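A sketch of the subsetting and faceting steps described above, assuming ggplot2 is installed. The data frame gap_demo is a made-up stand-in for gapminder, and the 45 degree rotation is an arbitrary choice for this example.

```r
library(ggplot2)

# Made-up stand-in for gapminder with one non-Americas country
gap_demo <- data.frame(
  country   = rep(c("A", "B", "C"), each = 3),
  continent = rep(c("Americas", "Americas", "Asia"), each = 3),
  year      = rep(c(1952, 1957, 1962), times = 3),
  lifeExp   = c(60, 62, 64, 55, 57, 60, 40, 43, 46)
)

# Keep only the rows for the Americas
americas <- gap_demo[gap_demo$continent == "Americas", ]

# One panel per country; the theme call rotates the x-axis labels
p <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
  geom_line() +
  facet_wrap(~ country) +
  theme(axis.text.x = element_text(angle = 45))
p
```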
The facet_wrap layer took a “formula” as its argument,
+denoted by the tilde (~). This tells R to draw a panel for each unique
+value in the country column of the gapminder dataset.
+
Modifying text
+
+
+
To clean this figure up for a publication we need to change some of
+the text elements. The x-axis is too cluttered, and the y axis should
+read “Life expectancy”, rather than the column name in the data
+frame.
+
We can do this by adding a couple of different layers. The
+theme layer controls the axis text, and overall text
+size. Labels for the axes, plot title and any legend can be set using
+the labs function. Legend titles are set using the same
+names we used in the aes specification. Thus below the
+color legend title is set using color = "Continent", while
+the title of a fill legend would be set using
+fill = "MyTitle".
+
+
R
+
+
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color = continent)) +
+  geom_line() + facet_wrap(~ country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+
+
Exporting the plot
+
+
+
The ggsave() function allows you to export a plot
+created with ggplot. You can specify the dimension and resolution of
+your plot by adjusting the appropriate arguments (width,
+height and dpi) to create high quality
+graphics for publication. In order to save the plot from above, we first
+assign it to a variable lifeExp_plot, then tell
+ggsave to save that plot in png format to a
+directory called results. (Make sure you have a
+results/ folder in your working directory.)
+
+
R
+
+
+lifeExp_plot <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color = continent)) +
+  geom_line() + facet_wrap(~ country) +
+  labs(
+    x = "Year",              # x axis title
+    y = "Life expectancy",   # y axis title
+    title = "Figure 1",      # main title of figure
+    color = "Continent"      # title of legend
+  ) +
+  theme(axis.text.x = element_text(angle = 90, hjust = 1))
+
+ggsave(filename = "results/lifeExp.png", plot = lifeExp_plot, width = 12, height = 10, dpi = 300, units = "cm")
+
+
There are two nice things about ggsave. First, it
+defaults to the last plot, so if you omit the plot argument
+it will automatically save the last plot you created with
+ggplot. Secondly, it tries to determine the format you want
+to save your plot in from the file extension you provide for the
+filename (for example .png or .pdf). If you
+need to, you can specify the format explicitly in the
+device argument.
+
This is a taste of what you can do with ggplot2. RStudio provides a
+really useful cheat
+sheet of the different layers available, and more extensive
+documentation is available on the ggplot2 website. All
+RStudio cheat sheets can be found here. Finally,
+if you have no idea how to change something, a quick Google search will
+usually send you to a relevant question and answer on Stack Overflow
+with reusable code to modify!
+
+
+
+
+
+
Challenge 5
+
+
+
Generate boxplots to compare life expectancy between the different
+continents during the available years.
+
Advanced:
+
+
Rename y axis as Life Expectancy.
+
Remove x axis labels.
+
+
+
+
+
+
+
+
+
+
Here is a possible solution: xlab() and ylab()
+set labels for the x and y axes, respectively. The axis title, text and
+ticks are attributes of the theme and must be modified within a
+theme() call.
+
+
R
+
+
+ggplot(data = gapminder, mapping = aes(x = continent, y = lifeExp, fill = continent)) +
+  geom_boxplot() + facet_wrap(~ year) +
+  ylab("Life Expectancy") +
+  theme(axis.title.x = element_blank(),
+        axis.text.x = element_blank(),
+        axis.ticks.x = element_blank())
+
+
+
+
+
+
+
+
+
+
+
+
Keypoints
+
+
+
+
Use ggplot2 to create plots.
+
Think about graphics in layers: aesthetics, geometry, statistics,
+scale transformation, and grouping.
How can I operate on all the elements of a vector at once?
+
+
+
+
+
+
+
+
Objectives
+
+
To understand vectorized operations in R.
+
+
+
+
+
+
+
Most of R’s functions are vectorized, meaning that the function will
+operate on all elements of a vector without needing to loop through and
+act on each element one at a time. This makes code more concise,
+easier to read, and less error-prone.
+
+
R
+
+
+x <- 1:4
+x * 2
+
+
+
OUTPUT
+
+
[1] 2 4 6 8
+
+
The multiplication happened to each element of the vector.
+
We can also add two vectors together:
+
+
R
+
+
+y <- 6:9
+x + y
+
+
+
OUTPUT
+
+
[1] 7 9 11 13
+
+
Each element of x was added to its corresponding element
+of y:
+
+
R
+
+
x:  1  2  3  4
+    +  +  +  +
+y:  6  7  8  9
+---------------
+    7  9 11 13
+
+
Here is how we would add two vectors together using a for loop:
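A minimal sketch of such a loop, assuming x and y hold the values 1:4 and 6:9 defined above (the name output_vector is only illustrative):

```r
x <- 1:4
y <- 6:9

# Pre-allocate the result, then add the vectors one element at a time
output_vector <- numeric(length(x))
for (i in seq_along(x)) {
  output_vector[i] <- x[i] + y[i]
}
output_vector
```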
Compare this to the output using vectorised operations.
+
+
R
+
+
+sum_xy <- x + y
+sum_xy
+
+
+
OUTPUT
+
+
[1] 7 9 11 13
+
+
+
+
+
+
+
Challenge 1
+
+
+
Let’s try this on the pop column of the
+gapminder dataset.
+
Make a new column in the gapminder data frame that
+contains population in units of millions of people. Check the head or
+tail of the data frame to make sure it worked.
+
+
+
+
+
+
+
+
+
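One possible solution, sketched on a small stand-in data frame so that it runs on its own (gap_small, its values and the column name pop_millions are assumptions; with the lesson's data you would work on gapminder directly):

```r
# Illustrative rows only, not the real gapminder data
gap_small <- data.frame(
  country = c("A", "B", "C"),
  pop     = c(8425333, 1282697, 40301927)
)

# Dividing a whole column by a single number is a vectorized operation
gap_small$pop_millions <- gap_small$pop / 1e6
head(gap_small)
```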
Operations can also be performed on vectors of unequal length,
+through a process known as recycling. This process
+automatically repeats the shorter vector until it matches the length of
+the longer vector. R will give a warning if the length of the longer
+vector is not a multiple of the length of the shorter one.
+
+
R
+
+
+x <- c(1, 2, 3)
+y <- c(1, 2, 3, 4, 5, 6, 7)
+x + y
+
+
+
WARNING
+
+
Warning in x + y: longer object length is not a multiple of shorter object
+length
+
+
+
OUTPUT
+
+
[1] 2 4 6 5 7 9 8
+
+
Vector x was recycled to match the length of vector
+y, so the sum was computed as (1, 2, 3, 1, 2, 3, 1) + (1, 2, 3, 4, 5, 6, 7).
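When the longer length is an exact multiple of the shorter one, recycling happens silently. A small sketch with made-up vectors (the names short and long are illustrative):

```r
short <- c(1, 2)
long  <- 1:6

# short is recycled three times (1 2 1 2 1 2) and no warning is raised
recycled_sum <- short + long
recycled_sum
```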
Check argument conditions with stopifnot() in
+functions.
+
Test a function.
+
Set default values for function arguments.
+
Explain why we should divide programs into small, single-purpose
+functions.
+
+
+
+
+
+
+
If we only had one data set to analyze, it would probably be faster
+to load the file into a spreadsheet and use that to plot simple
+statistics. However, the gapminder data is updated periodically, and we
+may want to pull in that new information later and re-run our analysis.
+We may also obtain similar data from a different source in the
+future.
+
In this lesson, we’ll learn how to write a function so that we can
+repeat several operations with a single command.
+
+
+
+
+
+
What is a function?
+
+
+
Functions gather a sequence of operations into a whole, preserving it
+for ongoing use. Functions provide:
+
+
a name we can remember and invoke it by
+
relief from the need to remember the individual operations
+
a defined set of inputs and expected outputs
+
rich connections to the larger programming environment
+
+
As the basic building block of most programming languages,
+user-defined functions constitute “programming” as much as any single
+abstraction can. If you have written a function, you are a computer
+programmer.
+
+
+
+
Defining a function
+
+
+
Let’s open a new R script file in the functions/
+directory and call it functions-lesson.R.
+
The general structure of a function is:
+
+
R
+
+
+my_function <- function(parameters) {
+  # perform action
+  # return value
+}
+
+
Let’s define a function fahr_to_kelvin() that converts
+temperatures from Fahrenheit to Kelvin:
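A definition consistent with the description that follows, and with the arithmetic used later in this lesson, would be (a sketch, not necessarily the exact original listing):

```r
fahr_to_kelvin <- function(temp) {
  # Convert to Celsius degrees first, then shift to the Kelvin scale
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}
```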
We define fahr_to_kelvin() by assigning it to the output
+of function. The list of argument names is contained
+within parentheses. Next, the body of
+the function (the statements that are executed when it runs) is contained
+within curly braces ({}). The statements in the body are
+indented by two spaces. This makes the code easier to read but does not
+affect how the code operates.
+
It is useful to think of creating functions like writing a cookbook.
+First you define the “ingredients” that your function needs. In this
+case, we only need one ingredient to use our function: “temp”. After we
+list our ingredients, we say what we will do with them. In this
+case, we take our ingredient and apply a set of mathematical
+operators to it.
+
When we call the function, the values we pass to it as arguments are
+assigned to those variables so that we can use them inside the function.
+Inside the function, we use a return statement to send a
+result back to whoever asked for it.
+
+
+
+
+
+
Tip
+
+
+
In R, an explicit return statement is not required: a function
+automatically returns the value of the last expression evaluated in its
+body. For clarity, though, we will write the return statement
+explicitly.
+
+
+
+
Let’s try running our function. Calling our own function is no
+different from calling any other function:
+
+
R
+
+
+# freezing point of water
+fahr_to_kelvin(32)
+
+
+
OUTPUT
+
+
[1] 273.15
+
+
+
R
+
+
+# boiling point of water
+fahr_to_kelvin(212)
+
+
+
OUTPUT
+
+
[1] 373.15
+
+
+
+
+
+
+
Challenge 1
+
+
+
Write a function called kelvin_to_celsius() that takes a
+temperature in Kelvin and returns that temperature in Celsius.
+
Hint: To convert from Kelvin to Celsius you subtract 273.15
+
+
+
+
+
+
+
+
+
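One possible solution, following the hint above (a sketch; the internal variable name celsius is an assumption):

```r
kelvin_to_celsius <- function(temp) {
  # Celsius and Kelvin differ only by a constant offset
  celsius <- temp - 273.15
  return(celsius)
}

kelvin_to_celsius(273.15)
```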
Now that we’ve begun to appreciate how writing functions provides an
+efficient way to make R code re-usable and modular, we should note that
+it is important to ensure that functions only work in their intended
+use-cases. Checking function parameters is related to the concept of
+defensive programming. Defensive programming encourages us to
+frequently check conditions and throw an error if something is wrong.
+These checks are referred to as assertion statements because we want to
+assert some condition is TRUE before proceeding. They make
+it easier to debug because they give us a better idea of where the
+errors originate.
+
+
Checking conditions with stopifnot()
+
+
+
Let’s start by re-examining fahr_to_kelvin(), our
+function for converting temperatures from Fahrenheit to Kelvin. It was
+defined like so:
For this function to work as intended, the argument temp
+must be a numeric value; otherwise, the mathematical
+procedure for converting between the two temperature scales will not
+work. To create an error, we can use the function stop().
+For example, since the argument temp must be a
+numeric vector, we could check for this condition with an
+if statement and throw an error if the condition was
+violated. We could augment our function above like so:
+
+
R
+
+
+fahr_to_kelvin <- function(temp) {
+  if (!is.numeric(temp)) {
+    stop("temp must be a numeric vector.")
+  }
+  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
+  return(kelvin)
+}
+
+
If we had multiple conditions or arguments to check, it would take
+many lines of code to check all of them. Luckily R provides the
+convenience function stopifnot(). We can list as many
+requirements as we like, each of which should evaluate to TRUE;
+stopifnot() throws an error if it finds one that is
+FALSE. Listing these conditions also serves a secondary
+purpose as extra documentation for the function.
+
Let’s try out defensive programming with stopifnot() by
+adding assertions to check the input to our function
+fahr_to_kelvin().
+
We want to assert the following: temp is a numeric
+vector. We may do that like so:
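A sketch of that assertion, reusing the conversion formula from the stop() version above (the exact original listing is assumed rather than quoted):

```r
fahr_to_kelvin <- function(temp) {
  # Halt with an error unless temp is numeric
  stopifnot(is.numeric(temp))
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}
```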
+
It still works when given proper input:
+
+
R
+
+
+# freezing point of water
+fahr_to_kelvin(temp = 32)
+
+
+
OUTPUT
+
+
[1] 273.15
+
+
But fails instantly if given improper input.
+
+
R
+
+
+# temp is a factor instead of numeric
+fahr_to_kelvin(temp = as.factor(32))
+
+
+
ERROR
+
+
Error in fahr_to_kelvin(temp = as.factor(32)): is.numeric(temp) is not TRUE
+
+
+
+
+
+
+
Challenge 3
+
+
+
Use defensive programming to ensure that our
+fahr_to_celsius() function throws an error immediately if
+the argument temp is specified inappropriately.
+
+
+
+
+
+
+
+
+
Extend our previous definition of the function by adding in an
+explicit call to stopifnot(). Since
+fahr_to_celsius() is a composition of two other functions,
+checking inside here makes adding checks to the two component functions
+redundant.
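A possible solution along those lines. The two helper functions are repeated here only so the sketch runs on its own, and their exact original definitions are assumed:

```r
fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

kelvin_to_celsius <- function(temp) {
  celsius <- temp - 273.15
  return(celsius)
}

fahr_to_celsius <- function(temp) {
  # Fail early on non-numeric input, before either helper runs
  stopifnot(is.numeric(temp))
  temp_k <- fahr_to_kelvin(temp)
  result <- kelvin_to_celsius(temp_k)
  return(result)
}

fahr_to_celsius(32)
```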
Now, we’re going to define a function that calculates the Gross
+Domestic Product of a nation from the data available in our dataset:
+
+
R
+
+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat) {
+  gdp <- dat$pop * dat$gdpPercap
+  return(gdp)
+}
+
+
We define calcGDP() by assigning it to the output of
+function. The list of argument names is contained within
+parentheses. Next, the body of the function (the statements executed
+when you call the function) is contained within curly braces
+({}).
+
We’ve indented the statements in the body by two spaces. This makes
+the code easier to read but does not affect how it operates.
+
When we call the function, the values we pass to it are assigned to
+the arguments, which become variables inside the body of the
+function.
+
Inside the function, we use the return() function to
+send back the result. This return() function is optional: R
+will automatically return the results of whatever command is executed on
+the last line of the function.
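To see what this first version returns, here is a sketch that calls it on a tiny stand-in data frame (gap_small and its values are made up for illustration; the lesson itself uses gapminder):

```r
calcGDP <- function(dat) {
  gdp <- dat$pop * dat$gdpPercap
  return(gdp)
}

# Made-up values, not real gapminder rows
gap_small <- data.frame(
  country   = c("A", "B", "C"),
  pop       = c(1000, 2000, 3000),
  gdpPercap = c(10, 20, 30)
)

# The result is a bare numeric vector with no country or year attached
gdp_values <- calcGDP(gap_small)
gdp_values
```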
That’s not very informative. Let’s add some more arguments so we can
+extract that per year and country.
+
+
R
+
+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year = NULL, country = NULL) {
+  if (!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country, ]
+  }
+  gdp <- dat$pop * dat$gdpPercap
+
+  new <- cbind(dat, gdp = gdp)
+  return(new)
+}
+
+
If you’ve been writing these functions down into a separate R script
+(a good idea!), you can load the functions into your R session by
+using the source() function:
+
+
R
+
+
+source("functions/functions-lesson.R")
+
+
Ok, so there’s a lot going on in this function now. In plain English,
+the function now subsets the provided data by year if the year argument
+isn’t empty, then subsets the result by country if the country argument
+isn’t empty. Then it calculates the GDP for whatever subset emerges from
+the previous two steps. The function then adds the GDP as a new column
+to the subsetted data and returns this as the final result. You can see
+that the output is much more informative than a vector of numbers.
+
Let’s take a look at what happens when we specify the year:
+
+
R
+
+
+head(calcGDP(gapminder, year=2007))
+
+
+
OUTPUT
+
+
country year pop continent lifeExp gdpPercap gdp
+12 Afghanistan 2007 31889923 Asia 43.828 974.5803 31079291949
+24 Albania 2007 3600523 Europe 76.423 5937.0295 21376411360
+36 Algeria 2007 33333216 Africa 72.301 6223.3675 207444851958
+48 Angola 2007 12420476 Africa 42.731 4797.2313 59583895818
+60 Argentina 2007 40301927 Americas 75.320 12779.3796 515033625357
+72 Australia 2007 20434176 Oceania 81.235 34435.3674 703658358894
+
+
Or for a specific country:
+
+
R
+
+
+calcGDP(gapminder, country="Australia")
+
+
+
OUTPUT
+
+
country year pop continent lifeExp gdpPercap gdp
+61 Australia 1952 8691212 Oceania 69.120 10039.60 87256254102
+62 Australia 1957 9712569 Oceania 70.330 10949.65 106349227169
+63 Australia 1962 10794968 Oceania 70.930 12217.23 131884573002
+64 Australia 1967 11872264 Oceania 71.100 14526.12 172457986742
+65 Australia 1972 13177000 Oceania 71.930 16788.63 221223770658
+66 Australia 1977 14074100 Oceania 73.490 18334.20 258037329175
+67 Australia 1982 15184200 Oceania 74.740 19477.01 295742804309
+68 Australia 1987 16257249 Oceania 76.320 21888.89 355853119294
+69 Australia 1992 17481977 Oceania 77.560 23424.77 409511234952
+70 Australia 1997 18565243 Oceania 78.830 26997.94 501223252921
+71 Australia 2002 19546792 Oceania 80.370 30687.75 599847158654
+72 Australia 2007 20434176 Oceania 81.235 34435.37 703658358894
Here we’ve added two arguments, year, and
+country. We’ve set default arguments for both as
+NULL using the = operator in the function
+definition. This means that those arguments will take on those values
+unless the user specifies otherwise.
Here, we check whether each additional argument is set to
+NULL, and whenever one is not NULL we overwrite
+the dataset stored in dat with a subset given by the
+non-NULL argument.
+
Building these conditionals into the function makes it more flexible
+for later. Now, we can use it to calculate the GDP for:
+
+
The whole dataset;
+
A single year;
+
A single country;
+
A single combination of year and country.
+
+
By using %in% rather than == to subset, we can also give multiple years
+or countries to those arguments.
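The difference matters as soon as more than one value is supplied. A small sketch with a made-up years vector:

```r
years <- c(1952, 1957, 1962, 1967)

# %in% asks, for each element, whether it appears anywhere in the set
in_set <- years %in% c(1967, 1952)
in_set

# == compares position by position, recycling the shorter vector,
# so with the values in this order it matches nothing
pairwise <- years == c(1967, 1952)
pairwise
```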
+
+
+
+
+
+
Tip: Pass by value
+
+
+
Functions in R almost always make copies of the data to operate on
+inside of a function body. When we modify dat inside the
+function we are modifying the copy of the gapminder dataset stored in
+dat, not the original variable we gave as the first
+argument.
+
This is called “pass-by-value” and it makes writing code much safer:
+you can always be sure that whatever changes you make within the body of
+the function stay inside the body of the function.
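A minimal sketch of this behaviour, using a made-up function and data frame (add_flag and original are illustrative names):

```r
add_flag <- function(dat) {
  # This changes only the function's own copy of dat
  dat$flag <- TRUE
  return(dat)
}

original <- data.frame(a = 1:3)
modified <- add_flag(original)

names(original)  # the original is untouched
names(modified)  # the returned copy has the new column
```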
+
+
+
+
+
+
+
+
+
Tip: Function scope
+
+
+
Another important concept is scoping: any variables (or functions!)
+you create or modify inside the body of a function only exist for the
+lifetime of the function’s execution. When we call
+calcGDP(), the variables dat, gdp
+and new only exist inside the body of the function. Even if
+we have variables of the same name in our interactive R session, they
+are not modified in any way when executing a function.
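A short sketch of scoping, with a hypothetical function compute_gdp() that reuses the name gdp internally:

```r
gdp <- "defined in the interactive session"

compute_gdp <- function(pop, gdpPercap) {
  # This gdp is local to the function call
  gdp <- pop * gdpPercap
  return(gdp)
}

local_result <- compute_gdp(10, 5)
local_result
gdp  # still the character string assigned above
```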
+
+
+
+
+
R
+
+
  gdp <- dat$pop * dat$gdpPercap
+  new <- cbind(dat, gdp = gdp)
+  return(new)
+}
+
+
Finally, we calculated the GDP on our new subset, and created a new
+data frame with that column added. This means when we call the function
+later we can see the context for the returned GDP values, which is much
+better than in our first attempt where we got a vector of numbers.
+
+
+
+
+
+
Challenge 4
+
+
+
Test out your GDP function by calculating the GDP for New Zealand in
+1987. How does this differ from New Zealand’s GDP in 1952?
+
+
+
+
+
+
+
+
+
+
R
+
+
+calcGDP(gapminder, year = c(1952, 1987), country = "New Zealand")
+
+
GDP for New Zealand in 1987: 65050008703
+
GDP for New Zealand in 1952: 21058193787
+
+
+
+
+
+
+
+
+
+
Challenge 5
+
+
+
The paste() function can be used to combine text
+together, e.g:
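For instance, assuming a character vector named best_practice (the name used in the fence() call below), a sketch might be:

```r
best_practice <- c("Write", "programs", "for", "people", "not", "computers")

# collapse joins the elements of one vector into a single string
sentence <- paste(best_practice, collapse = " ")
sentence
```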
Write a function called fence() that takes two vectors
+as arguments, called text and wrapper, and
+prints out the text wrapped with the wrapper:
+
+
R
+
+
+fence(text = best_practice, wrapper = "***")
+
+
Note: the paste() function has an argument
+called sep, which specifies the separator between text. The
+default is a space: " ". The default for paste0() is no
+space: "".
+
+
+
+
+
+
+
+
+
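One possible solution, a sketch that reproduces the output shown below (the contents of best_practice are inferred from that output):

```r
fence <- function(text, wrapper) {
  # Put the wrapper on both sides, then join everything with spaces
  text <- c(wrapper, text, wrapper)
  result <- paste(text, collapse = " ")
  return(result)
}

best_practice <- c("Write", "programs", "for", "people", "not", "computers")
fence(text = best_practice, wrapper = "***")
```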
[1] "*** Write programs for people not computers ***"
+
+
+
+
+
+
+
+
+
+
+
Tip
+
+
+
R has some unique aspects that can be exploited when performing more
+complicated operations. We will not be writing anything that requires
+knowledge of these more advanced concepts. In the future when you are
+comfortable writing functions in R, you can learn more by reading the R
+Language Manual or this chapter from Advanced R Programming by Hadley
+Wickham.
+
+
+
+
+
+
+
+
+
Tip: Testing and documenting
+
+
+
It’s important to both test functions and document them:
+documentation helps you, and others, understand what the purpose of your
+function is and how to use it, and it’s important to make sure that your
+function actually does what you think it does.
+
When you first start out, your workflow will probably look a lot like
+this:
+
+
Write a function
+
Comment parts of the function to document its behaviour
+
Load in the source file
+
Experiment with it in the console to make sure it behaves as you
+expect
+
Make any necessary bug fixes
+
Rinse and repeat.
+
+
Formal documentation for functions, written in separate
+.Rd files, gets turned into the documentation you see in
+help files. The roxygen2
+package allows R coders to write documentation alongside the function
+code and then process it into the appropriate .Rd files.
+You will want to switch to this more formal method of writing
+documentation when you start writing more complicated R projects. In
+fact, packages are, in essence, bundles of functions with this formal
+documentation. Loading your own functions through
+source("functions.R") is equivalent to loading someone
+else’s functions (or your own one day!) through
+library("package").
+
Formal automated tests can be written using the testthat package.
+
+
+
+
+
+
+
+
+
Keypoints
+
+
+
+
Use function to define a new function in R.
+
Use parameters to pass values into functions.
+
Use stopifnot() to flexibly check function arguments in
+R.
You have already seen how to save the most recent plot you create in
+ggplot2, using the command ggsave. As a
+refresher:
+
+
R
+
+
+ggsave("My_most_recent_plot.pdf")
+
+
You can save a plot from within RStudio using the ‘Export’ button in
+the ‘Plot’ window. This will give you the option of saving as a .pdf or
+as .png, .jpg or other image formats.
+
Sometimes you will want to save plots without creating them in the
+‘Plot’ window first. Perhaps you want to make a pdf document with
+multiple pages: each one a different plot, for example. Or perhaps
+you’re looping through multiple subsets of a file, plotting data from
+each subset, and you want to save each plot, but obviously can’t stop
+the loop to click ‘Export’ for each one.
+
In this case you can use a more flexible approach. The function
+pdf creates a new pdf device. You can control the size and
+resolution using the arguments to this function.
+
+
R
+
+
+pdf("Life_Exp_vs_time.pdf", width = 12, height = 4)
+ggplot(data = gapminder, aes(x = year, y = lifeExp, colour = country)) +
+  geom_line() +
+  theme(legend.position = "none")
+
+# You then have to make sure to turn off the pdf device!
+
+dev.off()
+
+
Open up this document and have a look.
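For the looping case described above, here is a minimal sketch using base graphics, a made-up data frame and a temporary file (all assumptions made so the example is self-contained):

```r
# Made-up data: two groups with a few points each
dat <- data.frame(
  group = rep(c("a", "b"), each = 3),
  x     = rep(1:3, times = 2),
  y     = c(1, 4, 9, 2, 3, 5)
)

out_file <- tempfile(fileext = ".pdf")
pdf(out_file, width = 6, height = 4)

# Each call to plot() starts a new page in the open pdf device
for (g in unique(dat$group)) {
  subset_g <- dat[dat$group == g, ]
  plot(subset_g$x, subset_g$y, type = "b", main = paste("Group", g))
}

# Close the device so the file is written out completely
dev.off()
file.exists(out_file)
```

With ggplot2 inside a loop, wrap each plot in print() so that it is actually drawn to the device.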
+
+
+
+
+
+
Challenge 1
+
+
+
Rewrite your ‘pdf’ command to print a second page in the pdf, showing
+a facet plot (hint: use facet_grid) of the same data with
+one panel per continent.
How can I do different calculations on different sets of data?
+
+
+
+
+
+
+
+
Objectives
+
+
To be able to use the split-apply-combine strategy for data
+analysis.
+
+
+
+
+
+
+
Previously we looked at how you can use functions to simplify your
+code. We defined the calcGDP function, which takes the
+gapminder dataset, and multiplies the population and GDP per capita
+column. We also defined additional arguments so we could filter by
+year and country:
+
+
R
+
+
+# Takes a dataset and multiplies the population column
+# with the GDP per capita column.
+calcGDP <- function(dat, year = NULL, country = NULL) {
+  if (!is.null(year)) {
+    dat <- dat[dat$year %in% year, ]
+  }
+  if (!is.null(country)) {
+    dat <- dat[dat$country %in% country, ]
+  }
+  gdp <- dat$pop * dat$gdpPercap
+
+  new <- cbind(dat, gdp = gdp)
+  return(new)
+}
+
+
A common task you’ll encounter when working with data is running
+calculations on different groups within the data. In the
+above, we were calculating the GDP by multiplying two columns together.
+But what if we wanted to calculate the mean GDP per continent?
+
We could run calcGDP and then take the mean of each
+continent:
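Sketched on a small stand-in data frame (gap_small and its values are made up, and the cbind() line is a simplified stand-in for calcGDP(); the lesson uses gapminder), that manual approach looks something like:

```r
# Made-up values for illustration only
gap_small <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  pop       = c(100, 200, 300, 500),
  gdpPercap = c(10, 20, 30, 40)
)

# Simplified stand-in for calcGDP(): add a gdp column
withGDP <- cbind(gap_small, gdp = gap_small$pop * gap_small$gdpPercap)

# One near-identical line per continent
mean(withGDP[withGDP$continent == "Africa", "gdp"])
mean(withGDP[withGDP$continent == "Asia", "gdp"])
```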
But this isn’t very nice. Yes, by using a function, you have
+reduced a substantial amount of repetition. That is
+nice. But there is still repetition. Repeating yourself will cost you
+time, both now and later, and potentially introduce some nasty bugs.
+
We could write a new function that is flexible like
+calcGDP, but this also takes a substantial amount of effort
+and testing to get right.
+
The abstract problem we’re encountering here is known as
+“split-apply-combine”:
+
We want to split our data into groups, in this case
+continents, apply some calculations on that group, then
+optionally combine the results together afterwards.
+
The plyr package
+
+
+
For those of you who have used R before, you might be familiar with
+the apply family of functions. While R’s built in functions
+do work, we’re going to introduce you to another method for solving the
+“split-apply-combine” problem. The plyr package provides a set of
+functions that we find more user friendly for solving this problem.
+
We installed this package in an earlier challenge. Let us load it
+now:
+
+
R
+
+
+library("plyr")
+
+
Plyr has functions for operating on lists,
+data.frames and arrays (matrices, or
+n-dimensional vectors). Each function:
+
+
Splits the data into groups.
+
Applies a function to each split in turn.
+
Recombines the output into a single data object.
+
+
The functions are named based on the data structure they expect as
+input, and the data structure you want returned as output: [a]rray,
+[l]ist, or [d]ata.frame. The first letter corresponds to the input data
+structure, the second letter to the output data structure, and then the
+rest of the function is named “ply”.
+
This gives us 9 core functions **ply. There are an additional three
+functions which will only perform the split and apply steps, and not any
+combine step. They’re named by their input data type and represent null
+output by a _ (see table).
+
Note here that plyr’s use of “array” is different to R’s: an array in
+plyr can include a vector or matrix.
+
Each of the xxply functions (daply, ddply,
+llply, laply, …) has the same structure and
+4 key features:
+
+
R
+
+
+xxply(.data, .variables, .fun)
+
+
+
The first letter of the function name gives the input type and the
+second gives the output type.
+
.data - gives the data object to be processed
+
.variables - identifies the splitting variables
+
.fun - gives the function to be called on each piece
+
+
Now we can quickly calculate the mean GDP per continent:
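Based on the walkthrough that follows, the call is a ddply() over continent with an anonymous function. Here is a sketch on a made-up stand-in data frame (with_gdp is an assumption; passing calcGDP(gapminder) as .data instead gives the per-continent table shown here):

```r
library("plyr")

# Made-up values; gdp stands in for the column added by calcGDP()
with_gdp <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  gdp       = c(1000, 4000, 9000, 20000)
)

mean_gdp <- ddply(
  .data = with_gdp,
  .variables = "continent",
  .fun = function(x) mean(x$gdp)
)
mean_gdp
```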
continent V1
+1 Africa 20904782844
+2 Americas 379262350210
+3 Asia 227233738153
+4 Europe 269442085301
+5 Oceania 188187105354
+
+
Let us walk through the previous code:
+
+
The ddply function feeds in a data.frame
+(the function name starts with d) and returns another
+data.frame (the second letter is also a d).
+
The first argument we gave was the data.frame we wanted to operate
+on: in this case the gapminder data. We called calcGDP on
+it first so that it would have the additional gdp column
+added to it.
+
The second argument indicated our split criteria: in this case the
+“continent” column. Note that we gave the name of the column, not the
+values of the column like we had done previously with subsetting. Plyr
+takes care of these implementation details for you.
+
The third argument is the function we want to apply to each grouping
+of the data. We had to define our own short function here: each subset
+of the data gets stored in x, the first argument of our
+function. This is an anonymous function: we haven’t defined it
+elsewhere, and it has no name. It only exists in the scope of our call
+to ddply.
+
+
+
+
+
+
+
Challenge 1
+
+
+
Calculate the average life expectancy per continent. Which has the
+longest? Which has the shortest?
year
+continent 1952 1957 1962 1967 1972
+ Africa 5992294608 7359188796 8784876958 11443994101 15072241974
+ Americas 117738997171 140817061264 169153069442 217867530844 268159178814
+ Asia 34095762661 47267432088 60136869012 84648519224 124385747313
+ Europe 84971341466 109989505140 138984693095 173366641137 218691462733
+ Oceania 54157223944 66826828013 82336453245 105958863585 134112109227
+ year
+continent 1977 1982 1987 1992 1997
+ Africa 18694898732 22040401045 24107264108 26256977719 30023173824
+ Americas 324085389022 363314008350 439447790357 489899820623 582693307146
+ Asia 159802590186 194429049919 241784763369 307100497486 387597655323
+ Europe 255367522034 279484077072 316507473546 342703247405 383606933833
+ Oceania 154707711162 176177151380 209451563998 236319179826 289304255183
+ year
+continent 2002 2007
+ Africa 35303511424 45778570846
+ Americas 661248623419 776723426068
+ Asia 458042336179 627513635079
+ Europe 436448815097 493183311052
+ Oceania 345236880176 403657044512
+
+
You can use these functions in place of for loops (and
+it is usually faster to do so). To replace a for loop, put the code that
+was in the body of the for loop inside an anonymous
+function.
+
+
R
+
+
+d_ply(
+  .data = gapminder,
+  .variables = "continent",
+  .fun = function(x) {
+    meanGDPperCap <- mean(x$gdpPercap)
+    print(paste(
+      "The mean GDP per capita for", unique(x$continent),
+      "is", format(meanGDPperCap, big.mark = ",")
+    ))
+  }
+)
+
+
+
OUTPUT
+
+
[1] "The mean GDP per capita for Africa is 2,193.755"
+[1] "The mean GDP per capita for Americas is 7,136.11"
+[1] "The mean GDP per capita for Asia is 7,902.15"
+[1] "The mean GDP per capita for Europe is 14,469.48"
+[1] "The mean GDP per capita for Oceania is 18,621.61"
+
+
+
+
+
+
+
Tip: printing numbers
+
+
+
The format function can be used to make numeric values
+“pretty” for printing out in messages.
+
+
+
+
+
+
+
+
+
Challenge 2
+
+
+
Calculate the average life expectancy per continent and year. Which
+had the longest and shortest in 2007? Which had the greatest change in
+between 1952 and 2007?
How can I manipulate data frames without repeating myself?
+
+
+
+
+
+
+
+
Objectives
+
+
To be able to use the six main data frame manipulation ‘verbs’ with
+pipes in dplyr.
+
To understand how group_by() and
+summarize() can be combined to summarize datasets.
+
Be able to analyze a subset of data using logical filtering.
+
+
+
+
+
+
+
Manipulation of data frames means many things to many researchers: we
+often select certain observations (rows) or variables (columns), we
+often group the data by a certain variable(s), or we even calculate
+summary statistics. We can do these operations using the normal base R
+operations:
But this isn’t very nice because there is a fair bit of
+repetition. Repeating yourself will cost you time, both now and later,
+and potentially introduce some nasty bugs.
+
The dplyr package
+
+
+
Luckily, the dplyr
+package provides a number of very useful functions for manipulating data
+frames in a way that will reduce the above repetition, reduce the
+probability of making errors, and probably even save you some typing. As
+an added bonus, you might even find the dplyr grammar
+easier to read.
+
+
+
+
+
+
Tip: Tidyverse
+
+
+
The dplyr package belongs to a broader family of opinionated
+R packages designed for data science called the “Tidyverse”. These
+packages are specifically designed to work harmoniously together. Some
+of these packages will be covered in this course, but you can find
+more complete information here: https://www.tidyverse.org/.
+
+
+
+
Here we’re going to cover 5 of the most commonly used functions as
+well as using pipes (%>%) to combine them.
+
+
select()
+
filter()
+
group_by()
+
summarize()
+
mutate()
+
+
If you have not installed this package earlier, please do
+so:
+
+
R
+
+
+install.packages('dplyr')
+
+
Now let’s load the package:
+
+
R
+
+
+library("dplyr")
+
+
Using select()
+
+
+
If, for example, we wanted to move forward with only a few of the
+variables in our data frame we could use the select()
+function. This will keep only the variables you select.
If we open up year_country_gdp we’ll see that it only
+contains the year, country and gdpPercap. Above we used ‘normal’
+grammar, but the strengths of dplyr lie in combining
+several functions using pipes. Since the pipes grammar is unlike
+anything we’ve seen in R before, let’s repeat what we’ve done above
+using pipes.
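A sketch of both forms on a small made-up data frame (gap_small and its values are assumptions; the lesson applies the same calls to gapminder):

```r
library("dplyr")

# Made-up values for illustration only
gap_small <- data.frame(
  country   = c("Kenya", "Chile", "Japan"),
  continent = c("Africa", "Americas", "Asia"),
  year      = c(2007, 2007, 2007),
  gdpPercap = c(10, 20, 30)
)

# 'Normal' grammar: the data frame is the first argument
year_country_gdp <- select(gap_small, year, country, gdpPercap)

# Pipe grammar: the data frame is passed in from the left
year_country_gdp <- gap_small %>% select(year, country, gdpPercap)

names(year_country_gdp)
```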
To help you understand why we wrote that in that way, let’s walk
+through it step by step. First we summon the gapminder data frame and
+pass it on, using the pipe symbol %>%, to the next step,
+which is the select() function. In this case we don’t
+specify which data object we use in the select() function
+since it gets that from the previous pipe.
+
Fun Fact: There is a good chance you have encountered pipes before in the shell.
+In R, a pipe symbol is %>% while in the shell it is
+| but the concept is the same!
+
+
+
+
+
+
Tip: Renaming data frame columns in dplyr
+
+
+
In Chapter 4 we covered how you can rename columns with base R by
+assigning a value to the output of the names() function.
+Just like select, this is a bit cumbersome, but thankfully dplyr has a
+rename() function.
+
Within a pipeline, the syntax is
+rename(new_name = old_name). For example, we may want to
+rename the gdpPercap column name from our select()
+statement above.
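A sketch of that renaming step on a made-up data frame (gap_small, tidy_gdp and the new column name gdp_per_capita are illustrative assumptions):

```r
library("dplyr")

# Made-up values for illustration only
gap_small <- data.frame(
  country   = c("Kenya", "Chile", "Japan"),
  year      = c(2007, 2007, 2007),
  gdpPercap = c(10, 20, 30)
)

# rename() takes new_name = old_name
tidy_gdp <- gap_small %>%
  select(year, country, gdpPercap) %>%
  rename(gdp_per_capita = gdpPercap)

names(tidy_gdp)
```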
Write a single command (which can span multiple lines and includes
+pipes) that will produce a data frame that has the African values for
+lifeExp, country and year, but
+not for other Continents. How many rows does your data frame have and
+why?
As with last time, first we pass the gapminder data frame to the
+filter() function, then we pass the filtered version of the
+gapminder data frame to the select() function.
+Note: The order of operations is very important in this
+case. If we used ‘select’ first, filter would not be able to find the
+variable continent since we would have removed it in the previous
+step.
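The filter-then-select order described above can be sketched on a made-up data frame (gap_small and the result name are assumptions):

```r
library("dplyr")

# Made-up values for illustration only
gap_small <- data.frame(
  country   = c("Kenya", "Chile", "France", "Italy"),
  continent = c("Africa", "Americas", "Europe", "Europe"),
  year      = c(2007, 2007, 2007, 2007),
  gdpPercap = c(10, 20, 30, 40)
)

# filter() must come first: it needs the continent column,
# which select() then drops
year_country_gdp_euro <- gap_small %>%
  filter(continent == "Europe") %>%
  select(year, country, gdpPercap)

year_country_gdp_euro
```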
+
Using group_by()
+
+
+
Now, we were supposed to be reducing the error-prone repetitiveness
+of what can be done with base R, but up to now we haven’t done that
+since we would have to repeat the above for each continent. Instead of
+filter(), which will only pass observations that meet your
+criteria (in the above: continent=="Europe"), we can use
+group_by(), which will essentially use every unique
+criterion that you could have used in filter().
+
+
R
+
+
+str(gapminder)
+
+
+
OUTPUT
+
+
'data.frame': 1704 obs. of 6 variables:
+ $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp : num 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num 779 821 853 836 740 ...
You will notice that the structure of the data frame where we used
+group_by() (grouped_df) is not the same as the
+original gapminder (data.frame). A
+grouped_df can be thought of as a list where
+each item in the list is a data.frame which
+contains only the rows that correspond to a particular value of
+continent (at least in the example above).
+
Using summarize()
+
+
+
The above was a bit on the uneventful side but
+group_by() is much more exciting in conjunction with
+summarize(). This will allow us to create new variable(s)
+by using functions that repeat for each of the continent-specific data
+frames. That is to say, using the group_by() function, we
+split our original data frame into multiple pieces, then we can run
+functions (e.g. mean() or sd()) within
+summarize().
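The split-then-summarize idea can be sketched as follows. The data frame `gap_toy` and its numbers are made up for illustration; with the real gapminder data the same pattern would return one row per continent.

```r
library(dplyr)

# Toy stand-in for gapminder (values are illustrative only)
gap_toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  gdpPercap = c(1000, 3000, 20000, 30000)
)

# One output row per group; each summary function runs within a group
gdp_by_continent <- gap_toy %>%
  group_by(continent) %>%
  summarize(mean_gdpPercap = mean(gdpPercap),
            sd_gdpPercap   = sd(gdpPercap))

gdp_by_continent
```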
# A tibble: 2 × 2
+ country mean_lifeExp
+ <chr> <dbl>
+1 Iceland 76.5
+2 Sierra Leone 36.8
+
+
Another way to do this is to use the dplyr function
+arrange(), which arranges the rows in a data frame
+according to the order of one or more variables from the data frame. It
+has similar syntax to other functions from the dplyr
+package. You can use desc() inside arrange()
+to sort in descending order.
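As a short sketch of arrange() and desc(), assuming a made-up summary table `lifeExp_toy` with hypothetical country names:

```r
library(dplyr)

# Hypothetical per-country summary (names and values are illustrative only)
lifeExp_toy <- data.frame(
  country      = c("Country A", "Country B", "Country C"),
  mean_lifeExp = c(60.2, 45.1, 77.9)
)

# Ascending order is the default
shortest_first <- lifeExp_toy %>% arrange(mean_lifeExp)

# Wrap the column in desc() to reverse the order
longest_first <- lifeExp_toy %>% arrange(desc(mean_lifeExp))
```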
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
count() and n()
+
+
+
A very common operation is to count the number of observations for
+each group. The dplyr package comes with two related
+functions that help with this.
+
For instance, if we wanted to check the number of countries included
+in the dataset for the year 2002, we can use the count()
+function. It takes the name of one or more columns that contain the
+groups we are interested in, and we can optionally sort the results in
+descending order by adding sort=TRUE:
continent n
+1 Africa 52
+2 Asia 33
+3 Europe 30
+4 Americas 25
+5 Oceania 2
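The same pattern on a small made-up data frame (`gap_toy` is an assumption for illustration, so its counts differ from the real output above) might look like this:

```r
library(dplyr)

# Toy stand-in for gapminder (values are illustrative only)
gap_toy <- data.frame(
  country   = c("Kenya", "Ghana", "Mali", "Japan", "India", "France",
                "Kenya", "Japan"),
  continent = c("Africa", "Africa", "Africa", "Asia", "Asia", "Europe",
                "Africa", "Asia"),
  year      = c(2002, 2002, 2002, 2002, 2002, 2002, 2007, 2007)
)

# Keep one year, then count rows per continent, largest count first
countries_2002 <- gap_toy %>%
  filter(year == 2002) %>%
  count(continent, sort = TRUE)

countries_2002
```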
+
+
If we need to use the number of observations in calculations, the
+n() function is useful. It will return the total number of
+observations in the current group rather than counting the number of
+observations in each group within a specific column. For instance, if we
+wanted to get the standard error of the life expectancy per
+continent:
# A tibble: 5 × 2
+ continent se_le
+ <chr> <dbl>
+1 Africa 0.366
+2 Americas 0.540
+3 Asia 0.596
+4 Europe 0.286
+5 Oceania 0.775
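A minimal sketch of n() inside summarize(), using a made-up data frame `gap_toy` (so the numbers differ from the output above). The standard error is computed as the standard deviation divided by the square root of the group size:

```r
library(dplyr)

# Toy stand-in for gapminder (values are illustrative only)
gap_toy <- data.frame(
  continent = c("Africa", "Africa", "Africa", "Africa", "Asia", "Asia"),
  lifeExp   = c(50, 52, 54, 56, 70, 74)
)

# n() returns the number of rows in the current group
se_by_continent <- gap_toy %>%
  group_by(continent) %>%
  summarize(n_obs = n(),
            se_le = sd(lifeExp) / sqrt(n()))

se_by_continent
```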
+
+
You can also chain together several summary operations; in this case
+calculating the minimum, maximum,
+mean and se of each continent’s per-country
+life-expectancy:
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
Connect mutate with logical filtering: ifelse
+
+
+
When creating new variables, we can combine this with a logical
+condition. A simple combination of mutate() and
+ifelse() applies the filter right where it is needed: at
+the moment of creating something new. This easy-to-read statement is a
+fast and powerful way of discarding certain data (even though the
+overall dimensions of the data frame will not change) or of updating
+values depending on the given condition.
+
+
R
+
+
+## keeping all data but "filtering" after a certain condition
+# calculate GDP only for observations with a life expectancy above 25
+gdp_pop_bycontinents_byyear_above25 <- gapminder %>%
+  mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>%
+  group_by(continent, year) %>%
+  summarize(mean_gdpPercap = mean(gdpPercap),
+            sd_gdpPercap = sd(gdpPercap),
+            mean_pop = mean(pop),
+            sd_pop = sd(pop),
+            mean_gdp_billion = mean(gdp_billion),
+            sd_gdp_billion = sd(gdp_billion))
+
+
+
OUTPUT
+
+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
+
R
+
+
+## updating only if a certain condition is fulfilled
+# for life expectancies above 40 years, the GDP expected in the future is scaled
+gdp_future_bycontinents_byyear_high_lifeExp <- gapminder %>%
+  mutate(gdp_futureExpectation = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>%
+  group_by(continent, year) %>%
+  summarize(mean_gdpPercap = mean(gdpPercap),
+            mean_gdpPercap_expected = mean(gdp_futureExpectation))
+
+
+
OUTPUT
+
+
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
Combining dplyr and ggplot2
+
+
+
First install and load ggplot2:
+
+
R
+
+
+install.packages('ggplot2')
+
+
+
R
+
+
+library("ggplot2")
+
+
In the plotting lesson we looked at how to make a multi-panel figure
+by adding a layer of facet panels using ggplot2. Here is
+the code we used (with some extra comments):
+
+
R
+
+
+# Filter countries located in the Americas
+americas <- gapminder[gapminder$continent == "Americas", ]
+# Make the plot
+ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap(~ country) +
+  theme(axis.text.x = element_text(angle = 45))
+
+
This code makes the right plot but it also creates an intermediate
+variable (americas) that we might not have any other uses
+for. Just as we used %>% to pipe data along a chain of
+dplyr functions we can use it to pass data to
+ggplot(). Because %>% replaces the first
+argument in a function we don’t need to specify the data =
+argument in the ggplot() function. By combining
+dplyr and ggplot2 functions we can make the
+same figure without creating any new variables or modifying the
+data.
+
+
R
+
+
+gapminder %>%
+  # Filter countries located in the Americas
+  filter(continent == "Americas") %>%
+  # Make the plot
+  ggplot(mapping = aes(x = year, y = lifeExp)) +
+  geom_line() +
+  facet_wrap(~ country) +
+  theme(axis.text.x = element_text(angle = 45))
+
+
More examples of using the function mutate() and the
+ggplot2 package.
+
+
R
+
+
+gapminder %>%
+  # extract first letter of country name into new column
+  mutate(startsWith = substr(country, 1, 1)) %>%
+  # only keep countries starting with A or Z
+  filter(startsWith %in% c("A", "Z")) %>%
+  # plot lifeExp into facets
+  ggplot(aes(x = year, y = lifeExp, colour = continent)) +
+  geom_line() +
+  facet_wrap(vars(country)) +
+  theme_minimal()
+
+
+
+
+
+
+
Advanced Challenge
+
+
+
Calculate the average life expectancy in 2002 of 2 randomly selected
+countries for each continent. Then arrange the continent names in
+reverse order. Hint: Use the dplyr
+functions arrange() and sample_n(); they have
+similar syntax to other dplyr functions.
To understand the concepts of ‘longer’ and ‘wider’ data frame
+formats and be able to convert between them with
+tidyr.
+
+
+
+
+
+
+
Researchers often want to reshape their data frames from ‘wide’ to
+‘longer’ layouts, or vice-versa. The ‘long’ layout or format is
+where:
+
+
each column is a variable
+
each row is an observation
+
+
In the purely ‘long’ (or ‘longest’) format, you usually have 1 column
+for the observed variable and the other columns are ID variables.
+
For the ‘wide’ format each row is often a site/subject/patient and
+you have multiple observation variables containing the same type of
+data. These can be either repeated observations over time, or
+observation of multiple variables (or a mix of both). You may find that data
+input is simpler in the ‘wide’ format, or that some other applications prefer
+it. However, many of R’s functions have been designed
+assuming you have ‘longer’ formatted data. This tutorial will help you
+efficiently transform your data shape regardless of original format.
+
Long and wide data frame layouts mainly affect readability. For
+humans, the wide format is often more intuitive since we can often see
+more of the data on the screen due to its shape. However, the long
+format is more machine readable and is closer to the formatting of
+databases. The ID variables in our data frames are similar to the fields
+in a database and observed variables are like the database values.
+
Getting started
+
+
+
First install the packages if you haven’t already done so (you
+probably installed dplyr in the previous lesson):
First, let’s look at the structure of our original gapminder data
+frame:
+
+
R
+
+
+str(gapminder)
+
+
+
OUTPUT
+
+
'data.frame': 1704 obs. of 6 variables:
+ $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
+ $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp : num 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num 779 821 853 836 740 ...
+
+
+
+
+
+
+
Challenge 1
+
+
+
Is gapminder a purely long, purely wide, or some intermediate
+format?
+
+
+
+
+
+
+
+
+
The original gapminder data.frame is in an intermediate format. It is
+not purely long since it has multiple observation variables
+(pop, lifeExp, gdpPercap).
+
+
+
+
+
Sometimes, as with the gapminder dataset, we have multiple types of
+observed data. It is somewhere in between the purely ‘long’ and ‘wide’
+data formats. We have 3 “ID variables” (continent,
+country, year) and 3 “Observation variables”
+(pop, lifeExp, gdpPercap). This
+intermediate format can be preferred despite not having ALL observations
+in 1 column given that all 3 observation variables have different units.
+There are few operations that would need us to make this data frame any
+longer (i.e. 4 ID variables and 1 Observation variable).
+
While using many of the functions in R, which are often vector based,
+you usually do not want to do mathematical operations on values with
+different units. For example, using the purely long format, a single
+mean for all of the values of population, life expectancy, and GDP would
+not be meaningful since it would return the mean of values with 3
+incompatible units. The solution is that we first manipulate the data
+either by grouping (see the lesson on dplyr), or we change
+the structure of the data frame. Note: Some plotting
+functions in R actually work better with wide format data.
+
From wide to long format with pivot_longer()
+
+
+
Until now, we’ve been using the nicely formatted original gapminder
+dataset, but ‘real’ data (i.e. our own research data) will never be so
+well organized. Here let’s start with the wide formatted version of the
+gapminder dataset.
+
+
Download the wide version of the gapminder data from here and save it in your data
+folder.
+
+
We’ll load the data file and look at it. Note: we don’t want our
+continent and country columns to be factors, so we use the
+stringsAsFactors argument for read.csv() to disable
+that.
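A self-contained sketch of that argument, assuming a tiny made-up CSV excerpt passed through the `text` argument instead of a downloaded file (the column names follow the wide layout described here, the values are illustrative only):

```r
# Hypothetical two-row excerpt standing in for the wide CSV file
csv_text <- "continent,country,pop_1952,pop_1957
Africa,Country A,1000000,1200000
Asia,Country B,5000000,5500000"

# stringsAsFactors = FALSE keeps text columns as character vectors
gap_wide_toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(gap_wide_toy)
```

In R 4.0.0 and later `stringsAsFactors = FALSE` is already the default, so the argument mainly matters when the code may run on older versions.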
To change this very wide data frame layout back to our nice,
+intermediate (or longer) layout, we will use one of the two available
+pivot functions from the tidyr package. To
+convert from wide to a longer format, we will use the
+pivot_longer() function. pivot_longer() makes
+datasets longer by increasing the number of rows and decreasing the
+number of columns, or ‘lengthening’ your observation variables into a
+single variable.
Here we have used piping syntax which is similar to what we were
+doing in the previous lesson with dplyr. In fact, these are compatible
+and you can use a mix of tidyr and dplyr functions by piping them
+together.
+
We first provide to pivot_longer() a vector of column
+names that will be pivoted into longer format. We could type out all the
+observation variables, but as in the select() function (see
+dplyr lesson), we can use the starts_with()
+helper to select all variables that start with the desired character
+string. pivot_longer() also allows the alternative syntax
+of using the - symbol to identify which variables are not
+to be pivoted (i.e. ID variables).
+
The next arguments to pivot_longer() are
+names_to for naming the column that will contain the new ID
+variable (obstype_year) and values_to for
+naming the new amalgamated observation variable
+(obs_value). We supply these new column names as
+strings.
That may seem trivial with this particular data frame, but sometimes
+you have 1 ID variable and 40 observation variables with irregular
+variable names. The flexibility is a huge time saver!
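Both ways of choosing columns can be sketched on a small made-up wide data frame. `gap_wide_toy` and its values are assumptions for illustration; the column naming pattern (metric, underscore, year) follows the description above.

```r
library(dplyr)
library(tidyr)

# Toy wide-format data (values are illustrative only)
gap_wide_toy <- data.frame(
  continent    = c("Africa", "Asia"),
  country      = c("Kenya", "Japan"),
  pop_2002     = c(31e6, 127e6),
  pop_2007     = c(35e6, 127e6),
  lifeExp_2002 = c(51.0, 82.0),
  lifeExp_2007 = c(54.1, 82.6)
)

# Name the columns to pivot with starts_with() ...
gap_long_toy <- gap_wide_toy %>%
  pivot_longer(
    cols      = c(starts_with("pop"), starts_with("lifeExp")),
    names_to  = "obstype_year",
    values_to = "obs_value"
  )

# ... or name the ID columns that should NOT be pivoted
gap_long_alt <- gap_wide_toy %>%
  pivot_longer(
    cols      = c(-continent, -country),
    names_to  = "obstype_year",
    values_to = "obs_value"
  )
```

Each of the 2 rows contributes 4 observation columns, so both results have 8 rows and 4 columns.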
+
Now obstype_year actually contains 2 pieces of
+information, the observation type
+(pop, lifeExp, or gdpPercap) and
+the year. We can use the separate() function
+to split the character strings into multiple variables:
+
+
R
+
+
+gap_long <- gap_long %>%
+  separate(obstype_year, into = c('obs_type', 'year'), sep = "_")
+gap_long$year <- as.integer(gap_long$year)
+
+
+
+
+
+
+
Challenge 2
+
+
+
Using gap_long, calculate the mean life expectancy,
+population, and gdpPercap for each continent. Hint: use
+the group_by() and summarize() functions we
+learned in the dplyr lesson.
`summarise()` has grouped output by 'continent'. You can override using the
+`.groups` argument.
+
+
+
OUTPUT
+
+
# A tibble: 15 × 3
+# Groups: continent [5]
+ continent obs_type means
+ <chr> <chr> <dbl>
+ 1 Africa gdpPercap 2194.
+ 2 Africa lifeExp 48.9
+ 3 Africa pop 9916003.
+ 4 Americas gdpPercap 7136.
+ 5 Americas lifeExp 64.7
+ 6 Americas pop 24504795.
+ 7 Asia gdpPercap 7902.
+ 8 Asia lifeExp 60.1
+ 9 Asia pop 77038722.
+10 Europe gdpPercap 14469.
+11 Europe lifeExp 71.9
+12 Europe pop 17169765.
+13 Oceania gdpPercap 18622.
+14 Oceania lifeExp 74.3
+15 Oceania pop 8874672.
+
+
+
+
+
+
From long to intermediate format with pivot_wider()
+
+
+
It is always good to check your work. So, let’s use the second
+pivot function, pivot_wider(), to ‘widen’ our
+observation variables back out. pivot_wider() is the
+opposite of pivot_longer(), making a dataset wider by
+increasing the number of columns and decreasing the number of rows. We
+can use pivot_wider() to pivot or reshape our
+gap_long to the original intermediate format or the widest
+format. Let’s start with the intermediate format.
+
The pivot_wider() function takes names_from
+and values_from arguments.
+
To names_from we supply the column name whose contents
+will be pivoted into new output columns in the widened data frame. The
+corresponding values will be added from the column named in the
+values_from argument.
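A minimal sketch of those two arguments, assuming a small made-up long data frame `gap_long_toy` whose `obs_type` and `obs_value` columns mirror the ones described above:

```r
library(dplyr)
library(tidyr)

# Toy long-format data (values are illustrative only)
gap_long_toy <- data.frame(
  country   = c("Kenya", "Kenya", "Japan", "Japan"),
  year      = c(2007, 2007, 2007, 2007),
  obs_type  = c("pop", "lifeExp", "pop", "lifeExp"),
  obs_value = c(35e6, 54.1, 127e6, 82.6)
)

# Each distinct obs_type becomes a column, filled from obs_value
gap_normal_toy <- gap_long_toy %>%
  pivot_wider(names_from = obs_type, values_from = obs_value)

gap_normal_toy
```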
Now we’ve got an intermediate data frame gap_normal with
+the same dimensions as the original gapminder, but the
+order of the variables is different. Let’s fix that before checking if
+they are all.equal().
That’s great! We’ve gone from the longest format back to the
+intermediate and we didn’t introduce any errors in our code.
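The kind of check described here can be sketched with two made-up data frames that hold the same values in a different column and row order. The names `original` and `round_trip` are hypothetical and the values are illustrative only.

```r
library(dplyr)

original <- data.frame(
  country = c("Japan", "Kenya"),
  year    = c(2007L, 2007L),
  pop     = c(127e6, 35e6),
  lifeExp = c(82.6, 54.1)
)

# Same values, but with columns and rows in a different order,
# as can happen after reshaping a data frame and back again
round_trip <- data.frame(
  lifeExp = c(54.1, 82.6),
  pop     = c(35e6, 127e6),
  country = c("Kenya", "Japan"),
  year    = c(2007L, 2007L)
)

# Restore the original column order, then sort the rows
round_trip <- round_trip %>%
  select(all_of(names(original))) %>%
  arrange(country, year)

all.equal(round_trip, original)
```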
+
Now let’s convert the long all the way back to the wide. In the wide
+format, we will keep country and continent as ID variables and pivot the
+observations across the 3 metrics
+(pop, lifeExp, gdpPercap) and time
+(year). First we need to create appropriate labels for all
+our new variables (time*metric combinations) and we also need to unify
+our ID variables to simplify the process of defining
+gap_wide.
Using unite() we now have a single ID variable which is
+a combination of continent and country, and we have
+defined variable names. We’re now ready to pipe in
+pivot_wider().
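The whole sequence can be sketched on a small made-up long data frame. The column names `var_names` and `ID_var` are hypothetical labels chosen for this illustration, and the values are illustrative only.

```r
library(dplyr)
library(tidyr)

# Toy long-format data (values are illustrative only)
gap_long_toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia"),
  country   = c("Kenya", "Kenya", "Japan", "Japan"),
  obs_type  = c("pop", "lifeExp", "pop", "lifeExp"),
  year      = c(2007, 2007, 2007, 2007),
  obs_value = c(35e6, 54.1, 127e6, 82.6)
)

gap_wide_toy <- gap_long_toy %>%
  # Build the new column labels, e.g. "pop_2007"
  unite(var_names, obs_type, year, sep = "_") %>%
  # Combine the two ID variables into a single one
  unite(ID_var, continent, country, sep = "_") %>%
  # Spread the labelled observations across columns
  pivot_wider(names_from = var_names, values_from = obs_value)

gap_wide_toy
```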
Take this 1 step further and create a
+gap_ludicrously_wide format data frame by pivoting over
+countries, year and the 3 metrics. Hint: this new data
+frame should only have 5 rows.
Understand the value of writing reproducible reports
+
Learn how to recognise and compile the basic components of an R
+Markdown file
+
Become familiar with R code chunks, and understand their purpose,
+structure and options
+
Demonstrate the use of inline chunks for weaving R outputs into text
+blocks, for example when discussing the results of some
+calculations
+
Be aware of alternative output formats to which an R Markdown file
+can be exported
+
+
+
+
+
+
+
Data analysis reports
+
+
+
Data analysts tend to write a lot of reports, describing their
+analyses and results, for their collaborators or to document their work
+for future reference.
+
Many new users begin by first writing a single R script containing
+all of their work, and then share the analysis by emailing the script
+and various graphs as attachments. But this can be cumbersome, requiring
+a lengthy discussion to explain which attachment was which result.
+
Writing formal reports with Word or LaTeX can simplify this
+process by incorporating both the analysis report and output graphs into
+a single document. But tweaking formatting to make figures look correct
+and fixing obnoxious page breaks can be tedious and lead to a lengthy
+“whack-a-mole” game of fixing new mistakes resulting from a single
+formatting change.
+
Creating a report as a web page (which is an html file) using R
+Markdown makes things easier. The report can be one long stream, so tall
+figures that wouldn’t ordinarily fit on one page can be kept at full
+size and easier to read, since the reader can simply keep scrolling.
+Additionally, the formatting of an R Markdown document is simple and
+easy to modify, allowing you to spend more time on your analyses instead
+of writing reports.
+
Literate programming
+
+
+
Ideally, such analysis reports are reproducible documents:
+If an error is discovered, or if some additional subjects are added to
+the data, you can just re-compile the report and get the new or
+corrected results rather than having to reconstruct figures, paste them
+into a Word document, and hand-edit various detailed results.
+
The key R package here is knitr. It allows you
+to create a document that is a mixture of text and chunks of code. When
+the document is processed by knitr, chunks of code will be
+executed, and graphs or other results will be inserted into the final
+document.
+
This sort of idea has been called “literate programming”.
+
knitr allows you to mix basically any type of text with
+code from different programming languages, but we recommend that you use
+R Markdown, which mixes Markdown with R. Markdown is a light-weight
+mark-up language for creating web pages.
+
Creating an R Markdown file
+
+
+
Within RStudio, click File → New File → R Markdown and you’ll get a
+dialog box like this:
+
You can stick with the default (HTML output), but give it a
+title.
+
Basic components of R Markdown
+
+
+
The initial chunk of text (header) contains instructions for R to
+specify what kind of document will be created, and the options chosen.
+You can use the header to give your document a title, author, date, and
+tell it what type of output you want to produce. In this case, we’re
+creating an html document.
You can delete any of those fields if you don’t want them included.
+The double-quotes aren’t strictly necessary in this case.
+They’re mostly needed if you want to include a colon in the title.
+
RStudio creates the document with some example text to get you
+started. Note below that there are chunks like
+
+```{r}
+summary(cars)
+```
+
+
These are chunks of R code that will be executed by
+knitr and replaced by their results. More on this
+later.
+
Markdown
+
+
+
Markdown is a system for writing web pages by marking up the text
+much as you would in an email rather than writing html code. The
+marked-up text gets converted to html, replacing the marks with
+the proper html code.
+
For now, let’s delete all of the stuff that’s there and write a bit
+of markdown.
+
You make things bold using two asterisks, like this:
+**bold**, and you make things italics by using
+underscores, like this: _italics_.
+
You can make a bulleted list by writing a list with hyphens or
+asterisks with a space between the list and other text, like this:
+
A list:
+
+* bold with double-asterisks
+* italics with underscores
+* code-type font with backticks
+
or like this:
+
A second list:
+
+- bold with double-asterisks
+- italics with underscores
+- code-type font with backticks
+
Each will appear as:
+
+
bold with double-asterisks
+
italics with underscores
+
code-type font with backticks
+
+
You can use whatever method you prefer, but be consistent.
+This maintains the readability of your code.
+
You can make a numbered list by just using numbers. You can even use
+the same number over and over if you want:
+
1. bold with double-asterisks
+1. italics with underscores
+1. code-type font with backticks
+
This will appear as:
+
+
bold with double-asterisks
+
italics with underscores
+
code-type font with backticks
+
+
You can make section headers of different sizes by initiating a line
+with some number of # symbols:
+
# Title
+## Main section
+### Sub-section
+#### Sub-sub section
+
You compile the R Markdown document to an html webpage by
+clicking the “Knit” button in the upper-left.
+
+
+
+
+
+
Challenge 1
+
+
+
Create a new R Markdown document. Delete all of the R code chunks and
+write a bit of Markdown (some sections, some italicized text, and an
+itemized list).
+
Convert the document to a webpage.
+
+
+
+
+
+
+
+
+
In RStudio, select File > New file > R Markdown…
+
Delete the placeholder text and add the following:
+
# Introduction
+
+## Background on Data
+
+This report uses the *gapminder* dataset, which has columns that include:
+
+* country
+* continent
+* year
+* lifeExp
+* pop
+* gdpPercap
+
+## Background on Methods
+
+
Then click the ‘Knit’ button on the toolbar to generate an html
+document (webpage).
+
+
+
+
+
A bit more Markdown
+
+
+
You can make a hyperlink like this:
+[Carpentries Home Page](https://carpentries.org/).
+
You can include an image file like this:
+![The Carpentries Logo](https://carpentries.org/assets/img/TheCarpentries.svg)
+
You can do subscripts (e.g., F₂) with F~2~
+and superscripts (e.g., F²) with F^2^.
+
If you know how to write equations in LaTeX, you can use
+$ $ and $$ $$ to insert math equations, like
+$E = mc^2$ and
+
$$y = \mu + \sum_{i=1}^p \beta_i x_i + \epsilon$$
+
You can review Markdown syntax by navigating to the “Markdown Quick
+Reference” under the “Help” field in the toolbar at the top of
+RStudio.
+
R code chunks
+
+
+
The real power of Markdown comes from mixing markdown with chunks of
+code. This is R Markdown. When processed, the R code will be executed;
+if it produces figures, the figures will be inserted in the final
+document.
+
The main code chunks look like this:
+
+```{r load_data}
+gapminder
+```
+
That is, you place a chunk of R code between ```{r
+chunk_name} and ```. You should give each chunk a
+unique name, as the names will help you to fix errors and, if any graphs are
+produced, the file names are based on the name of the code chunk that
+produced them. You can create code chunks quickly in RStudio using the
+shortcuts Ctrl+Alt+I on Windows and
+Linux, or Cmd+Option+I on Mac.
+
+
+
+
+
+
Challenge 2
+
+
+
Add code chunks to:
+
+
Load the ggplot2 package
+
Read the gapminder data
+
Create a plot
+
+
+
+
+
+
+
+
+
+
+```{r load-ggplot2}
+library("ggplot2")
+```
+
+
+```{r read-gapminder-data}
+gapminder
+```
+
+```{r make-plot}
+plot(lifeExp ~ year, data = gapminder)
+```
+
+
+
+
+
+
+
How things get compiled
+
+
+
When you press the “Knit” button, the R Markdown document is
+processed by knitr
+and a plain Markdown document is produced (as well as, potentially, a
+set of figure files): the R code is executed and replaced by both the
+input and the output; if figures are produced, links to those figures
+are included.
+
The Markdown and figure documents are then processed by the tool pandoc, which converts the
+Markdown file into an html file, with the figures embedded.
+
Chunk options
+
+
+
There are a variety of options to affect how the code chunks are
+treated. Here are some examples:
+
+
Use echo=FALSE to avoid having the code itself
+shown.
+
Use results="hide" to avoid having any results
+printed.
+
Use eval=FALSE to have the code shown but not
+evaluated.
+
Use warning=FALSE and message=FALSE to
+hide any warnings or messages produced.
+
Use fig.height and fig.width to control
+the size of the figures produced (in inches).
The fig.path option defines where the figures will be
+saved. The / here is really important; without it, the
+figures would be saved in the standard place but just with names that
+begin with Figs.
+
If you have multiple R Markdown files in a common directory, you
+might want to use fig.path to define separate prefixes for
+the figure file names, like fig.path="Figs/cleaning-" and
+fig.path="Figs/analysis-".
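One way to set such options for every chunk at once is knitr's `opts_chunk$set()`, typically called in a first chunk of the document. The particular values below are assumptions chosen for illustration, not recommended settings.

```r
library(knitr)

# Set defaults for every chunk that follows in the document
opts_chunk$set(
  fig.path   = "Figs/",  # trailing slash: figures go in a Figs/ subdirectory
  fig.width  = 6,        # inches
  fig.height = 4,        # inches
  echo       = FALSE,    # hide the code itself
  warning    = FALSE,    # hide warnings
  message    = FALSE     # hide messages
)

# Inspect a current default
opts_chunk$get("fig.path")
```

Options written in an individual chunk header still override these defaults for that chunk.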
+
+
+
+
+
+
Challenge 3
+
+
+
Use chunk options to control the size of a figure and to hide the
+code.
You can review all of the R chunk options by navigating
+to the “R Markdown Cheat Sheet” under the “Cheatsheets” section of the
+“Help” field in the toolbar at the top of RStudio.
+
Inline R code
+
+
+
You can make every number in your report reproducible. Use
+`r and ` for an in-line code chunk, like so:
+`r round(some_value, 2)`. The code will be executed and
+replaced with the value of the result.
+
Don’t let these in-line chunks get split across lines.
+
Perhaps precede the paragraph with a larger code chunk that does
+calculations and defines variables, with include=FALSE for
+that larger chunk (which is the same as echo=FALSE and
+results="hide").
+
Rounding can produce differences in output in such situations. You
+may want 2.0, but round(2.03, 1) will give
+just 2.
+
The myround
+function in the R/broman
+package handles this.
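If installing an extra package is not an option, base R can produce the same fixed number of decimals. This is a small sketch of two base R alternatives, not a description of how myround is implemented; note that both return character strings rather than numbers.

```r
x <- 2.03

rounded   <- round(x, 1)                      # numeric; prints without the trailing zero
as_text   <- sprintf("%.1f", x)               # character string with one decimal place
formatted <- format(round(x, 1), nsmall = 1)  # character string with at least one decimal

rounded
as_text
formatted
```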
+
+
+
+
+
+
Challenge 4
+
+
+
Try out a bit of in-line R code.
+
+
+
+
+
+
+
+
+
Here’s some inline code to determine that 2 + 2 = 4.
+
+
+
+
+
Other output options
+
+
+
You can also convert R Markdown to a PDF or a Word document. Click
+the little triangle next to the “Knit” button to get a drop-down menu.
+Or you could put pdf_document or word_document
+in the initial header of the file.
+
+
+
+
+
+
Tip: Creating PDF documents
+
+
+
Creating .pdf documents may require installation of some extra
+software. The R package tinytex provides some tools to help
+make this process easier for R users. With tinytex
+installed, run tinytex::install_tinytex() to install the
+required software (you’ll only need to do this once) and then when you
+knit to pdf tinytex will automatically detect and install
+any additional LaTeX packages that are needed to produce the pdf
+document. Visit the tinytex
+website for more information.
+
+
+
+
+
+
+
+
+
Tip: Visual markdown editing in RStudio
+
+
+
RStudio versions 1.4 and later include visual markdown editing mode.
+In visual editing mode, markdown expressions (like
+**bold words**) are transformed to the formatted appearance
+(bold words) as you type. This mode also includes a
+toolbar at the top with basic formatting buttons, similar to what you
+might see in common word processing software programs. You can turn
+visual editing on and off by pressing the button in the top right corner of your
+R Markdown document.
How can I write software that other people can use?
+
+
+
+
+
+
+
+
Objectives
+
+
Describe best practices for writing R and explain the justification
+for each.
+
+
+
+
+
+
+
Structure your project folder
+
+
+
Keep your project folder structured, organized and tidy, by creating
+subfolders for your code files, manuals, data, binaries, output plots,
+etc. It can be done completely manually, or with the help of RStudio’s
+New Project functionality, or a designated package, such as
+ProjectTemplate.
+
+
+
+
+
+
Tip: ProjectTemplate - a possible
+solution
+
+
+
One way to automate the management of projects is to install the
+third-party package, ProjectTemplate. This package will set
+up an ideal directory structure for project management. This is very
+useful as it enables you to have your analysis pipeline/workflow
+organised and structured. Together with the default RStudio project
+functionality and Git you will be able to keep track of your work as
+well as be able to share your work with collaborators.
For more information on ProjectTemplate and its functionality visit
+the home page ProjectTemplate
+
+
+
+
Make code readable
+
+
+
The most important part of writing code is making it readable and
+understandable. You want someone else to be able to pick up your code
+and be able to understand what it does: more often than not this someone
+will be you 6 months down the line, who will otherwise be cursing
+past-self.
+
Documentation: tell us what and why, not how
+
+
+
When you first start out, your comments will often describe what a
+command does, since you’re still learning yourself and it can help to
+clarify concepts and remind you later. However, these comments aren’t
+particularly useful later on when you don’t remember what problem your
+code is trying to solve. Try to also include comments that tell you
+why you’re solving a problem, and what problem that
+is. The how can come after that: it’s an implementation detail
+you ideally shouldn’t have to worry about.
+
Keep your code modular
+
+
+
Our recommendation is that you should separate your functions from
+your analysis scripts, and store them in a separate file that you
+source when you open the R session in your project. This
+approach is nice because it leaves you with an uncluttered analysis
+script, and a repository of useful functions that can be loaded into any
+analysis script in your project. It also lets you group related
+functions together easily.
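The pattern can be sketched in a self-contained way by writing a tiny functions file to a temporary location and sourcing it. The function `gdp_in_billions` is hypothetical, and in a real project the file would live at a fixed path inside the project rather than in a temporary file.

```r
# Write a tiny, hypothetical functions file so the example is self-contained
functions_file <- tempfile(fileext = ".R")
writeLines(
  c("# Convert GDP per capita and population to total GDP in billions",
    "gdp_in_billions <- function(gdpPercap, pop) {",
    "  gdpPercap * pop / 1e9",
    "}"),
  functions_file
)

# In the analysis script: load the functions, then use them
source(functions_file)
total_gdp <- gdp_in_billions(gdpPercap = 2000, pop = 5e6)
total_gdp
```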
+
Break down problem into bite size pieces
+
+
+
When you first start out, problem solving and function writing can be
+daunting tasks, and hard to separate from code inexperience. Try to
+break down your problem into digestible chunks and worry about the
+implementation details later: keep breaking down the problem into
+smaller and smaller functions until you reach a point where you can code
+a solution, and build back up from there.
+
Know that your code is doing the right thing
+
+
+
Make sure to test your functions!
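A lightweight way to start, sketched here with base R only, is to check a function against inputs whose answers are known. The helper `fahr_to_celsius` is a hypothetical example function; dedicated testing packages exist but are not assumed here.

```r
# Hypothetical helper: convert a temperature from Fahrenheit to Celsius
fahr_to_celsius <- function(temp_f) {
  (temp_f - 32) * 5 / 9
}

# Checks with known answers; stopifnot() raises an error if any is not TRUE
stopifnot(
  fahr_to_celsius(32) == 0,
  fahr_to_celsius(212) == 100,
  isTRUE(all.equal(fahr_to_celsius(98.6), 37))
)
```

Keeping such checks in a script means they can be re-run whenever the function changes.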
+
Don’t repeat yourself
+
+
+
Functions enable easy reuse within a project. If you see blocks of
+similar lines of code through your project, those are usually candidates
+for being moved into functions.
+
If your calculations are performed through a series of functions,
+then the project becomes more modular and easier to change. This is
+especially true of functions for which a particular input always gives a
+particular output.
+
Remember to be stylish
+
+
+
Apply consistent style to your code.
+
+
+
+
+
+
Keypoints
+
+
+
+
Keep your project folder structured, organized and tidy.
+
Document what and why, not how.
+
Break programs into short single-purpose functions.
+
Write re-runnable tests.
+
Don’t repeat yourself.
+
Be consistent in naming, indentation, and other aspects of
+style.
Image 1 of 1: ‘Blank plot, before adding any mapping aesthetics to ggplot().’
+
+
Figure 2
+
Image 1 of 1: ‘Plotting area with axes for a scatter plot of life expectancy vs GDP, with no data points visible.’
+
+
Figure 3
+
Image 1 of 1: ‘Scatter plot of life expectancy vs GDP per capita, now showing the data points.’
+
+
Figure 4
+
Image 1 of 1: ‘Binned scatterplot of life expectancy versus year showing how life expectancy has increased over time’
+
+
Figure 5
+
Image 1 of 1: ‘Binned scatterplot of life expectancy vs year with color-coded continents showing value of 'aes' function’
+
+
Figure 6
+
+
Figure 7
+
+
Figure 8
+
+
Figure 9
+
+
Figure 10
+
Image 1 of 1: ‘Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency.’
+
+
Figure 11
+
+
Figure 12
+
Image 1 of 1: ‘Scatterplot of GDP vs life expectancy showing logarithmic x-axis data spread’
+
+
Figure 13
+
Image 1 of 1: ‘Scatter plot of life expectancy vs GDP per capita with a blue trend line summarising the relationship between variables, and gray shaded area indicating 95% confidence intervals for that trend line.’
+
+
Figure 14
+
Image 1 of 1: ‘Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The blue trend line is slightly thicker than in the previous figure.’
+
+
Figure 15
+
Image 1 of 1: ‘Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency.’
Image 1 of 1: ‘Screenshot of the New R Markdown file dialogue box in RStudio’
+
+
Figure 2
+
+
Figure 3
+
+
+
+
+
diff --git a/instructor/index.html b/instructor/index.html
new file mode 100644
index 000000000..bff8efe7c
--- /dev/null
+++ b/instructor/index.html
@@ -0,0 +1,624 @@
+
+R for Reproducible Scientific Analysis: Summary and Schedule
+
Summary and Schedule
+
+
+
an introduction to R for non-programmers using gapminder
+data
+
The goal of this lesson is to teach novice programmers to write
+modular code and to follow best practices for using R for data analysis. R is
+commonly used in many scientific disciplines for statistical analysis
+and is valued for its array of third-party packages. We find that many scientists who
+come to Software Carpentry workshops use R and want to learn more. The
+emphasis of these materials is to give attendees a strong foundation in
+the fundamentals of R, and to teach best practices for scientific
+computing: breaking down analyses into modular units, task automation,
+and encapsulation.
+
Note that this workshop will focus on teaching the fundamentals of
+the programming language R, and will not teach statistical analysis.
+
The lesson contains more material than can be taught in a day. The instructor notes page has some
+suggested lesson plans suitable for a one-day or half-day workshop.
+
A variety of third-party packages are used throughout this workshop.
+These are not necessarily the best, nor are they comprehensive, but they
+are packages we find useful, and have been chosen primarily for their
+usability.
+
+
+
+
+
+
Prerequisites
+
+
+
Understand that computers store data and instructions (programs,
+scripts etc.) in files. Files are organised in directories (folders).
+Know how to access files not in the working directory by specifying the
+path.
+Download
+and install RStudio. RStudio is an application (an integrated
+development environment or IDE) that facilitates the use of R and offers
+a number of nice additional features. You will need the free Desktop
+version for your computer.
+
+
diff --git a/instructor/instructor-notes.html b/instructor/instructor-notes.html
new file mode 100644
index 000000000..ee9c835ad
--- /dev/null
+++ b/instructor/instructor-notes.html
@@ -0,0 +1,641 @@
+
+
+
+
+
+R for Reproducible Scientific Analysis: Instructor Notes
+
Instructor Notes
+
+
+
Timing
+
+
+
Leave about 30 minutes at the start of each workshop and another 15
+minutes at the start of each session for technical difficulties such as WiFi
+problems and software installation (even if you asked students to install in advance;
+allow longer if not).
+
Lesson Plans
+
+
+
The lesson contains much more material than can be taught in a day.
+Instructors will need to pick an appropriate subset of episodes to use
+in a standard one day course.
08 Creating Publication-Quality Graphics with ggplot2 OR 13
+Dataframe Manipulation with dplyr
+
15 Producing Reports With knitr
+
+
A half day course could consist of (suggested by @karawoo):
+
+
01 Introduction to R and RStudio
+
04 Data Structures (only creating vectors with
+c())
+
05 Exploring Data Frames (“Realistic example” section onwards)
+
06 Subsetting Data (excluding factor, matrix and list
+subsetting)
+
08 Creating Publication-Quality Graphics with ggplot2
+
Setting up git in RStudio
+
+
+
There can be difficulties linking git to RStudio depending on the
+operating system and the version of the operating system. To make sure
+Git is properly installed and configured, the learners should go to the
+Options window in the RStudio application.
+
+
+Mac OS X:
+
+
Go to RStudio -> Preferences… -> Git/SVN
+
Check and see whether there is a path to a file in the “Git
+executable” window. If not, the next challenge is figuring out where Git
+is located.
+
In the terminal enter which git and you will get a path
+to the git executable. In the “Git executable” window you may have
+difficulties finding the directory since OS X hides many of the
+operating system files. While the file selection window is open,
+pressing “Command-Shift-G” will pop up a text entry box where you will
+be able to type or paste in the full path to your git executable:
+e.g. /usr/bin/git or whatever else it might be.
+
+
+
+Windows:
+
+
Go to Tools -> Global Options… -> Git/SVN
+
If you use the Software Carpentry Installer, then ‘git.exe’ should
+be installed at C:/Program Files/Git/bin/git.exe.
+
+
+
+
To prevent the learners from having to re-enter their password each
+time they push a commit to GitHub, this command (which can be run from a
+bash prompt) will make it so they only have to enter their password
+once:
The easiest way to get the data used in this lesson during a workshop
+is to have attendees download the raw data from gapminder-data and gapminder-data-wide.
+
Attendees can use the File - Save As dialog in their
+browser to save the file.
+
Overall
+
+
+
Make sure to emphasize good practices: put code in scripts, and make
+sure they’re version controlled. Encourage students to create script
+files for challenges.
+
If you’re working in a cloud environment, get them to upload the
+gapminder data after the second lesson.
+
Make sure to emphasize that matrices are vectors underneath the hood
+and data frames are lists underneath the hood: this will explain a lot
+of the esoteric behaviour encountered in basic operations.
+
Vector recycling and function stacks are probably best explained with
+diagrams on a whiteboard.
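If a whiteboard is not available, vector recycling can also be shown with a short live example. The following is a minimal sketch; the variable names short, long and recycled are made up for illustration.

```r
# Recycling: when two vectors differ in length, R reuses the shorter one.
short <- c(1, 2)
long <- c(10, 20, 30, 40)

# short is reused as 1, 2, 1, 2 to match the length of long.
recycled <- long + short
print(recycled)  # 11 22 31 42
```

R recycles silently when the longer length is a multiple of the shorter one, and issues a warning when it is not.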
+
Be sure to actually go through examples of an R help page: help files
+can be intimidating at first, but knowing how to read them is
+tremendously useful.
+
Be sure to show the CRAN task views and look at one of the topics.
+
There’s a lot of content: move quickly through the earlier lessons.
+They are extensive mostly so that learners absorb the material by osmosis:
+their memory will be triggered later when they encounter a problem or
+some esoteric behaviour.
+
Key lessons to take time on:
+
+
Data subsetting - conceptually difficult for novices
+
Functions - learners especially struggle with this
+
Data structures - worth being thorough, but you can go through it
+quickly.
+
+
Don’t worry about being correct or knowing the material
+back-to-front. Use mistakes as teaching moments: the most vital skill
+you can impart is how to debug and recover from unexpected errors.
Use the escape key to cancel incomplete commands or running code
+(Ctrl+C if you’re using R from the shell).
+
Basic arithmetic operations follow standard order of precedence:
+
Brackets: (, )
+
+
Exponents: ^ or **
+
+
Divide: /
+
+
Multiply: *
+
+
Add: +
+
+
Subtract: -
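The precedence rules listed above can be checked at the console. This is a small sketch with made-up variable names; note that in R, * and / share one precedence level (as do + and -) and are evaluated left to right.

```r
# Parentheses first, then exponents, then * and /, then + and -.
a <- 3 + 5 * 2     # multiplication before addition: 13
b <- (3 + 5) * 2   # parentheses override the default order: 16
c_val <- -2 ^ 2    # the exponent binds tighter than the unary minus: -4
d <- 8 / 4 * 2     # / and * have equal precedence, so left to right: 4
e <- 2 ** 3        # ** is accepted as a synonym for ^: 8
print(c(a, b, c_val, d, e))
```

When the intended order is not obvious, adding parentheses costs nothing and makes the expression easier to read.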
+
+
+
Scientific notation is available, e.g: 2e-3
+
+
Anything to the right of a # is a comment; R will
+ignore it!
+
Functions are denoted by function_name(). Expressions
+inside the brackets are evaluated before being passed to the function,
+and functions can be nested.
+
Mathematical functions: exp, sin,
+log, log10, log2 etc.
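As a brief illustration of nesting, the inner call below is evaluated first and its result is passed to the outer function. The variable names are hypothetical.

```r
# sqrt(10000) is evaluated first, and its result (100) is passed to log10().
nested <- log10(sqrt(10000))

# exp() and log() are inverses, so this round trip gives back roughly 5.
round_trip <- exp(log(5))

print(c(nested = nested, round_trip = round_trip))
```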
+
Comparison operators: <, <=,
+>, >=, ==,
+!=
+
+
Use all.equal to compare numbers!
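A short sketch of why all.equal is recommended for comparing numbers: floating-point arithmetic can make two values that should be equal differ in their last bits. Wrapping the call in isTRUE() is a common idiom, since all.equal returns a description of the difference rather than FALSE when the values differ.

```r
x <- 0.1 + 0.2
y <- 0.3

exact <- x == y                      # FALSE: the stored values differ slightly
tolerant <- isTRUE(all.equal(x, y))  # TRUE: equal within a small tolerance

print(c(exact = exact, tolerant = tolerant))
```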
+
+<- is the assignment operator. Anything to the right
+is evaluated, then stored in the variable named on the left.
+
+ls lists all variables and functions you’ve
+created
+
+rm can be used to remove them
+
When assigning values to function arguments, you must use
+=.
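The points about assignment, ls, rm and function arguments can be tried together in a few lines. This is only a sketch; the names ratio and log_val are made up for illustration.

```r
ratio <- 1 / 40                 # evaluate 1 / 40, then store it in ratio
log_val <- log(100, base = 10)  # = names an argument inside a function call

before <- "ratio" %in% ls()     # ls() lists the names defined so far
rm(ratio)                       # remove the variable again
after <- "ratio" %in% ls()

print(c(before = before, after = after))  # TRUE, then FALSE
```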
Individual values in R must be one of 5 data types;
+multiple values can be grouped in data structures.
+
Data types
+
typeof(object) gives information about an item’s data
+type.
+
+
There are 5 main data types:
+
+?numeric real (decimal) numbers
+
+?integer whole numbers only
+
+?character text
+
+?complex complex numbers
+
+?logical TRUE or FALSE values
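The five types can be inspected with typeof. In this small sketch the example values are arbitrary; note that typeof reports numeric (decimal) values as "double".

```r
types <- c(
  numeric   = typeof(3.14),         # "double"
  integer   = typeof(2L),           # the L suffix makes an integer
  character = typeof("gapminder"),
  complex   = typeof(1 + 4i),
  logical   = typeof(TRUE)
)
print(types)
```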
+
Special types:
+
+?NA missing values
+
+?NaN “not a number” for undefined values
+(e.g. 0/0).
+
+?Inf, -Inf infinity.
+
+?NULL a data structure that doesn’t exist
+
NA can occur in any atomic vector. NaN and
+Inf can only occur in numeric (double) or complex vectors; integer
+vectors cannot hold them. Atomic vectors are the building blocks for all other data
+structures. A NULL value will occur in place of an entire
+data structure (but can occur as list elements).
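The special values can be produced and tested for directly. The following sketch uses made-up variable names and arbitrary example values.

```r
nan_val <- 0 / 0             # undefined result: NaN
inf_val <- 1 / 0             # division by zero: Inf
with_na <- c("a", NA, "c")   # NA inside a character vector

na_flags <- is.na(with_na)   # FALSE TRUE FALSE

# NULL disappears when combined into an atomic vector ...
dropped <- c(1, NULL, 3)     # length 2
# ... but can be kept as an element of a list.
lst <- list(a = 1, b = NULL) # length 2

print(c(is.nan(nan_val), is.infinite(inf_val)))
print(na_flags)
print(c(length(dropped), length(lst)))
```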
+
+
Basic data structures in R:
+
atomic ?vector (can only contain one type)
+
+?list (containers for other objects)
+
+?data.frame two dimensional objects whose columns can
+contain different types of data
+
+?matrix two dimensional objects that can contain only
+one type of data.
+
+?factor vectors that contain predefined categorical
+data.
+
+?array multi-dimensional objects that can only contain
+one type of data
+
Remember that matrices are really atomic vectors underneath the hood,
+and that data.frames are really lists underneath the hood (this explains
+some of the weirder behaviour of R).
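A short sketch of what "underneath the hood" means in practice. The matrix and data frame below hold toy values invented for illustration.

```r
# A matrix is an atomic vector with a dim attribute, filled column by column.
m <- matrix(1:6, nrow = 2)
same_element <- m[4] == m[2, 2]  # single-index access works as on a vector
flat <- as.vector(m)             # dropping the dimensions gives back 1:6

# A data frame is a list of equal-length columns.
df <- data.frame(id = 1:2, label = c("a", "b"))
df_is_list <- is.list(df)        # TRUE
n_columns <- length(df)          # length() counts columns, like list elements

print(c(same_element, df_is_list))
print(c(typeof(m), typeof(df)))  # "integer" "list"
```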
Program defensively, i.e., assume that errors are going to arise,
+and write code to detect them when they do.
+
Write tests before writing code in order to help determine exactly
+what that code is supposed to do.
+
Know what code is supposed to do before trying to debug it.
+
Make it fail every time.
+
Make it fail fast.
+
Change one thing at a time, and for a reason.
+
Keep track of what you’ve done.
+
Be humble
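
One way to program defensively in R, and to make code fail fast, is to check inputs with stopifnot at the top of a function. The sketch below uses a hypothetical temperature-conversion helper, fahr_to_kelvin, defined here only for illustration.

```r
# Hypothetical helper: converts Fahrenheit to Kelvin and
# fails fast when the input is not numeric.
fahr_to_kelvin <- function(temp) {
  stopifnot(is.numeric(temp))
  (temp - 32) * (5 / 9) + 273.15
}

result <- fahr_to_kelvin(32)  # freezing point of water: 273.15

# Invalid input raises an error straight away; tryCatch captures the message.
err <- tryCatch(
  fahr_to_kelvin("hot"),
  error = function(e) conditionMessage(e)
)

print(result)
print(err)
```

Failing at the point where the bad value enters makes the cause much easier to find than an obscure error several calls later.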
+
Glossary
+
+
argument
+
+A value given to a function or program when it runs. The term is often
+used interchangeably (and inconsistently) with parameter.
+
+
assign
+
+To give a value a name by associating a variable with it.
+
+
body
+
+(of a function): the statements that are executed when a function runs.
+
+
comment
+
+A remark in a program that is intended to help human readers understand
+what is going on, but is ignored by the computer. Comments in Python, R,
+and the Unix shell start with a # character and run to the
+end of the line; comments in SQL start with --, and other
+languages have other conventions.
+
+
comma-separated values
+
+(CSV) A common textual representation for tables in which the values in
+each row are separated by commas.
+
+
delimiter
+
+A character or characters used to separate individual values, such as
+the commas between columns in a CSV file.
+
+
documentation
+
+Human-language text written to explain what software does, how it works,
+or how to use it.
+
+
floating-point number
+
+A number containing a fractional part and an exponent. See also: integer.
+
+
for loop
+
+A loop that is executed once for each value in some kind of set, list,
+or range. See also: while loop.
+
+
index
+
+A subscript that specifies the location of a single value in a
+collection, such as a single pixel in an image.
+
+
library
+
+In R, the directory(ies) where packages are
+stored.
+
+
package
+
+A collection of R functions, data and compiled code in a well-defined
+format. Packages are stored in a library and
+loaded using the library() function.
+
+
parameter
+
+A variable named in the function’s declaration that is used to hold a
+value passed into the call. The term is often used interchangeably (and
+inconsistently) with argument.
+
+
return statement
+
+A statement that causes a function to stop executing and return a value
+to its caller immediately.
+
+
sequence
+
+A collection of information that is presented in a specific order.
+
+
shape
+
+An array’s dimensions, represented as a vector. For example, a 5×3
+array’s shape is (5,3).
+
+
string
+
+Short for “character string”, a sequence of zero
+or more characters.
+
+
syntax error
+
+A programming error that occurs when statements are in an order or
+contain characters not expected by the programming language.
+
+
type
+
+The classification of something in a program (for example, the contents
+of a variable) as a kind of number (e.g. floating-point, integer), string, or something else. In R the command typeof()
+is used to query a variable’s type.
+
+
while loop
+
+A loop that keeps executing as long as some condition is true. See also:
+for loop.
+