-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
276 lines (191 loc) · 9 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
---
title: "HEADS ACCELERATOR - Module 3"
author:
- "Francisco Bischoff"
date: "`r format(Sys.Date(), '- %d %b %Y')`"
editor_options:
markdown:
mode: gfm
output:
md_document:
variant: gfm
html_preview: true
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(
fig.path = ".assets/figures/README-",
echo = TRUE,
message = FALSE,
warning = FALSE,
collapse = TRUE
)
```
<img src=".assets/figures/logo.png" align="right" style="float:right;"/>
<!-- start badges -->
<img src="https://github.com/HEADS-UPorto/${{env.REPOSITORY_SLUG}}/workflows/Building%20Binder/badge.svg" alt="Building Binder"/>
<a href="https://mybinder.org/v2/gh/HEADS-UPorto/Rstudio_Env/main?urlpath=git-pull%3Frepo%3Dhttps%253A%252F%252Fgithub.com%252FHEADS-UPorto%252F${{env.REPOSITORY_SLUG}}%26targetPath%3Dheads%26urlpath%3Drstudio%252F%26branch%3Dmaster"><img src="https://mybinder.org/badge_logo.svg" alt="Launch Binder"/></a>
<img src="https://github.com/HEADS-UPorto/${{env.REPOSITORY_SLUG}}/workflows/GitHub%20Classroom%20Workflow/badge.svg?branch=master&event=push" alt="GitHub Classroom Workflow"/>
<!-- end badges -->
## Syllabus
This module will cover Composite Data Types:
- Vectors
- Factors
- Matrices
- Lists
- Data Frames
## Lessons
### Module 3.1: Vectors
Starting from the last class, we see that the variables only contained one value.
If we want to store several values in a variable, we use the `c()` function.
This functions means "concatenate", or link (things) together in a chain or series.
That is when the Vectors come to play.
A vector is a combination of values of the same **Basic** type, as for example:
``` r
my_vector <- c(1, 2, 3) # this is a vector of `numeric` with length 3
my_other_vector <- c("one", "two", "three") # this is a vector of `character`
```
Vectors can only store values of the **same** type.
If you try to combine different types, R will try its best to find a type that can accommodate all data, so be careful.
For example:
```{r combine_diff_types, collapse=TRUE}
# here we have a vector of integers
my_int <- c(1L, 2L, 3L)
typeof(my_int)
# here we have a vector (yes! this is a vector of length 1) with one string
my_char <- "some string"
typeof(my_char)
# now we combine them
my_combined_vector <- c(my_int, my_char)
typeof(my_combined_vector)
# We can't convert a string into an integer, but we can convert an integer into a string!
my_combined_vector
```
Vectors are flat, they have no dimensions (you will understand later), and to get its size, you **must** use the `length()` function.
### Module 3.2: Factors
Factor is a type that encodes categorical data.
Its underlying structure is composed by integers, but linked to an attribute composed by strings, which are the **levels** of the factors.
The factor is a Composite type because it links integers and strings intrinsically to simbolize "categories" while keeping a low memory usage and computational efficiency.
Check this out:
```{r factors, collapse=TRUE}
# first let's create a vector with factors
my_factor <- factor(c("male", "female", "female", "male", "male", "male"))
my_factor
# now, let's inspect the structure of this variable (can you see the integers?)
str(my_factor)
# now let's go further deep, let's remove the class of this object and see the structure
# It is a vector of integers, and has an attribute called "levels" that contains two strings
str(unclass(my_factor))
```
You can use the function `levels()` to see which "categories" (i.e. levels) the factor contains, or just `nlevels()` get the count of levels.
### Module 3.3: Matrices
A matrix, as you remember from mathematics, its a "table-like" structure, with two dimensions (remember what I said about vectors?).
You may ask, what about multidimensional matrices?
In R, there is another type that can have more than two dimensions, but let's keep it simple for now.
You can check the dimensions of a matrix with the `dim()` function:
```{r dim_matrix, collapse=TRUE}
my_matrix <- matrix(30:16, 5, 3) # create a matrix with 5 rows and 3 columns
my_matrix
dim(my_matrix) # in R, rows always comes first
```
We can also check the specifically the number of rows with `nrow()` and columns with `ncol()`:
```{r nrow_ncol, collapse=TRUE}
nrow(my_matrix)
ncol(my_matrix)
```
Moreover, we can check the whole size of our matrix with `length()`.
This will give us the total number of elements that out matrix contains.
```{r matrix_length, collapse=TRUE}
length(my_matrix)
```
This is important to know, because we have two ways to access matrices elements, giving the row-column value, or its flat index (we will talk about accessing elements later), like this:
```{r flat_index, collapse=TRUE}
# second row, third column
my_matrix[2, 3]
# or, the 12th element
my_matrix[12]
```
As for vectors, matrices also can only contain one **Basic** type.
Another way to create a matrix is binding vectors by column with `cbind()` or by row with `rbind()`:
```{r bind, collapse=TRUE}
my_vector <- c(1, 2, 3)
# when combining vectors, they must have the same length.
my_binded_matrix <- rbind(my_vector, my_vector)
my_binded_matrix
my_other_matrix <- cbind(my_vector, my_vector)
my_other_matrix
```
### Module 3.4: Lists
A List is one of the most versatile objects in R.
It can contain different types of data, like a number, a vector, functions or even another list, forming a tree-like structure.
The elements can have different lengths.
Here is an example:
```{r build_a_tree, collapse=TRUE}
my_vector <- c(1, 2, 3)
my_list <- list(test = "string", flag = TRUE, m = matrix(0, 3, 2))
my_tree <- list(a = 1, b = my_list, 3, c = my_vector, f = function(){print("hello")})
str(my_tree)
```
Lists are dynamic, and this object is efficient for adding and removing elements.
To remove an element, just set its value to `NULL`:
```{r remove_element, collapse=TRUE}
# to access an element, you can use the $ (dollar) symbol
my_tree$c <- NULL
str(my_tree)
```
An important thing is that, as you can see above, there is an element that has no "name" associated with it (the num 3, now at the 3rd position of this list).
To access this element, you have to use its position, not its name.
This can be complicated if you add/remove elements.
So its a good practice to give meaningful names to the elements.
Another important difference is that lists has two levels of accessing its elements.
You can access the **element** or **the value of the element**.
This sounds complicated, but let's see a brief example:
```{r access_element, collapse=TRUE}
# here we are accessing the element:
my_tree[3] # here we are using the index, because this element has no name
# it still a list, with one element
class(my_tree[3])
# here we are acessing the value of the element
my_tree[["a"]] # this is the same as using the $ (dollar) sign
# now we just have the numeric, no more lists
class(my_tree[["a"]])
```
The size of a list depends on what you want to measure:
```{r list_size, collapse=TRUE}
# how many elements there are in this list?
str(my_tree)
length(my_tree) # this is the number of the top-level elements
# I want to know how many values are stored inside the whole object
length(unlist(my_tree)) # we don't count the nodes
```
Just for fun, let's call the function we put inside the list:
```{r, function_list, collapse=TRUE}
my_tree$f()
```
### Module 3.5: Data Frames
Finally, we will meet the Data Frame.
This type shares many properties of matrices and lists:
- It has two dimensions. All lines must have the same number of columns, and all columns must have the same number of lines.
- The values can be accessed using the matrix syntax: [row, column]
- The columns have names, and we can access them using the $ (dollar) sign
- Each column can have a different data type. The rows inside a column are from the same data type (there are weird exceptions, but let's not think about that).
Data Frames are usually the output of functions that imports external data, as `read_csv()` for example.
We can also create a data frame using the `data.frame()` function:
```{r create_data_frame, collapse=TRUE}
my_data <- data.frame(name = c("John", "Mary", "Sam", "Peter"), age = c(30, 25, 40, 37))
my_data
```
Data Frames are useful for data science because:
- Ensures that each variable (column) has an observation (row), or at least registers it as missing
- Ensures that each column has a name associated (the variable name)
- Are a stable concept across multiple languages ([read more](https://plateau-workshop.org/assets/papers-2019/10.pdf))
As for matrices, use `dim()`, `nrow()` and `ncol()` to inspect the shape of the data frame.
Using `length()` will return the total count of elements the data frame has (usually rows x cols).
For retrieving the names of all variables (columns) use `colnames()`:
```{r colnames, collapse=TRUE}
colnames(my_data)
```
### Module 3.6: Write a simple R code
Your third task will be:
1. Open the file `assessment.R` that is already in the project, and start from there.
2. Complete the code following the instructions in the comments