---
title: "DataFrames_and_Lists"
author: "HDS"
date: "2024-06-21"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Data Frames and Lists
A dataframe in R is essentially R's version of an SQL data table.
Modern computer languages borrow concepts from one another constantly.
Python has a library called Pandas that implements a dataframe that
works in a very similar way.
In a dataframe (or SQL data table) each row is an "observation", an "event", or
an "individual item".
All the variable values on a given row belong to the same observation.
The columns of a dataframe are different variables, one per column. The
columns do not have to be the same type of data as the other columns.
Dataframes look much like Excel sheets, but the structure is more rigid.
Typically you will be loading a dataframe from some data source, maybe a
database, a website or a table created in Excel.
Right now we will look at an example data set that comes with R, the "Motor Trend
cars" data set, mtcars. This is a built-in "lab rat" data set.
We will load the data first.
```{R}
data(mtcars)
```
We can find out what is in it using str().
```{R}
str(mtcars)
```
Like a matrix, the data frame has rows and columns we can index by number.
There are 32 rows here and 11 columns.
We can look at the first 6 rows of the data frame with head(); this is a nice way to
see what is in a data frame.
```{R}
head(mtcars)
```
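As a quick check on the row and column counts quoted above, a minimal sketch using the base R functions dim(), nrow() and ncol() might look like this. The choice of 3 rows for head() is only for illustration.

```{r}
# Dimensions of the built-in mtcars data frame
data(mtcars)
dim(mtcars)    # rows first, then columns
nrow(mtcars)   # number of rows (observations)
ncol(mtcars)   # number of columns (variables)
# head() takes an optional second argument for the number of rows to show
head(mtcars, 3)
```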
We can see the last 6 rows using the tail() function, or the whole
dataframe using the View() function.
It is a bad idea to depend too heavily on the View() option.
This is a small data set, 32 rows by 11 columns, so it is easy to understand
what is happening, but when you get into the hundreds or thousands of rows,
View() is not really useful anymore.
```{R}
View(mtcars)
```
On the other hand, I use str(), head() and colnames() constantly.
We can get the row names and the column names.
```{R}
rownames(mtcars)
```
```{R}
colnames(mtcars)
```
## Summary
Most R objects will show you some interesting information when you
use the summary() function.
```{R}
summary(mtcars)
```
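To illustrate how summary() adapts to what it is given, here is a small sketch on a single numeric column and on a factor. Converting cyl to a factor is my own choice for the example, not something the data set does for you.

```{r}
# summary() adapts its output to the type of object it is given
data(mtcars)
mpg_summary <- summary(mtcars$mpg)          # numeric vector: min, quartiles, mean, max
mpg_summary
cyl_counts <- summary(factor(mtcars$cyl))   # factor: number of rows in each level
cyl_counts
```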
## Indexing a data frame
Indexing works the same way it does for matrices: [row, column].
Index with integers, or with row and column names.
```{R}
mtcars[1,1:5]
```
```{R}
mtcars[2,]
```
```{R}
mtcars[,10]
```
```{r}
mtcars["Mazda RX4","hp"]
```
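Beyond single positions and names, two other indexing patterns come up often: picking several columns with a character vector, and keeping only the rows that meet a condition. The sketch below is one way to do this in base R; the names four_cyl, gear_vec and gear_df are made up for the example.

```{r}
data(mtcars)
# Select several columns by name with a character vector
mtcars[1:3, c("mpg", "hp", "wt")]
# Keep only the rows where a condition is TRUE (logical indexing)
four_cyl <- mtcars[mtcars$cyl == 4, c("mpg", "hp")]
head(four_cyl)
# A single column comes back as a plain vector unless drop = FALSE is given
gear_vec <- mtcars[, 10]
gear_df <- mtcars[, 10, drop = FALSE]
class(gear_vec)
class(gear_df)
```

The drop = FALSE form is handy when later code expects a data frame rather than a vector.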
## Working with a column
We can work with a column by giving the name of the dataframe, then $ and the column
name.
```{R}
mtcars$wt
```
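Because a column pulled out with $ is an ordinary vector, the usual vector functions apply to it. A brief sketch, where the [[ ]] form is shown only as an equivalent alternative:

```{r}
data(mtcars)
# $ and [[ ]] both return the column as a plain numeric vector
identical(mtcars$wt, mtcars[["wt"]])
# so ordinary vector functions work on it
mean(mtcars$wt)    # average weight (in 1000 lbs, per ?mtcars)
range(mtcars$wt)   # lightest and heaviest
```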
## Adding a column
R does not "protect" data frames; we can just add a new column by
giving it a column name and a set of values. This is a bit alarming if you
are used to managing SQL databases :)
Example:
In cars, the horsepower-to-weight ratio is important (it is for humans too).
We can add this variable to our data frame.
```{R}
mtcars["Power_2_Weight"]=mtcars$hp/mtcars$wt
```
We can now ask, which car has the highest power to weight ratio?
```{R}
maxPW_index=which.max(mtcars$Power_2_Weight)
mtcars[maxPW_index,]
```
It might be interesting to see the top three power-to-weight vehicles,
not just the largest.
We'd like to sort the data frame by the Power_2_Weight ratio.
We use order() to get the row indices that do this.
```{R}
p2w_index=order(mtcars$Power_2_Weight,decreasing = TRUE)
head(mtcars[p2w_index,])
```
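head() shows six rows by default, so to get exactly the top three we can pass the number of rows as a second argument. The sketch below works on a copy (cars, a name chosen for this example) so it can be run on its own:

```{r}
# Work on a copy so the mtcars data frame used elsewhere is left alone
cars <- datasets::mtcars
cars$Power_2_Weight <- cars$hp / cars$wt
# Row positions, from highest ratio to lowest
p2w_order <- order(cars$Power_2_Weight, decreasing = TRUE)
# Keep the first three rows and only the columns of interest
top3 <- head(cars[p2w_order, c("hp", "wt", "Power_2_Weight")], 3)
top3
```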
Look at those ideas and think about how you might use them on your own problem
to work with data.
## Plotting
We will see more sophisticated plotting later, but we can easily do simple
plots using the plot() or hist() functions.
```{R}
plot(mtcars$Power_2_Weight,mtcars$qsec)
```
What hypothesis does the plot above address?
## Histograms
Histograms are one way to look at the distribution of values.
```{r}
hist(mtcars$Power_2_Weight)
```
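The default plots are easier to read with a title and axis labels. The following is a sketch using the main, xlab, ylab and breaks arguments of base graphics; the label text and the choice of 10 bins are my own, and breaks is only a suggestion that R may adjust.

```{r}
# Recompute the ratio from the original columns so this chunk stands alone
p2w <- datasets::mtcars$hp / datasets::mtcars$wt
# Scatter plot with a title and axis labels
plot(p2w, datasets::mtcars$qsec,
     main = "Quarter-mile time vs power-to-weight ratio",
     xlab = "Horsepower per 1000 lbs",
     ylab = "Quarter-mile time (seconds)")
# Histogram with labels and a suggested number of bins
p2w_hist <- hist(p2w, breaks = 10,
                 main = "Distribution of power-to-weight ratio",
                 xlab = "Horsepower per 1000 lbs")
```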
## Question/Action
State a hypothesis about Power_2_Weight and mpg (miles per gallon). Create
a graph that tests the hypothesis.
## Creating data frames in R
We can create data frames from within R in several ways.
Let's declare some matched vectors, maybe describing some people.
```{R}
last_name=c("Smith","Jones","Alvarez")
first_name=c("Bob","Jane","Angel")
age=c(30,27,32)
```
We can put these in a dataframe, giving each vector a column name.
```{R}
customer_df=data.frame(last=last_name, first=first_name,Age=age)
head(customer_df)
```
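Once a data frame exists, it can be inspected and extended. The sketch below rebuilds the same three customers under a different name (people_df) and appends one more row with rbind(); the extra customer is invented purely for illustration.

```{r}
people_df <- data.frame(last = c("Smith", "Jones", "Alvarez"),
                        first = c("Bob", "Jane", "Angel"),
                        Age = c(30, 27, 32))
# Each column keeps its own type
sapply(people_df, class)
# Add one more customer (a made-up row) with rbind()
new_row <- data.frame(last = "Lee", first = "Sam", Age = 41)
people_df2 <- rbind(people_df, new_row)
nrow(people_df2)
```

rbind() needs the new row to have the same column names as the existing data frame.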
## Utilities
There are tools for reading data out of a csv (comma separated value) or Excel
file and into a data frame, or to connect to a database and convert an SQL
data table into a data frame.
There are also tools to write a data frame to disk.
It is often convenient to save results as data frames, since they are easy to
work with in other software packages if you save them as csv files.
## Loading a DataFrame from a File
We need to specify the name of the file, with the file path included.
The easiest way to get the full file path correct is to use the
file.choose() function to browse for the file.
Here, we browse for the file and then save its path to a variable called "infile".
One thing to be careful about is that file.choose() will not knit.
Use it in your work, but copy the file name to a variable and then
"rem out" (comment out) the file.choose() operation by putting # in front of it.
We will use the Boston Assessment Roll again.
```{R}
#rem this out by putting # in front of the command below
file.choose()
```
```{R}
# cut and paste using Ctrl-C and Ctrl-V to assign the full path to infile
infile="C:\\Users\\hdavi\\Dropbox\\Merrimack_Data_Science\\DSE5002_R+Python\\DSE5002_Module_1\\Boston_Assessment_Roll_2024.csv"
```
Now use read.csv() to load the assessment roll into a data frame.
```{R}
boston_roll=read.csv(infile)
```
```{R}
head(boston_roll)
```
There are R functions to import Excel files and many other types of data.
Google "R import excel sheet" to learn how to do this.
## Saving a dataframe to disk
This will let us use it somewhere else later.
To save to a specific location, we have to specify the whole path name, meaning the directory name
plus the file name.
If you don't specify a full path, R saves the file to the current working directory.
You can see the current directory in the Files window to the right in
RStudio.
You can also see the current directory using getwd(), which means "get working
directory".
```{R}
getwd()
```
You can change the directory using the "Session" menu in RStudio.
I'll change to the desktop using Session > Set Working Directory.
```{R}
getwd()
```
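The same change can be made from code with setwd(), which is what makes it reproducible in a script. In this sketch tempdir() is used only as a safe example target; you would substitute your own folder.

```{r}
# Remember where we started; setwd() returns the previous directory invisibly
old_dir <- setwd(tempdir())   # tempdir() is just a safe example target
getwd()                       # now the temporary directory
setwd(old_dir)                # switch back to where we started
getwd()
```

Saving the old directory and restoring it afterwards keeps the rest of the notebook from being affected.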
Let's save our customer_df data frame.
First, specify the file name.
```{r}
#I like to put the date in the name of an output file, it helps in long
#projects
outfile="customer_df_6_21_2024.csv"
```
```{R}
write.csv(customer_df,outfile)
```
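By default write.csv() also writes the row names as an extra first column, which shows up as an unnamed column when the file is read back. A round-trip sketch using row.names = FALSE avoids that; df_out is a small stand-in data frame and tempfile() gives a throwaway path, both chosen just for this example.

```{r}
# A small stand-in data frame for the example
df_out <- data.frame(last = c("Smith", "Jones"), Age = c(30, 27))
# tempfile() gives a throwaway path; swap in your own file name
tmp_csv <- tempfile(fileext = ".csv")
# row.names = FALSE leaves out the row-name column that write.csv adds by default
write.csv(df_out, tmp_csv, row.names = FALSE)
# Read the file back and compare the column names
df_in <- read.csv(tmp_csv)
names(df_in)
df_in
```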
Open this file in Excel and see if it matches what you expect.
## Lists
R can store things in lists, which can contain different types of objects.
```{R}
a=list("Bob", 1, as.integer(2),TRUE,c(1,3,5,7,9))
```
The list above mixes a string, a number, an integer, a logical value and a vector.
We can index list elements by position.
```{R}
a[1]
```
We can assign names to the elements of a list as well.
```{R}
a=list(name="Bob", purchases=1, category=as.integer(2),active=TRUE,items_purchased=c(1,3,5,7,9))
a
```
We can access these by name.
```{R}
a$name
```
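One detail worth knowing is the difference between single and double brackets on a list. Single brackets return a smaller list, while double brackets (or $) return the element itself. A short sketch, using a cut-down list named cust that is made up for this example:

```{r}
cust <- list(name = "Bob", purchases = 1, items_purchased = c(1, 3, 5, 7, 9))
# Single brackets return a smaller list holding the element
class(cust[1])
# Double brackets (or $) return the element itself
class(cust[[1]])
# Second value of the vector stored in the list
cust[["items_purchased"]][2]
# str() gives a compact overview of everything in the list
str(cust)
```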
Notice that one of the things in a, items_purchased, is itself a vector.
You cannot easily store a vector of values within a single cell of a column such as
"items_purchased" in a data frame.
You can't store this easily in a single SQL data table either.
You can in a newer NoSQL database like MongoDB, or in a user-defined class in R.
More on that later....
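Strictly speaking, base R does allow a so-called list column, where each cell holds a whole vector, although most data frame tools do not handle it as smoothly as ordinary columns. A minimal sketch, with orders_df and its values invented for illustration:

```{r}
orders_df <- data.frame(name = c("Bob", "Jane"))
# Assigning a list with one entry per row creates a "list column"
orders_df$items_purchased <- list(c(1, 3, 5, 7, 9), c(2, 4))
# Each cell now holds a whole vector
orders_df$items_purchased[[1]]
# How many items each customer bought
lengths(orders_df$items_purchased)
```

Writing such a data frame to a csv file is not straightforward, which is part of why this layout is awkward in practice.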
We can't really do math directly on a list, since they often have mixtures of
data types in them.
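To see this in action, the sketch below tries arithmetic on a whole list and catches the resulting error, then does the same arithmetic on one numeric element, which works. The list cust is a made-up example.

```{r}
cust <- list(name = "Bob", purchases = 1, items_purchased = c(1, 3, 5, 7, 9))
# Arithmetic on the whole list fails, so catch the error and keep its message
result <- tryCatch(cust * 2, error = function(e) conditionMessage(e))
result
# Arithmetic on a numeric element works as usual
sum(cust$items_purchased)
cust$items_purchased * 2
```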
Right now, dataframes will be more useful to you.
If you start working on complex problems, or start using NoSQL databases, you will
be using lists more often.