-
Notifications
You must be signed in to change notification settings - Fork 0
/
99-mutate-challenge.Rmd
181 lines (132 loc) · 5.61 KB
/
99-mutate-challenge.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
---
title: "Multi-column, multi-value mutate challenge"
output:
html_document:
df_print: paged
knit: (function(inputFile, encoding) { rmarkdown::render(inputFile, encoding = encoding, output_dir = "docs") })
---
```{r setup, echo=T, results='hide', message=F, warning=F}
library(tidyverse)
library(data.table)
```
I'm trying to define an efficient solution for the following problem:
Create a new column with a T/F value based on looking through multiple columns for a collection of values. If the value is present in any of the columns, make it TRUE. If not, make it FALSE.
I'm using starwars data to define the problem, which might make this appear like an illogial operation. In my real data, the columns I'm looking through contain similar strings. The data is too large (about 1Mx167) to reshape, so I'm trying to find a function or loop that will allow me to cycle through each column to look for the list of values. A function or loop would also be welcome because I have to do this with different columns and values at different times. (Otherwise, I would just use multiple and long `case_when` statements.)
For this example, I want to set the `my_col` value to be TRUE if any of these values ("Luke Skywalker", "Darth Vader", "blue") appear in EITHER the `name` or `eye_color` columns.
```{r}
starwars %>%
select(name, eye_color)
```
## Create my comparison lists
```{r}
list_cols <- starwars %>% select(name, eye_color) %>% names()
list_values <- c("Luke Skywalker", "Darth Vader", "blue")
```
## A filter to show the desired rows to be deemed TRUE
These are the cols I'm trying to capture. There should be 20.
```{r}
starwars %>%
select(name, eye_color) %>%
filter(name %in% list_values | eye_color %in% list_values)
```
## The long case_when version
This technically works, but there are many more than two columns to consider in my target data.
```{r}
starwars %>%
mutate(
my_col = case_when(
name %in% list_values ~ TRUE,
eye_color %in% list_values ~ TRUE,
TRUE ~ FALSE
)
) %>%
count(my_col)
```
## Loop attempt within case_when
Without the `TRUE ~ FALSE` setting, this provides an error: "No cases provided". Something is wrong with my loop, or that is not a good place for one.
```{r}
starwars %>%
mutate(
my_col = case_when(
for (colitem in list_cols) {
colitem %in% list_values ~ TRUE
},
TRUE ~ FALSE
)
) %>%
count(my_col)
```
## A different angle using rowSums
This rowSums attempt must aggregate the values in some way, or only affect the first or last value found, which makes some kinda of sense given `rowSums`. Other than there are 87 rows returned.
```{r}
starwars %>%
mutate(
my_col = rowSums(select(., list_cols) == list_values) > 0
) %>%
count(my_col)
```
## Function attempt
This function works for one column, but not many columns. It needs some kind of loop to evaluate each column. I have not been able to figure out how to do a loop within the `case_when` function.
```{r}
mutate_function <- function(fdata, fvalue) {
fdata %>%
select(name, eye_color) %>%
mutate(my_col = case_when(
eye_color %in% list_values ~ TRUE,
TRUE ~ FALSE
))
}
mutate_function(starwars, list_values)
```
## Let's try ifelse
Stacking multiple if/else does not work because previously changed values will be changed back since the previous condition is no longer evaluated.
```{r}
starwars %>%
select(name, eye_color) %>%
mutate(
my_col = ifelse(
name %in% list_values, TRUE, FALSE
)
) %>%
mutate(
my_col = ifelse(
eye_color %in% list_values, TRUE, FALSE
)
) %>%
count(my_col)
```
## Sharon Machlis function
Sharon Machlis offered this solution on News Nerdery and I think it will work, though I've yet to test it on the larger dataset. I really like the reusability of this solution. I've edit it for succinctness and to fit the flavor of this challenge.
> I think I've figured out a flaw in this, and again because I have a poor example. My dataframe will have more columns than just those considered in the filter and this spits out just those considered within the filter. Also, it is super slow.
```{r}
list_cols <- starwars %>% select(name, eye_color) %>% names()
list_values <- c("Luke Skywalker", "Darth Vader", "blue")
multi_col_mutate <- function(the_data, the_cols, the_values) {
myresults <- the_data[the_cols]
numcol <- ncol(myresults)
numrow <- nrow(myresults)
myresults$mycol <- NA
for (i in 1:numrow) {
myresults$mycol[i] <- any(as_vector(myresults[i,1:numcol]) %in% the_values)
}
return(myresults)
}
newdf <- multi_col_mutate(starwars, list_cols, list_values)
newdf %>% count(mycol)
```
## Sharon Machlis data.tables idea
I like the promise of speed that Sharon mentions in her News Nerdery post, but in this case I need to be able to pass in the list of columns, as there are 25+ of them in my real data and they change depending on use case. I need to learn how to do that.
Clearly, I need to study [data.table](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html) more.
More info from Sharon
the_data_subset <- the_data[, ..my_cols]
my_results <- the_data_subset[, mycol := any(as_vector(.SD) %chin% my_values), by = seq_len(nrow(the_data))]
full_data <- merge(the_data, my_results, by.x = my_cols, by.y = my_cols, all.x = TRUE, all.y = TRUE)
```{r}
dt_my_values <- c("Luke Skywalker", "Darth Vader", "blue")
dt_data <- as.data.table(starwars)
# dt_data[name %in% dt_my_values]
# dt_data <- dt_data[, .(name, eye_color)]
dt_results <- dt_data[, mycol := any(as_vector(.SD) %chin% dt_my_values)]
dt_results
dt_results %>% count(mycol)
```