99-mutate-challenge.Rmd

---
title: "Multi-column, multi-value mutate challenge"
output:
  html_document:
    df_print: paged
knit: (function(inputFile, encoding) { rmarkdown::render(inputFile, encoding = encoding, output_dir = "docs") })
---

```{r setup, echo=T, results='hide', message=F, warning=F}
library(tidyverse)
library(data.table)
```

I'm trying to define an efficient solution for the following problem:

Create a new column with a T/F value based on looking through multiple columns for a collection of values. If the value is present in any of the columns, make it TRUE. If not, make it FALSE.

I'm using starwars data to define the problem, which might make this appear like an illogial operation. In my real data, the columns I'm looking through contain similar strings. The data is too large (about 1Mx167) to reshape, so I'm trying to find a function or loop that will allow me to cycle through each column to look for the list of values. A function or loop would also be welcome because I have to do this with different columns and values at different times. (Otherwise, I would just use multiple and long `case_when` statements.)

For this example, I want to set the `my_col` value to be TRUE if any of these values ("Luke Skywalker", "Darth Vader", "blue") appear in EITHER the `name` or `eye_color` columns.


```{r}
starwars %>% 
  select(name, eye_color)
```


## Create my comparison lists

```{r}
list_cols <- starwars %>% select(name, eye_color) %>% names()
list_values <- c("Luke Skywalker", "Darth Vader", "blue")
```

## A filter to show the desired rows to be deemed TRUE

These are the cols I'm trying to capture. There should be 20.

```{r}
starwars %>% 
  select(name, eye_color) %>% 
  filter(name %in% list_values | eye_color %in% list_values)
```


## The long case_when version

This technically works, but there are many more than two columns to consider in my target data.

```{r}
starwars %>% 
  mutate(
    my_col = case_when(
      name %in% list_values ~ TRUE,
      eye_color %in% list_values ~ TRUE,
      TRUE ~ FALSE
    )
  ) %>%
  count(my_col)
```

## Loop attempt within case_when

Without the `TRUE ~ FALSE` setting, this provides an error: "No cases provided". Something is wrong with my loop, or that is not a good place for one.

```{r}
starwars %>% 
  mutate(
    my_col = case_when(
      for (colitem in list_cols) {
        colitem %in% list_values ~ TRUE
      },
      TRUE ~ FALSE
    )
  ) %>%
  count(my_col)
```

## A different angle using rowSums

This rowSums attempt must aggregate the values in some way, or only affect the first or last value found, which makes some kinda of sense given `rowSums`. Other than there are 87 rows returned.

```{r}
starwars %>% 
  mutate(
      my_col = rowSums(select(., list_cols) == list_values) > 0
    ) %>% 
  count(my_col)
```

## Function attempt

This function works for one column, but not many columns. It needs some kind of loop to evaluate each column. I have not been able to figure out how to do a loop within the `case_when` function.

```{r}
mutate_function <- function(fdata, fvalue) {
  fdata %>% 
    select(name, eye_color) %>% 
    mutate(my_col = case_when(
      eye_color %in% list_values ~ TRUE,
      TRUE ~ FALSE
    ))
}

mutate_function(starwars, list_values)
```


## Let's try ifelse

Stacking multiple if/else does not work because previously changed values will be changed back since the previous condition is no longer evaluated.

```{r}

starwars %>% 
  select(name, eye_color) %>% 
  mutate(
    my_col = ifelse(
      name %in% list_values, TRUE, FALSE
    )
  ) %>% 
  mutate(
    my_col = ifelse(
      eye_color %in% list_values, TRUE, FALSE
    )
  ) %>% 
  count(my_col)

```

## Sharon Machlis function

Sharon Machlis offered this solution on News Nerdery and I think it will work, though I've yet to test it on the larger dataset. I really like the reusability of this solution. I've edit it for succinctness and to fit the flavor of this challenge.

> I think I've figured out a flaw in this, and again because I have a poor example. My dataframe will have more columns than just those considered in the filter and this spits out just those considered within the filter. Also, it is super slow.

```{r}
list_cols <- starwars %>% select(name, eye_color) %>% names()
list_values <- c("Luke Skywalker", "Darth Vader", "blue")

multi_col_mutate <- function(the_data, the_cols, the_values) {
  myresults <- the_data[the_cols]
  numcol <- ncol(myresults)
  numrow <- nrow(myresults)
  myresults$mycol <- NA
    for (i in 1:numrow) {
      myresults$mycol[i] <- any(as_vector(myresults[i,1:numcol]) %in% the_values)
    }
  return(myresults)
}

newdf <- multi_col_mutate(starwars, list_cols, list_values)

newdf %>% count(mycol)
```


## Sharon Machlis data.tables idea

I like the promise of speed that Sharon mentions in her News Nerdery post, but in this case I need to be able to pass in the list of columns, as there are 25+ of them in my real data and they change depending on use case. I need to learn how to do that.

Clearly, I need to study  [data.table](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html) more.

More info from Sharon

the_data_subset <- the_data[, ..my_cols]
my_results <- the_data_subset[, mycol := any(as_vector(.SD) %chin% my_values), by = seq_len(nrow(the_data))]
full_data <- merge(the_data, my_results, by.x = my_cols, by.y = my_cols, all.x = TRUE, all.y = TRUE)

```{r}
dt_my_values <- c("Luke Skywalker", "Darth Vader", "blue")

dt_data <- as.data.table(starwars)
# dt_data[name %in% dt_my_values]
# dt_data <- dt_data[, .(name, eye_color)]
dt_results <- dt_data[, mycol := any(as_vector(.SD) %chin% dt_my_values)]
dt_results
dt_results %>% count(mycol)
```