# Data transformation
```{r, message=FALSE, warning=FALSE}
library(jsonlite)   # stream_in() for reading newline-delimited JSON
library(dplyr)
library(knitr)
library(tokenizers)
library(syuzhet)
```
```{r, message=FALSE, warning=FALSE,echo=FALSE, results='hide'}
json_data <- stream_in(file("data/demo_df.json"))
```
## Feature Selection
The data set contains 35 top-level features, several of which hold additional sub-features. After careful consideration, we chose 7 features to construct our data set for analysis. The table below describes each selected variable.
```{r}
# Select the top-level features; the nested ones are expanded below,
# yielding 7 variables in total
df <- json_data[,c("created_at","id_str","text","user","place","extended_tweet")]
# `id_str` and `followers_count` from `user`
user_new <- df[, c("user")] %>% subset(select = c("id_str", "followers_count"))
colnames(user_new) <- c("user_id", "followers_count")
# `full_name` from `place`, the name of a location
place_new <- df[,c("place")] %>% subset(select = c(full_name))
colnames(place_new) <- c("location")
# `full_text` from `extended_tweet`
full_text <- df[,c("extended_tweet")] %>% subset(select = c(full_text))
colnames(full_text) <- c("full_text")
# Construct the new data frame
df_final <- df %>% subset(select = -c(user, place, extended_tweet)) %>%
  cbind(user_new) %>% cbind(place_new) %>% cbind(full_text)
# Generate the feature overview table
col_names <- colnames(df_final)
description <- c("creation time of tweet", "tweet id", "tweet text (truncated at 140 characters)", "user id", "number of followers of user", "location of tweet", "full text of tweet")
knitr::kable(cbind(col_names, description), col.names = c("Features","Description"), caption = "Feature Overview")
```
Of the 7 features, `user_id` and `followers_count` are sub-features of `user`; they capture a user's id and follower count, respectively. `location` is a sub-feature of `place`. It is a user-supplied location, so it can contain anything. `full_text` is a sub-feature of `extended_tweet` and captures tweets that exceed 140 characters. We did not keep the feature `geo` because almost 100% of its values are missing; it appears again in the missing values section to support our decision.
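As a quick sanity check, the missing share of `geo` can be estimated directly. This is a sketch (not evaluated here) that assumes `stream_in()` parsed `geo` into a nested data frame with a `type` element:
```{r, eval=FALSE}
# Fraction of tweets with no geo information; `type` is NA whenever the raw
# `geo` field was null in the source JSON (assumed parsing structure)
mean(is.na(json_data$geo$type))
```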
After selecting our basic features, we noticed that tweet text is split across two features: if a tweet is longer than 140 characters, a truncated version is stored in `text` and the full tweet in `full_text`; otherwise only `text` is populated. To make our analysis more convenient, we add a new variable, `original_text`, that always holds the complete tweet.
```{r}
# Prefer `full_text` when available, otherwise fall back to `text`
df_final <- df_final %>%
  mutate(original_text = coalesce(full_text, text))
```
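As a quick check on how the two sources combine, we can count how many tweets fall back to the truncated `text` (a sketch, not evaluated here):
```{r, eval=FALSE}
# TRUE = no `full_text`, so the truncated `text` was used
table(is.na(df_final$full_text))
```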
## Tokenization
Next, we tokenize the tweets, a step that supports the natural language processing analysis that follows. Because tweets contain special tokens such as hashtags and usernames that other tokenizers might strip away, we use the `tokenize_tweets()` function from the `tokenizers` package, which preserves them. Below is a demonstration of how it tokenizes one of the tweets.
```{r}
print(df_final$original_text[1])
print(tokenizers::tokenize_tweets(df_final$original_text[1]))
```
We then add `word_tokens` as a feature to our dataset.
```{r}
df_final <- df_final %>%
  mutate(word_tokens = tokenizers::tokenize_tweets(original_text))
```
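Because `tokenize_tweets()` returns one character vector per tweet, per-tweet token counts follow directly from `lengths()`. A sketch (not evaluated here):
```{r, eval=FALSE}
# Distribution of tweet lengths measured in tokens
summary(lengths(df_final$word_tokens))
```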
## Sentiment
Tweets carry sentiment, and here we classify each tweet as positive, negative, or neutral.
Before extracting sentiment, we first clean the text so that it no longer contains special characters such as hashtags, `@` mentions, and website links, since these can distort the sentiment score.
```{r}
# Remove links; the first pattern already matches https URLs,
# so the second line is kept only as a safeguard
cleaned_text <- gsub('http\\S+\\s*', "", df_final$original_text)
cleaned_text <- gsub('https\\S+\\s*', "", cleaned_text)
# Drop the `#` and `@` symbols but keep the words that follow them
cleaned_text <- gsub("#", "", cleaned_text)
cleaned_text <- gsub("@", "", cleaned_text)
```
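To illustrate what these substitutions do, here is a sketch on a made-up tweet (the example string is hypothetical; not evaluated here):
```{r, eval=FALSE}
example <- "Stay safe! #COVID19 @WHO https://t.co/abc123"
example <- gsub('http\\S+\\s*', "", example)  # also matches https links
example <- gsub("#", "", example)             # drop `#` but keep the hashtag word
example <- gsub("@", "", example)             # drop `@` but keep the handle
example                                       # "Stay safe! COVID19 WHO "
```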
```{r}
# Attach the cleaned text to the data set as a new column
cleaned_text <- cleaned_text %>% as.data.frame()
colnames(cleaned_text) <- c("cleaned_text")
df_final <- cbind(df_final, cleaned_text)
```
Below is a comparison between the original text and the cleaned text; note that the `@` symbols and website links have been removed. After cleaning, all tweets are stored in a new feature, `cleaned_text`.
```{r}
print(df_final$original_text[220])
print(df_final$cleaned_text[220])
```
```{r}
# Too slow on the full data set, so abandoned for now:
# extract per-emotion word counts with the NRC lexicon.
# df_final$emotion_word_counts <- syuzhet::get_nrc_sentiment(df_final$cleaned_text)
```
Now we compute a sentiment score for each tweet with the `syuzhet` package, whose default method uses a custom sentiment dictionary developed in the Nebraska Literary Lab. The scores are stored in a new feature, `sentiment_score`.
```{r}
df_final$sentiment_score <- syuzhet::get_sentiment(df_final$cleaned_text)
```
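To give a feel for the scale, positive wording yields scores above zero and negative wording below. A small sketch with made-up sentences (not evaluated here):
```{r, eval=FALSE}
# Positive, negative, and flat sentences score >0, <0, and 0 respectively
syuzhet::get_sentiment(c("I love this vaccine, great news!",
                         "This is terrible and scary.",
                         "The meeting is at noon."))
```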
We then classify each tweet into three categories: positive (score > 0), neutral (score = 0), and negative (score < 0). We add `sentiment` as a feature to our dataset to capture the sentiment category of each tweet.
```{r}
df_final <- df_final %>%
  mutate(sentiment = case_when(
    sentiment_score > 0 ~ "positive",
    sentiment_score < 0 ~ "negative",
    sentiment_score == 0 ~ "neutral"
  ))
```
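A quick tabulation then shows how the tweets split across the three categories (a sketch, not evaluated here):
```{r, eval=FALSE}
# Counts of negative / neutral / positive tweets
table(df_final$sentiment)
```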
## Summary of added features
After this processing we have added 5 new features to the dataset, bringing the total to 12. These new features will support the analysis and visualizations that follow. Below is a table describing each added feature.
```{r}
col_names <- colnames(df_final)[8:12]
description <- c("original text", "a list of word tokens", "text after removing special characters", "sentiment scores", "sentiment of tweet: positive, neutral, or negative")
knitr::kable(cbind(col_names, description), col.names = c("Features", "Description"), caption = "Additional Feature Overview")
```
```{r}
# Transform a raw .json data file into the cleaned data frame, given its path
data_tranform <- function(filepath){
  # Load the newline-delimited JSON
  json_data <- stream_in(file(filepath))
  # `geo` is retained here so the missing values chapter can examine it
  df <- json_data[,c("created_at","id_str","text","geo","user","place","extended_tweet")]
  # Extract the sub-features and replace the nested columns
  user <- df[,"user"] %>% as.data.frame()
  user_new <- user[, c("id_str", "followers_count")]
  colnames(user_new) <- c("user_id", "user_follower_count")
  place_new <- df[,c("place")] %>% subset(select = c(full_name))
  colnames(place_new) <- c("location")
  full_text <- df[,c("extended_tweet")] %>% subset(select = c(full_text))
  colnames(full_text) <- c("full_text")
  df_new <- df %>% subset(select = -c(user, place, extended_tweet)) %>%
    cbind(user_new) %>% cbind(place_new) %>% cbind(full_text)
  # Add original_text: full_text when available, otherwise the truncated text
  df_new <- df_new %>%
    mutate(original_text = coalesce(full_text, text))
  # Tokenization is skipped here because it is too slow on the full data:
  # df_new$word_tokens <- tokenizers::tokenize_tweets(df_new$original_text)
  # Add cleaned_text: strip links, `#`, and `@`
  cleaned_text <- gsub('http\\S+\\s*', "", df_new$original_text)
  cleaned_text <- gsub('https\\S+\\s*', "", cleaned_text)
  cleaned_text <- gsub("#", "", cleaned_text)
  cleaned_text <- gsub("@", "", cleaned_text)
  cleaned_text <- cleaned_text %>% as.data.frame()
  colnames(cleaned_text) <- c("cleaned_text")
  df_new <- cbind(df_new, cleaned_text)
  # Score sentiment and classify each tweet
  df_new$sentiment_score <- syuzhet::get_sentiment(df_new$cleaned_text)
  df_new <- df_new %>%
    mutate(sentiment = case_when(
      sentiment_score > 0 ~ "positive",
      sentiment_score < 0 ~ "negative",
      sentiment_score == 0 ~ "neutral",
      is.na(sentiment_score) ~ "neutral"
    ))
  return(df_new)
}
```
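The function can then be applied to any raw `.json` file with the same schema. A usage sketch (not evaluated here; the path is the demo file loaded above):
```{r, eval=FALSE}
# Rebuild the cleaned data frame from a raw file in one call
df_demo <- data_tranform("data/demo_df.json")
str(df_demo)
```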