# Data transformation
```{r, message=FALSE, warning=FALSE}
library(jsonlite)   # stream_in() for reading newline-delimited JSON
library(dplyr)
library(knitr)
library(tokenizers)
library(syuzhet)
```
```{r, message=FALSE, warning=FALSE,echo=FALSE, results='hide'}
json_data <- stream_in(file("data/demo_df.json"))
```
## Feature Selection
The data set contains 35 top-level features, several of which hold additional sub-features. After careful consideration, we chose 7 features to construct our data set for analysis. The table below describes each selected variable.
```{r}
# Select the top-level features; the nested ones are expanded below,
# yielding 7 variables in total
df <- json_data[,c("created_at","id_str","text","user","place","extended_tweet")]
# `id_str` and `followers_count` from `user`
user_new <- df[, c("user")] %>% subset(select = c("id_str", "followers_count"))
colnames(user_new) <- c("user_id", "followers_count")
# `full_name` from `place`, the name of a location
place_new <- df[,c("place")] %>% subset(select = c(full_name))
colnames(place_new) <- c("location")
# `full_text` from `extended_tweet`
full_text <- df[,c("extended_tweet")] %>% subset(select = c(full_text))
colnames(full_text) <- c("full_text")
# Construct the new data frame
df_final <- df %>% subset(select = -c(user, place, extended_tweet)) %>%
  cbind(user_new) %>% cbind(place_new) %>% cbind(full_text)
# Generate the feature overview table
col_names <- colnames(df_final)
description <- c("creation time of tweet", "tweet id", "tweet text (truncated at 140 characters)", "user id", "number of followers of user", "location of tweet", "full text of tweet")
knitr::kable(cbind(col_names, description), col.names = c("Features","Description"), caption = "Feature Overview")
```
Of the 7 features, `user_id` and `followers_count` are sub-features of `user`; they capture a user's id and follower count, respectively. `location` is a sub-feature of `place`. It is a user-supplied location, so it can contain anything. `full_text` is a sub-feature of `extended_tweet` and captures tweets that exceed 140 characters. We did not keep the feature `geo` because almost 100% of its values are missing; it appears again in the missing values section to support our decision.
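As a quick sanity check, the missing share of `geo` can be estimated directly. This is a sketch (not evaluated here) that assumes `stream_in()` parsed `geo` into a nested data frame with a `type` element:
```{r, eval=FALSE}
# Fraction of tweets with no geo information; `type` is NA whenever the raw
# `geo` field was null in the source JSON (assumed parsing structure)
mean(is.na(json_data$geo$type))
```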
After selecting our basic features, we noticed that tweet text is split across two features: if a tweet is longer than 140 characters, a truncated version is stored in `text` and the full tweet in `full_text`; otherwise only `text` is populated. To make our analysis more convenient, we add a new variable, `original_text`, that always holds the complete tweet.
```{r}
# Prefer `full_text` when available, otherwise fall back to `text`
df_final <- df_final %>%
  mutate(original_text = coalesce(full_text, text))
```
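As a quick check on how the two sources combine, we can count how many tweets fall back to the truncated `text` (a sketch, not evaluated here):
```{r, eval=FALSE}
# TRUE = no `full_text`, so the truncated `text` was used
table(is.na(df_final$full_text))
```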
## Tokenization
Next, we tokenize the tweets, a step that supports the natural language processing analysis that follows. Because tweets contain special tokens such as hashtags and usernames that other tokenizers might strip away, we use the `tokenize_tweets()` function from the `tokenizers` package, which preserves them. Below is a demonstration of how it tokenizes one of the tweets.
```{r}
print(df_final$original_text[1])
print(tokenizers::tokenize_tweets(df_final$original_text[1]))
```
We then add `word_tokens` as a feature to our dataset.
```{r}
df_final <- df_final %>%
  mutate(word_tokens = tokenizers::tokenize_tweets(original_text))
```
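Because `tokenize_tweets()` returns one character vector per tweet, per-tweet token counts follow directly from `lengths()`. A sketch (not evaluated here):
```{r, eval=FALSE}
# Distribution of tweet lengths measured in tokens
summary(lengths(df_final$word_tokens))
```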
## Sentiment
Tweets carry sentiment, and here we classify each tweet as positive, negative, or neutral.
Before extracting sentiment, we first clean the text so that it no longer contains special characters such as hashtags, `@` mentions, and website links, since these can distort the sentiment score.
```{r}
# Remove links; the first pattern already matches https URLs,
# so the second line is kept only as a safeguard
cleaned_text <- gsub('http\\S+\\s*', "", df_final$original_text)
cleaned_text <- gsub('https\\S+\\s*', "", cleaned_text)
# Drop the `#` and `@` symbols but keep the words that follow them
cleaned_text <- gsub("#", "", cleaned_text)
cleaned_text <- gsub("@", "", cleaned_text)
```
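To illustrate what these substitutions do, here is a sketch on a made-up tweet (the example string is hypothetical; not evaluated here):
```{r, eval=FALSE}
example <- "Stay safe! #COVID19 @WHO https://t.co/abc123"
example <- gsub('http\\S+\\s*', "", example)  # also matches https links
example <- gsub("#", "", example)             # drop `#` but keep the hashtag word
example <- gsub("@", "", example)             # drop `@` but keep the handle
example                                       # "Stay safe! COVID19 WHO "
```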
```{r}
# Attach the cleaned text to the data set as a new column
cleaned_text <- cleaned_text %>% as.data.frame()
colnames(cleaned_text) <- c("cleaned_text")
df_final <- cbind(df_final, cleaned_text)
```
Below is a comparison between the original text and the cleaned text; note that the `@` symbols and website links have been removed. After cleaning, all tweets are stored in a new feature, `cleaned_text`.
```{r}
print(df_final$original_text[220])
print(df_final$cleaned_text[220])
```
```{r}
# Too slow on the full data set, so abandoned for now:
# extract per-emotion word counts with the NRC lexicon.
# df_final$emotion_word_counts <- syuzhet::get_nrc_sentiment(df_final$cleaned_text)
```
Now we compute a sentiment score for each tweet with the `syuzhet` package, whose default method uses a custom sentiment dictionary developed in the Nebraska Literary Lab. The scores are stored in a new feature, `sentiment_score`.
```{r}
df_final$sentiment_score <- syuzhet::get_sentiment(df_final$cleaned_text)
```
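To give a feel for the scale, positive wording yields scores above zero and negative wording below. A small sketch with made-up sentences (not evaluated here):
```{r, eval=FALSE}
# Positive, negative, and flat sentences score >0, <0, and 0 respectively
syuzhet::get_sentiment(c("I love this vaccine, great news!",
                         "This is terrible and scary.",
                         "The meeting is at noon."))
```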
We then classify each tweet into three categories: positive (score > 0), neutral (score = 0), and negative (score < 0). We add `sentiment` as a feature to our dataset to capture the sentiment category of each tweet.
```{r}
df_final <- df_final %>%
  mutate(sentiment = case_when(
    sentiment_score > 0 ~ "positive",
    sentiment_score < 0 ~ "negative",
    sentiment_score == 0 ~ "neutral"
  ))
```
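A quick tabulation then shows how the tweets split across the three categories (a sketch, not evaluated here):
```{r, eval=FALSE}
# Counts of negative / neutral / positive tweets
table(df_final$sentiment)
```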
## Summary of added features
After this processing we have added 5 new features to the dataset, bringing the total to 12. These new features will support the analysis and visualizations that follow. Below is a table describing each added feature.
```{r}
col_names <- colnames(df_final)[8:12]
description <- c("original text", "a list of word tokens", "text after removing special characters", "sentiment scores", "sentiment of tweet: positive, neutral, or negative")
knitr::kable(cbind(col_names, description), col.names = c("Features", "Description"), caption = "Additional Feature Overview")
```
```{r}
# Transform a raw .json data file into the cleaned data frame, given its path
data_tranform <- function(filepath){
  # Load the newline-delimited JSON
  json_data <- stream_in(file(filepath))
  # `geo` is retained here so the missing values chapter can examine it
  df <- json_data[,c("created_at","id_str","text","geo","user","place","extended_tweet")]
  # Extract the sub-features and replace the nested columns
  user <- df[,"user"] %>% as.data.frame()
  user_new <- user[, c("id_str", "followers_count")]
  colnames(user_new) <- c("user_id", "user_follower_count")
  place_new <- df[,c("place")] %>% subset(select = c(full_name))
  colnames(place_new) <- c("location")
  full_text <- df[,c("extended_tweet")] %>% subset(select = c(full_text))
  colnames(full_text) <- c("full_text")
  df_new <- df %>% subset(select = -c(user, place, extended_tweet)) %>%
    cbind(user_new) %>% cbind(place_new) %>% cbind(full_text)
  # Add original_text: full_text when available, otherwise the truncated text
  df_new <- df_new %>%
    mutate(original_text = coalesce(full_text, text))
  # Tokenization is skipped here because it is too slow on the full data:
  # df_new$word_tokens <- tokenizers::tokenize_tweets(df_new$original_text)
  # Add cleaned_text: strip links, `#`, and `@`
  cleaned_text <- gsub('http\\S+\\s*', "", df_new$original_text)
  cleaned_text <- gsub('https\\S+\\s*', "", cleaned_text)
  cleaned_text <- gsub("#", "", cleaned_text)
  cleaned_text <- gsub("@", "", cleaned_text)
  cleaned_text <- cleaned_text %>% as.data.frame()
  colnames(cleaned_text) <- c("cleaned_text")
  df_new <- cbind(df_new, cleaned_text)
  # Score sentiment and classify each tweet
  df_new$sentiment_score <- syuzhet::get_sentiment(df_new$cleaned_text)
  df_new <- df_new %>%
    mutate(sentiment = case_when(
      sentiment_score > 0 ~ "positive",
      sentiment_score < 0 ~ "negative",
      sentiment_score == 0 ~ "neutral",
      is.na(sentiment_score) ~ "neutral"
    ))
  return(df_new)
}
```
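The function can then be applied to any raw `.json` file with the same schema. A usage sketch (not evaluated here; the path is the demo file loaded above):
```{r, eval=FALSE}
# Rebuild the cleaned data frame from a raw file in one call
df_demo <- data_tranform("data/demo_df.json")
str(df_demo)
```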