forked from PsyTeachR/ads-v1
-
Notifications
You must be signed in to change notification settings - Fork 5
/
app-hashtags.qmd
223 lines (166 loc) · 10 KB
/
app-hashtags.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
# Twitter Hashtags {#sec-twitter-hashtags}
In this appendix, we will create a table of the top five hashtags used in conjunction with #NationalComingOutDay, the total number of tweets in each hashtag, the total number of likes, and the top tweet for each hashtag.
```{r setup-app-g, message=FALSE}
library(tidyverse) # data wrangling functions
library(rtweet) # for searching tweets
library(glue) # for pasting strings
library(kableExtra) # for nice tables
```
The example below uses the data from @sec-summary (<a href="data/ncod_tweets.rds" download>which you can download</a>), but we encourage you to try a hashtag that interests you.
```{r, eval = FALSE}
# load tweets
tweets <- search_tweets(q = "#NationalComingOutDay",
n = 30000,
include_rts = FALSE)
# save them to a file so you can skip this step in the future
saveRDS(tweets, file = "data/ncod_tweets.rds")
```
```{r}
# load tweets from the file
tweets <- readRDS("data/ncod_tweets.rds")
```
## Select relevant data
The function `select()` is useful for just keeping the variables (columns) you need to work with, which can make working with very large datasets easier. The arguments to `select()` are simply the names of the variables and the resulting table will present them in the order you specify.
```{r}
tweets_with_hashtags <- tweets %>%
select(hashtags, text, favorite_count, media_url)
```
## Unnest columns
Look at the dataset using `View(tweets_with_hashtags)` or clicking on it in the Environment tab. You'll notice that the variable `hashtags` has multiple values in each cell (i.e., when users used more than one hashtag in a single tweet). In order to work with this information, we need to separate each hashtag so that each row of data represents a single hashtag. We can do this using the function `unnest()` and adding a pipeline of code.
```{r}
tweets_with_hashtags <- tweets %>%
select(hashtags, text, favorite_count, media_url) %>%
unnest(cols = hashtags)
```
::: {.callout-note .try}
Look at `tweets_with_hashtags` to see how it is different from the table `tweets`. WHy does it have more rows?
:::
## Top 5 hashtags
To get the top 5 hashtags we need to know how tweets used each one. This code uses pipes to build up the analysis. When you encounter multi-pipe code, it can be very useful to run each line of the pipeline to see how it builds up and to check the output at each step. This code:
* Starts with the object `tweets_with_hashtags` and then;
* Counts the number of tweets for each hashtag using `count()` and then;
* Filters out any blank cells using `!is.na()` (you can read this as "keep any row value where it is not true (`!`) that `hashtags` is missing") and then;
* Returns the top five values using `slice_max()` and orders them by the `n` column.
```{r}
top5_hashtags <- tweets_with_hashtags %>%
count(hashtags) %>%
filter(!is.na(hashtags)) %>% # get rid of the blank value
slice_max(order_by = n, n = 5)
top5_hashtags
```
Two of the hashtags are the same, but with different case. We can fix this by adding in an extra line of code that uses `mutate()` to overwrite the variable `hashtag` with the same data but transformed to lower case using `tolower()`. Since we're going to use the table `tweets_with_hashtags` a few more times, let's change that table first rather than having to fix this every time we use the table.
```{r}
tweets_with_hashtags <- tweets_with_hashtags %>%
mutate(hashtags = tolower(hashtags))
top5_hashtags <- tweets_with_hashtags %>%
count(hashtags) %>%
filter(!is.na(hashtags)) %>% # get rid of the blank value
slice_max(order_by = n, n = 5)
top5_hashtags
```
## Top tweet per hashtag
Next, get the top tweet for each hashtag using `filter()`. Use `group_by()` before you filter to select the most-liked tweet in each hashtag, rather than the one with most likes overall. As you're getting used to writing and running this kind of multi-step code, it can be very useful to take out individual lines and see how it changes the output to strengthen your understanding of what each step is doing.
```{r}
top_tweet_per_hashtag <- tweets_with_hashtags %>%
group_by(hashtags) %>%
filter(favorite_count == max(favorite_count)) %>%
sample_n(size = 1) %>%
ungroup()
```
::: {.callout-note .try}
The function `slice_max()` accomplishes the same thing as the `filter()` and `sample_n()` functions above. Look at the help for this function and see if you can figure out how to use it.
```{r, webex.hide = TRUE, eval = FALSE}
top_tweet_per_hashtag <- tweets_with_hashtags %>%
group_by(hashtags) %>%
slice_max(
order_by = favorite_count,
n = 1, # select the 1 top value
with_ties = FALSE # don't include ties
) %>%
ungroup()
```
:::
## Total likes per hashtag
Get the total number of likes per hashtag by grouping and summarising with `sum()`.
```{r}
likes_per_hashtag <- tweets_with_hashtags %>%
group_by(hashtags) %>%
summarise(total_likes = sum(favorite_count)) %>%
ungroup()
```
## Put it together
We can put everything together using `left_join()` (see @sec-left_join). This will keep everything from the first table specified and then add on the relevant data from the second table specified. In this case, we add on the data in `top_tweet_per_hashtag` and `like_per_hashtag` but only for the tweets included in `top5_hashtags`
```{r}
top5 <- top5_hashtags %>%
left_join(top_tweet_per_hashtag, by = "hashtags") %>%
left_join(likes_per_hashtag, by = "hashtags")
```
## Twitter data idiosyncrasies
Before we can finish up though, there's a couple of extra steps we need to add in to account for some of the idiosyncrasies of Twitter data.
First, the `@` symbol is used by R Markdown for referencing (see @sec-references). It's likely that some of the tweets will contain this symbol, so we can use mutate to find any instances of `@` and `r glossary("escape")` them using backslashes. Backslashes create a `r glossary("literal")` version of characters that have a special meaning in R, so adding them means it will print the `@` symbol without trying to create a reference. Of course `\` also has a special meaning in R, which means we also need to backslash the backslash. Isn't programming fun? We can use the same code to tidy up any ampersands (&), which sometimes display as "&".
Second, if there are multiple images associated with a single tweet, `media_url` will be a list, so we use `unlist()`. This might not be necessary for a different set of tweets; use `glimpse()` to check the data types.
Finally, we use `select()` to tidy up the table and just keep the columns we need.
```{r}
top5 <- top5_hashtags %>%
left_join(top_tweet_per_hashtag, by = "hashtags") %>%
left_join(likes_per_hashtag, by = "hashtags") %>%
# replace @ with \@ so @ doesn't trigger referencing
mutate(text = gsub("@", "\\\\@", text),
text = gsub("&", "&", text)) %>%
# media_url can be a list if there is more than one image
mutate(image = unlist(media_url)) %>%
# put the columns you want to display in order
select(hashtags, n, total_likes, text, image)
top5
```
## Make it prettier
Whilst this table now has all the information we want, it isn't great aesthetically. The <pkg>kableExtra</pkg> package has functions that will improve the presentation of tables. We're going to show you two examples of how you could format this table.
The first is (relatively) simple and stays within the R programming language using functionality from <pkg>kableExtra</pkg>. The main aesthetic feature of the table is the incorporation of the pride flag colours for each row. Each row is set to a different colour of the pride flag and the text is set to be black and bold to improve the contrast. We've also removed the `image` column, as it just contains a URL.
```{r}
# the hex codes of the pride flag colours, obtained from https://www.schemecolor.com/lgbt-flag-colors.php
# the last two characters (80) make the colours semi-transparent.
# omitting them or setting to FF make them 100% opaque
pride_colours <- c("#FF001880",
"#FFA52C80",
"#FFFF4180",
"#00801880",
"#0000F980",
"#86007D80")
top5 %>%
select(-image) %>%
kable(col.names = c("Hashtags", "No. tweets", "Likes", "Tweet"),
caption = "Stats and the top tweet for the top five hashtags.",
) %>%
kable_paper() %>%
row_spec(row = 0:5, bold = T, color = "black") %>%
row_spec(row = 0, font_size = 18,
background = pride_colours[1]) %>%
row_spec(row = 1, background = pride_colours[2])%>%
row_spec(row = 2, background = pride_colours[3])%>%
row_spec(row = 3, background = pride_colours[4])%>%
row_spec(row = 4, background = pride_colours[5])%>%
row_spec(row = 5, background = pride_colours[6])
```
## Customise with HTML
An alternative approach incorporates `r glossary("HTML")` and also uses the package <pkg>glue</pkg> to combine information from multiple columns.
First, we use `mutate()` to create a new column `col1` that combines the first three columns into a single column and adds some formatting to make the hashtag bold (`<strong>`) and insert line breaks (`<br>`). We'll also change the image column to display the image using html if there is an image.
If you're not familiar with HTML, don't worry if you don't understand the below code; the point is to show you the full extent of the flexibility available.
```{r eval = TRUE}
top5 %>%
mutate(col1 = glue("<strong>#{hashtags}</strong>
<br>
tweets: {n}
<br>
likes: {total_likes}"),
img = ifelse(!is.na(image),
glue("<img src='{image}' width='200px' />"),
"")) %>%
select(col1, text, img) %>%
kable(
escape = FALSE, # allows HTML in the table
col.names = c("Hashtag", "Top Tweet", ""),
caption = "Stats and the top tweet for the top five hashtags.") %>%
column_spec(1:2, extra_css = "vertical-align: top;") %>%
row_spec(0, extra_css = "vertical-align: bottom;") %>%
kable_paper()
```