-
Notifications
You must be signed in to change notification settings - Fork 27
/
Copy path16-Crawling-Followers-to-Approximate-Potential-Influence.Rmd
91 lines (64 loc) · 4.9 KB
/
16-Crawling-Followers-to-Approximate-Potential-Influence.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
---
output: html_document
editor_options:
chunk_output_type: console
---
# Crawling Followers to Approximate Primary Influence
## Problem
You want to approximate someone’s influence based upon their popularity and the popularity of their followers.
## Solution
Use the `rtweet::lookup_users()` and `rtweet::get_followers()` combination to pull primary influence and derive "primary influence" based on "followers-of-followers" counts.
## Discussion
"Influence" is _extremely_ more nuanced than both what the original Python chapter delved into and what this exercise shows. Building "#-removed" total reach counts from a large tree traversal (as the Python version suggests) is worthless on face-value since it doesn't take into account how many times any of the 'n-depth' of followers ever retweeted or even favorited content posted by the seminal user over the course of a certain period. Without gathering such stats, multi-depth "followers-of-followers" is nigh meaningless.
_However_, once-removed (so the follower counts of those directly following the target user) has some merit. Marketing folks have varying names for this statistic, so we'll just call it "primary influence" since there is legitimate potential of reaching this once-removed audience. Ideally, the retweet- and fav-counts should be factored in, but that can be added on as an exercise.
Let's create a helper function that will capture a snapshot of this _primary influence_ metric. It will take in a user id or name, pull in that user info and the details of their followers (specify `TRUE` to the `all` parameter if the user has more than 5,000 followers and you want to wait even longer to get their complete influence) and then sum up all the follower counts to get the overall reach number. It returns this information (so it can be processed again without API calls) but also produces a graph of the "number of followers" distribution of the first-level followers. This is usually a heavily skewed distribution so the function also defaults to a log scale, but can be overridden to use a linear scale (log scale will be the correct choice the vast majority of the time, but the function can be modified --- as an exercise -- to test the distribution and auto-pick scales).
```{r 16_lib, message=FALSE, warning=FALSE}
library(rtweet)
library(hrbrthemes)
library(tidyverse)
```
```{r include=FALSE, echo=FALSE}
extrafont::loadfonts(quiet=TRUE)
```
```{r 16_snap, message=FALSE, warning=FALSE, cache=TRUE}
influence_snapshot <- function(user, all = FALSE, trans=c("log10", "identity")) {
user <- user[1]
trans <- match.arg(tolower(trimws(trans[1])), c("log10", "identity"))
scale_lab <- ""
if (trans == "log10") sclae_lab <- " (log scale)"
user_info <- lookup_users(user)
n <- if (all[1]) user_info$followers_count else 5000
user_followers <- get_followers(user_info$user_id)
uf_details <- lookup_users(user_followers$user_id)
primary_influence <- sum(c(uf_details$followers_count, user_info$followers_count))
filter(uf_details, followers_count > 0) %>%
ggplot(aes(followers_count)) +
geom_density(aes(y=..count..), color="lightslategray", fill="lightslategray",
alpha=2/3, size=1) +
scale_x_continuous(expand=c(0,0), trans=trans, labels=scales::comma) +
scale_y_comma() +
labs(
x=sprintf("Number of Followers of Followers%s", scale_lab),
y="Number of Followers",
title=sprintf("Follower chain distribution of %s (@%s)", user_info$name, user_info$screen_name),
subtitle=sprintf("Follower count: %s; Primary influence/reach: %s",
scales::comma(user_info$followers_count),
scales::comma(primary_influence))
) +
theme_ipsum_rc(grid="XY") -> gg
print(gg)
return(invisible(list(user_info=user_info, follower_details=uf_details)))
}
```
Let's run it on [Julia Silge](https://twitter.com/juliasilge), an incredibly talented data scientist over at Stack Overflow and co-author of [Tidy Text Mining with R](https://www.tidytextmining.com/) --- a book that should be on your shelf _especially_ if you're doing Twitter mining.
```{r 16_js, message=FALSE, warning=FALSE, cache=TRUE, fig.width=10, fig.height=4.5}
juliasilge <- influence_snapshot("juliasilge")
glimpse(juliasilge)
```
You'll see that distribution shape quite a bit given the general nature of the social structure of Twitter. Don't believe that? Let's do one more, this time for [Maëlle Salmon](https://twitter.com/ma_salmon/), another incredibly talented data scientist with a [blog](http://www.masalmon.eu/) you _must_ follow if you want to learn how to do fun and useful things with R.
```{r 16_ms, message=FALSE, warning=FALSE, cache=TRUE, fig.width=10, fig.height=4.5}
ma_salmon <- influence_snapshot("ma_salmon")
glimpse(ma_salmon)
```
## See Also
- Simply Measured's [guide to measuring influence on Twitter](https://simplymeasured.com/blog/7-ways-to-measure-influence-on-twitter/)