This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

Add topic mine support for the Pushshift verified twitter archive #747

Open
wants to merge 5 commits into master
Conversation

@epenn (Member) commented Dec 15, 2020

This adds support for pulling data from the Pushshift verified Twitter archive. A couple of things of note:

  • Implemented using Elasticsearch's scroll API for paging support.
  • This makes a second round trip after getting the results to fill in the retweeted_status and quoted_status fields (when needed), since Pushshift optimizes for space by removing the payloads from quote tweets and retweets.
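The scroll-based paging described above can be sketched roughly as below. This is a minimal sketch, not the PR's actual implementation: `scroll_all_hits` and the injectable `post` callable are hypothetical names (injecting the HTTP call keeps the sketch testable without a live server); only the Pushshift URLs come from the PR's constants.

```python
def scroll_all_hits(post, query: dict, scroll_timeout: str = '1m') -> list:
    """Collect every hit by following Elasticsearch scroll pages.

    `post(url, body)` performs the HTTP POST and returns the decoded
    JSON response; inject e.g. a requests-based callable in production.
    """
    search_url = f'https://twitter-es.pushshift.io/twitter_verified/_search?scroll={scroll_timeout}'
    scroll_url = 'https://twitter-es.pushshift.io/_search/scroll'

    # Initial search returns the first page plus a scroll cursor.
    response = post(search_url, query)
    hits = list(response['hits']['hits'])
    scroll_id = response.get('_scroll_id')

    # Keep asking for the next page until a page comes back empty.
    while scroll_id:
        page = post(scroll_url, {'scroll': scroll_timeout, 'scroll_id': scroll_id})
        batch = page['hits']['hits']
        if not batch:
            break
        hits.extend(batch)
        scroll_id = page.get('_scroll_id')
    return hits
```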

@epenn epenn requested a review from hroberts December 15, 2020 14:59
@epenn epenn assigned epenn and hroberts and unassigned epenn Dec 15, 2020
@pypt (Contributor) commented Feb 9, 2021

@epenn, would you be able to rebase this on top of the current master branch?

@rahulbot rahulbot requested review from pypt and removed request for hroberts July 6, 2021 14:38
@rahulbot (Contributor) commented Jul 6, 2021

We have the OK to deploy this. Is the code ready to merge and release?

@pypt (Contributor) left a comment

Thank you for the comprehensible code comments, Eric!

For reference, could you also ELI5 what this PR is all about, i.e. what is the "verified Twitter" archive and what is it that we're going to be doing here? I genuinely don't know :)

@@ -185,6 +184,7 @@ END;
$$
LANGUAGE plpgsql;

>>>>>>> origin/master

Leftover from a merge.

-- 1 of 2. Import the output of 'apgdiff':
--

select insert_platform_source_pair( 'twitter', 'pushshift' );
I think this bit should be in mediawords.sql too.

PS_TWITTER_PAGE_SIZE = 10000
PS_TWITTER_SCROLL_TIMEOUT = '1m'
PS_TWITTER_SCROLL_URL = 'https://twitter-es.pushshift.io/_search/scroll'
PS_TWITTER_URL = 'https://twitter-es.pushshift.io/twitter_verified/_search?scroll=%s' % PS_TWITTER_SCROLL_TIMEOUT
Could you use f-strings here and elsewhere (if applicable)? I.e. f'https://twitter-es.pushshift.io/twitter_verified/_search?scroll={PS_TWITTER_SCROLL_TIMEOUT}'
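For clarity, the suggestion amounts to replacing the %-style interpolation with an f-string; both produce the same URL:

```python
PS_TWITTER_SCROLL_TIMEOUT = '1m'

# %-style formatting, as in the current PR:
url_percent = 'https://twitter-es.pushshift.io/twitter_verified/_search?scroll=%s' % PS_TWITTER_SCROLL_TIMEOUT

# Equivalent f-string, as suggested in the review:
url_fstring = f'https://twitter-es.pushshift.io/twitter_verified/_search?scroll={PS_TWITTER_SCROLL_TIMEOUT}'
```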

}


def _mock_elasticsearch_response(posts: dict, scroll_id) -> dict:
Here and elsewhere, would you be able to use more precise typing annotations?

Here's our 30-second tutorial on those: https://github.com/mediacloud/backend/blob/master/doc/coding_guidelines.markdown#declare-function-parameter-and-return-value-types

One doesn't have to go too crazy with those, but an IDE would be really happy if it were hinted that posts is a Dict[str, Union[int, dict, bool, str]] (or something like that). And what's the type of scroll_id anyway: is it an int or a str? Can it be empty? Same for return values.

Comment on lines +42 to +58
def _mock_elasticsearch_hit(post: dict) -> dict:
    """Mock an ElasticSearch hit for a Pushshift tweet."""

    return {
        '_index': 'twitter_verified',
        '_type': '_doc',
        '_id': post['post_id'],
        '_routing': post['post_id'],
        '_score': 123.456,
        '_source': {
            'id': int(post['post_id']),
            'id_str': post['post_id'],
            'screen_name': post['author'],
            'text': post['content'],
            'created_at': post['publish_date']
        }
    }
Would it be possible to move those _mock* helpers to the test file which is doing the actual mocking?

Comment on lines +180 to +181
Returns: None, referenced tweets are stored in tweet['quoted_status'] or
tweet['retweeted_status'] as appropriate.
So this modifies the parameter tweets in-place, right?

Could you make it return the modified tweets instead? Right now it's a bit hard to figure out which specific method adds those quoted_status / retweeted_status fields to the tweets. We have plenty of those C-style "pass a reference" methods that change their arguments in subtle ways all over our codebase, and I can't say I like them too much :)

'post_id': tweet['id_str'],
'data': tweet,
'content': tweet['text'],
'publish_date': publish_date,
Could you verify that publish_date gets converted to the America/New_York timezone at some point, perhaps in the caller? Unfortunately topic_posts.publish_date is a TIMESTAMP [WITHOUT TIME ZONE] in PostgreSQL, so we have to normalize all incoming dates to EST / EEST (or maybe it's UTC for topic_posts specifically? Either way, it has to point to the same moment in time somehow), and I don't know whether the caller does that, or whether you're supposed to do it in a post ingest module such as this one.
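The normalization being asked about could look roughly like this. This is a hypothetical sketch, not code from the PR: it assumes Twitter's `created_at` format and converts to a naive America/New_York timestamp before it reaches the TIMESTAMP (without time zone) column.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+


def normalize_publish_date(created_at: str) -> str:
    """Convert a Twitter created_at string (UTC offset included) to a
    naive America/New_York timestamp string."""
    # Twitter's classic format, e.g. 'Tue Dec 15 14:59:00 +0000 2020'
    aware = datetime.strptime(created_at, '%a %b %d %H:%M:%S %z %Y')
    local = aware.astimezone(ZoneInfo('America/New_York'))
    # Drop the tzinfo so the value matches a TIMESTAMP WITHOUT TIME ZONE column.
    return local.replace(tzinfo=None).isoformat(sep=' ')
```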

@pypt (Contributor) commented Jul 19, 2021

Oh, and I have merged in master, hope that's okay.

@pypt (Contributor) commented Jul 19, 2021

Could you also have a look at the failing tests?

@rahulbot (Contributor) commented:
I can handle the first question on context and purpose. This is part of our effort to build cross-platform topics. Jason over at PushShift.io runs an archive of "verified" tweets that he ingests and maintains. This code lets us import tweets from that archive by adding it as another platform in a Topic: it queries his API for matching tweets, extracts any shared links from them, and adds those into the Topic to be processed (and saves the tweets too). So at a high level this lets us discover links being shared in tweets about a topic and save attention metrics about them.

@pypt (Contributor) commented Jul 19, 2021

Thanks Rahul!
