This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

Add topic mine support for the Pushshift verified twitter archive #747

Open
wants to merge 5 commits into master
Conversation

@epenn (Member) commented Dec 15, 2020

This adds support for pulling data from the Pushshift verified Twitter archive. A couple of things of note:

  • Implemented using Elasticsearch's scroll API for paging support.
  • This makes a second round trip after getting the results to fill in the retweeted_status and quoted_status fields (when needed), since Pushshift optimizes for space by removing the payloads from quote tweets and retweets.
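The scroll-based paging described above can be sketched roughly as below. This is a minimal sketch, not the PR's actual implementation: `scroll_all_hits` and the injectable `post` callable are hypothetical names (injecting the HTTP call keeps the sketch testable without a live server); only the Pushshift URLs come from the PR's constants.

```python
def scroll_all_hits(post, query: dict, scroll_timeout: str = '1m') -> list:
    """Collect every hit by following Elasticsearch scroll pages.

    `post(url, body)` performs the HTTP POST and returns the decoded
    JSON response; inject e.g. a requests-based callable in production.
    """
    search_url = f'https://twitter-es.pushshift.io/twitter_verified/_search?scroll={scroll_timeout}'
    scroll_url = 'https://twitter-es.pushshift.io/_search/scroll'

    # Initial search returns the first page plus a scroll cursor.
    response = post(search_url, query)
    hits = list(response['hits']['hits'])
    scroll_id = response.get('_scroll_id')

    # Keep asking for the next page until a page comes back empty.
    while scroll_id:
        page = post(scroll_url, {'scroll': scroll_timeout, 'scroll_id': scroll_id})
        batch = page['hits']['hits']
        if not batch:
            break
        hits.extend(batch)
        scroll_id = page.get('_scroll_id')
    return hits
```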

@epenn epenn requested a review from hroberts December 15, 2020 14:59
@epenn epenn assigned epenn and hroberts and unassigned epenn Dec 15, 2020
@pypt (Contributor) commented Feb 9, 2021

@epenn, would you be able to rebase this on top of the current master branch?

@rahulbot rahulbot requested review from pypt and removed request for hroberts July 6, 2021 14:38
@rahulbot (Contributor) commented Jul 6, 2021

We have the OK to deploy this. Is the code ready to merge and release?

@pypt (Contributor) left a comment

Thank you for the comprehensible code comments, Eric!

For reference, could you also ELI5 what this PR is all about, i.e. what is the "verified Twitter" archive and what is it that we're going to be doing here? I genuinely don't know :)

@@ -185,6 +184,7 @@ END;
$$
LANGUAGE plpgsql;

>>>>>>> origin/master

Leftover from a merge.

-- 1 of 2. Import the output of 'apgdiff':
--

select insert_platform_source_pair( 'twitter', 'pushshift' );
I think this bit should be in mediawords.sql too.

PS_TWITTER_PAGE_SIZE = 10000
PS_TWITTER_SCROLL_TIMEOUT = '1m'
PS_TWITTER_SCROLL_URL = 'https://twitter-es.pushshift.io/_search/scroll'
PS_TWITTER_URL = 'https://twitter-es.pushshift.io/twitter_verified/_search?scroll=%s' % PS_TWITTER_SCROLL_TIMEOUT
Could you use f-strings here and elsewhere (if applicable)? I.e. f'https://twitter-es.pushshift.io/twitter_verified/_search?scroll={PS_TWITTER_SCROLL_TIMEOUT}'
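For clarity, the suggestion amounts to replacing the %-style interpolation with an f-string; both produce the same URL:

```python
PS_TWITTER_SCROLL_TIMEOUT = '1m'

# %-style formatting, as in the current PR:
url_percent = 'https://twitter-es.pushshift.io/twitter_verified/_search?scroll=%s' % PS_TWITTER_SCROLL_TIMEOUT

# Equivalent f-string, as suggested in the review:
url_fstring = f'https://twitter-es.pushshift.io/twitter_verified/_search?scroll={PS_TWITTER_SCROLL_TIMEOUT}'
```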

}


def _mock_elasticsearch_response(posts: dict, scroll_id) -> dict:
Here and elsewhere, would you be able to use more precise typing annotations?

Here's our 30-second tutorial on those: https://github.com/mediacloud/backend/blob/master/doc/coding_guidelines.markdown#declare-function-parameter-and-return-value-types

One doesn't have to go too crazy with those, but an IDE would be really happy if it were hinted that posts is a Dict[str, Union[int, dict, bool, str]] (or something like that). And what's the type of scroll_id anyway: is it an int or a str? Can it be empty? Same for return values.

Comment on lines +42 to +58
def _mock_elasticsearch_hit(post: dict) -> dict:
    """Mock an ElasticSearch hit for a Pushshift tweet."""

    return {
        '_index': 'twitter_verified',
        '_type': '_doc',
        '_id': post['post_id'],
        '_routing': post['post_id'],
        '_score': 123.456,
        '_source': {
            'id': int(post['post_id']),
            'id_str': post['post_id'],
            'screen_name': post['author'],
            'text': post['content'],
            'created_at': post['publish_date']
        }
    }
Would it be possible to move those _mock* helpers to the test file which is doing the actual mocking?

Comment on lines +180 to +181
Returns: None, referenced tweets are stored in tweet['quoted_status'] or
tweet['retweeted_status'] as appropriate.
So this modifies the parameter tweets in-place, right?

Could you make it return the modified tweets instead? Right now it's a bit hard to figure out which specific method adds those quoted_status / retweeted_status fields to the tweets. We have plenty of those C-style "pass a reference" methods that change their arguments in subtle ways all over our codebase, and I can't say I like them too much :)

'post_id': tweet['id_str'],
'data': tweet,
'content': tweet['text'],
'publish_date': publish_date,
Could you verify that publish_date gets converted to the America/New_York timezone at some point, perhaps in the caller? Unfortunately topic_posts.publish_date is a TIMESTAMP [WITHOUT TIME ZONE] in PostgreSQL, so we have to normalize all incoming dates to EST / EEST (or maybe it's UTC for topic_posts specifically? Either way, it has to point to the same moment in time somehow), and I don't know whether the caller does that, or whether you're supposed to do it in a post ingest module such as this one.
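The normalization being asked about could look roughly like this. This is a hypothetical sketch, not code from the PR: it assumes Twitter's `created_at` format and converts to a naive America/New_York timestamp before it reaches the TIMESTAMP (without time zone) column.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+


def normalize_publish_date(created_at: str) -> str:
    """Convert a Twitter created_at string (UTC offset included) to a
    naive America/New_York timestamp string."""
    # Twitter's classic format, e.g. 'Tue Dec 15 14:59:00 +0000 2020'
    aware = datetime.strptime(created_at, '%a %b %d %H:%M:%S %z %Y')
    local = aware.astimezone(ZoneInfo('America/New_York'))
    # Drop the tzinfo so the value matches a TIMESTAMP WITHOUT TIME ZONE column.
    return local.replace(tzinfo=None).isoformat(sep=' ')
```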

@pypt (Contributor) commented Jul 19, 2021

Oh, and I have merged in master, hope that's okay.

@pypt (Contributor) commented Jul 19, 2021

Could you also have a look at the failing tests?

@rahulbot (Contributor) commented:
I can handle the first question on context and purpose. This is part of our effort to build cross-platform topics. Jason over at PushShift.io runs an archive of "verified" tweets that he ingests and maintains. This code lets us import tweets from that archive by adding it as another platform in a Topic: it queries his API for matching tweets, extracts any shared links from them, and adds those into the Topic to be processed (and saves the tweets too). So at a high level this lets us discover links being shared in tweets about a topic and save attention metrics about them.

@pypt (Contributor) commented Jul 19, 2021

Thanks Rahul!
