Crawl Twitter Social Networks using the Twitter API v2 with Tweepy and Python.

Table of contents:

  • ⭐ Introduction
  • 🔑 Prerequisites
  • 📖 Dependencies
  • ⚙️ Configure the crawler
  • 🔍 Build queries
  • ▶️ Run the crawler
  • 👀 View the crawler output

⭐ Introduction

This repository contains a Twitter crawler that collects the following data using the Twitter API v2 with the Tweepy library and Python:

  • User profile (e.g., name, profile description, following count, follower count, tweet count)
  • Tweet content (e.g., text, author, hashtags, topics)
  • Follow interactions
  • Post interactions
  • Retweet interactions
  • Like interactions

🔑 Prerequisites

To work with the Twitter API v2, you have to sign up for a developer account and obtain keys and tokens for API access. Learn more about how to access the Twitter API.

  1. Sign up for a Twitter developer account.
  2. Create a Project and App.
  3. Find or generate the following credentials within your developer App (a quick sanity check is sketched after this list):
    • Consumer Key and Secret
    • Access Token and Secret
    • Bearer Token
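
Once you have these credentials, you can sanity-check them before configuring the crawler. A minimal sketch, assuming Tweepy 4.x:

```python
import tweepy

# Verify the credentials with a simple authenticated call.
# get_me() requires OAuth 1.0a user context
# (consumer key/secret plus access token/secret).
client = tweepy.Client(
    bearer_token="<REPLACE ME>",
    consumer_key="<REPLACE ME>",
    consumer_secret="<REPLACE ME>",
    access_token="<REPLACE ME>",
    access_token_secret="<REPLACE ME>",
)
print(client.get_me().data.username)
```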

By default, Twitter only allows you to search tweets from the last 7 days. If you would like to access the full archive of tweets, Academic Research access is required. Learn more about Academic Research access.
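
The two access levels map to different Tweepy endpoints. A minimal sketch, reusing the client from the snippet above:

```python
# Standard access: search tweets from the last 7 days only.
recent = client.search_recent_tweets(query="#tech (#apple OR #iphone)", max_results=100)

# Academic Research access: search the full tweet archive.
archive = client.search_all_tweets(query="#tech (#apple OR #iphone)", max_results=100)

for tweet in recent.data or []:
    print(tweet.id, tweet.text)
```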


📖 Dependencies

The script was developed with the following dependencies:

  • tweepy==4.13.0
  • pandas==2.0.0
  • omegaconf==2.3.0
  • tqdm==4.65.0

Install all dependencies:

```bash
pip install -r requirements.txt
```

⚙️ Configure the crawler

You can manage the configuration in config.yaml.

  1. Specify whether you have Academic Research access.

```yaml
ACADEMIC_ACCESS: True
```

  2. Place your credentials.

```yaml
# Place your credentials.
CONSUMER_KEY: "<REPLACE ME>"
CONSUMER_SECRET: "<REPLACE ME>"
ACCESS_TOKEN: "<REPLACE ME>"
ACCESS_SECRET: "<REPLACE ME>"
BEARER_TOKEN: "<REPLACE ME>"
```

  3. Specify your base query, tweet fields, user fields, time period, and crawling limit (a sketch of how these values are loaded follows below).
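
Since omegaconf is one of the listed dependencies, the configuration is presumably loaded along the following lines. This is a minimal sketch, not the repository's exact code; only the keys shown above are confirmed:

```python
from omegaconf import OmegaConf
import tweepy

# Load config.yaml (omegaconf is a listed dependency, so the crawler
# presumably reads its configuration this way).
cfg = OmegaConf.load("config.yaml")

client = tweepy.Client(
    bearer_token=cfg.BEARER_TOKEN,
    consumer_key=cfg.CONSUMER_KEY,
    consumer_secret=cfg.CONSUMER_SECRET,
    access_token=cfg.ACCESS_TOKEN,
    access_token_secret=cfg.ACCESS_SECRET,
)

# Pick the search endpoint that matches the configured access level.
search = client.search_all_tweets if cfg.ACADEMIC_ACCESS else client.search_recent_tweets
```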

🔍 Build queries

You can specify your queries in query.json. Learn more about how to build queries.

  • Example: searching for tweets that contain the keyword "twitter" and are from the user @elonmusk.

```json
[
    {
        "context": "",
        "keyword": "twitter",
        "user": "from:elonmusk"
    }
]
```

You can also search by a specific domain and entity. Learn more about tweet annotations and see all available domains and entities.

  • Example: searching for tweets within the "Tech News" domain that contain the hashtag "#tech" and one of "#apple" or "#iphone".

```json
[
    {
        "context": "context:131.840160819388141570",
        "keyword": "#tech (#apple OR #iphone)",
        "user": ""
    }
]
```
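
Each entry splits one search query into a context, keyword, and user part. Presumably the crawler joins the non-empty fields into a single query string; the sketch below shows that composition (the join logic is an assumption, and `search` is the callable from the configuration sketch above):

```python
import json

# Load the query definitions from query.json.
with open("query.json") as f:
    queries = json.load(f)

for q in queries:
    # Join the non-empty parts into one Twitter search query, e.g.
    # "context:131.840160819388141570 #tech (#apple OR #iphone)".
    query_string = " ".join(part for part in (q["context"], q["keyword"], q["user"]) if part)
    response = search(query=query_string, max_results=100)
```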

▶️ Run the crawler

You can run the crawler using the following command:

```bash
python crawler.py
```

You can also pass arguments to the command:

```bash
python crawler.py [--cfg] <config_file>
```
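
For example, to point the crawler at a different configuration file (my_config.yaml here is just an illustrative name):

```bash
python crawler.py --cfg my_config.yaml
```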

👀 View the crawler output

Once crawling is finished, the following .csv files are created under the ./data directory:

```
data
├─ tweet.csv
├─ user.csv
├─ follow.csv
├─ post.csv
├─ retweet.csv
└─ like.csv
```

Examples of crawler outputs:

  • Examples of tweet.csv:

    | id | author_id | text |
    | --- | --- | --- |
    | 1649919766742614017 | 44196397 | 'The least bad solution to the AGI control problem that I can think of is to give every verified human a vote' |

  • Examples of user.csv:

    | id | name | username | description | verified | followers_count | following_count | tweet_count |
    | --- | --- | --- | --- | --- | --- | --- | --- |
    | 44196397 | Elon Musk | elonmusk | nothing | True | 136395633 | 241 | 25051 |

  • Examples of follow.csv:

    | user_id | following_id |
    | --- | --- |
    | 44196397 | 797727562235609088 |

  • Examples of post.csv, retweet.csv, and like.csv:

    | user_id | tweet_id |
    | --- | --- |
    | 44196397 | 1649919766742614017 |
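
Since pandas is already a dependency, the outputs are easy to inspect or join. A minimal sketch, using only the file and column names shown above:

```python
import pandas as pd

# Load two of the crawler outputs.
tweets = pd.read_csv("data/tweet.csv")
likes = pd.read_csv("data/like.csv")

# Attach the tweet text to each like interaction.
liked = likes.merge(tweets, left_on="tweet_id", right_on="id")
print(liked[["user_id", "text"]].head())
```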