This project extracts Twitter data using the Twitter API and stores it in an AWS S3 bucket. The data pipeline is orchestrated and scheduled with Apache Airflow.
- Twitter API: Facilitates data collection from Twitter.
- Apache Airflow: Orchestrates and schedules ETL jobs.
- Amazon S3: Acts as the storage destination for the extracted Twitter data.
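The repository's own task code is the source of truth; as a rough, hedged sketch of how these three components fit together, an extract–transform–load step could look like the following. The function name `run_twitter_etl`, the query, the S3 key, and the placeholder credentials are illustrative assumptions, and tweepy v4 with the Twitter API v2 recent-search endpoint is assumed.

```python
import io

import boto3
import pandas as pd
import tweepy

# Illustrative values -- replace with your own credentials and bucket.
BEARER_TOKEN = "<your Twitter API bearer token>"
BUCKET_NAME = "<your S3 bucket>"


def run_twitter_etl(query: str = "#python", max_results: int = 100) -> None:
    """Extract recent tweets, transform them into a table, and load a CSV to S3."""
    # Extract: query the Twitter API v2 recent-search endpoint via tweepy.
    client = tweepy.Client(bearer_token=BEARER_TOKEN)
    response = client.search_recent_tweets(
        query=query,
        max_results=max_results,
        tweet_fields=["created_at"],
    )

    # Transform: flatten the tweets into a pandas DataFrame.
    rows = [
        {"id": t.id, "text": t.text, "created_at": t.created_at}
        for t in (response.data or [])
    ]
    df = pd.DataFrame(rows)

    # Load: serialize the DataFrame as CSV and upload it to S3 with boto3.
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    boto3.client("s3").put_object(
        Bucket=BUCKET_NAME,
        Key="twitter_data/tweets.csv",
        Body=buffer.getvalue(),
    )
```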
- Operating System: Windows or Ubuntu
- Specifications: 4/8+ vCores, 4+ GB memory, 32/64 GB storage
- Twitter developer account and API key
- AWS account with S3 bucket configured
- Python 3
- Required Python packages: tweepy, pandas, boto3
- Clone the repository:
git clone https://github.com/itsAniketNow/Twitter-Data-ETL
- Install the required packages:
pip install -r requirements.txt
- Create an airflow_etl_key.pem file in the root directory and add the following variables:
AccessKey = "[Your Twitter API key]"
AccessSecret = "[Your Twitter API secret key]"
BearerToken = "[Your Twitter API bearer token]"
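How the project actually reads airflow_etl_key.pem is defined in the repository; assuming the simple KEY = "value" lines shown above, a minimal loader sketch might look like this (the load_keys helper is hypothetical):

```python
import re


def load_keys(path: str = "airflow_etl_key.pem") -> dict:
    """Parse simple KEY = "value" lines from the credentials file (assumed format)."""
    keys = {}
    pattern = re.compile(r'^\s*(\w+)\s*=\s*"(.*)"\s*$')
    with open(path) as f:
        for line in f:
            match = pattern.match(line)
            if match:
                keys[match.group(1)] = match.group(2)
    return keys


# Example: creds["BearerToken"] could then be passed to tweepy.Client(bearer_token=...).
creds = load_keys()
```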
- Start the Airflow webserver:
airflow standalone
- Access Airflow in your browser on the default port (8080).
- In the Airflow UI, enable the DAG defined in twitter_DAG.py.
- The pipeline will run on the schedule defined in the DAG and load the data into the specified S3 bucket in CSV format.
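The real DAG lives in twitter_DAG.py; as a hedged illustration of how the schedule ties the ETL callable together, a minimal Airflow 2.x DAG might look like the sketch below. The dag_id, task_id, schedule, and the `twitter_etl` module import are assumptions, not the project's actual names.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Assumed import of the ETL callable; the repository's module name may differ.
from twitter_etl import run_twitter_etl

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="twitter_dag",
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # runs once a day; adjust to your needs
    catchup=False,
) as dag:
    extract_load = PythonOperator(
        task_id="run_twitter_etl",
        python_callable=run_twitter_etl,
    )
```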
- Customize the DAG schedule as per your requirements in the twitter_data_pipeline.py file.
- You can modify the ETL process to handle different data formats or storage destinations (see the sketch after this list).
- Ensure proper error handling and logging for smooth pipeline execution.
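As one hedged example of the customizations above, the load step could be swapped to write Parquet instead of CSV and emit log messages for easier debugging. The load_to_s3_parquet helper and the pyarrow dependency are assumptions, not part of the project:

```python
import io
import logging

import boto3
import pandas as pd

logger = logging.getLogger(__name__)


def load_to_s3_parquet(df: pd.DataFrame, bucket: str, key: str) -> None:
    """Upload a DataFrame to S3 as Parquet instead of CSV (requires pyarrow)."""
    buffer = io.BytesIO()
    df.to_parquet(buffer, index=False)  # pyarrow or fastparquet must be installed
    try:
        boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buffer.getvalue())
        logger.info("Uploaded %d rows to s3://%s/%s", len(df), bucket, key)
    except Exception:
        logger.exception("Upload to s3://%s/%s failed", bucket, key)
        raise
```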
Thank you for exploring this Twitter Data ETL project. For any inquiries or feedback, please contact Aniket Surwade.