Web Scraping ETL Project

This project scrapes data from websites, processes it with PySpark, and manages the entire ETL (Extract, Transform, Load) pipeline with Apache Airflow, covering everything from data collection (web scraping) through data processing to workflow orchestration. You can extend the project by adding more complex processing tasks or by running the pipeline on different schedules in Airflow.
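The extract step itself is not reproduced in this README. As a rough illustration of what scraping with Requests and BeautifulSoup could look like, here is a minimal sketch; the URL, CSS selector, and output path are assumptions, not the repository's actual code.

```python
# Minimal scraping sketch (assumed selector and paths, not the repo's actual code).
import json
import requests
from bs4 import BeautifulSoup

def scrape_vnexpress(url="https://vnexpress.net", out_path="data/articles.json"):
    # Fetch the landing page; a real job would likely iterate over category pages.
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    articles = []
    # "h3.title-news a" is a guess at the headline selector; adjust to the real markup.
    for link in soup.select("h3.title-news a"):
        articles.append({"title": link.get_text(strip=True), "url": link.get("href")})

    # Write raw records for the PySpark job to pick up later.
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(articles, f, ensure_ascii=False, indent=2)
    return articles

if __name__ == "__main__":
    print(f"Scraped {len(scrape_vnexpress())} articles")
```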

Contents

- Requirements
- Environment Setup
- Running the Project
- Configuration
- Project Structure
- References

Requirements

- Docker and Docker Compose
- Python 3.9

Environment Setup

Clone the Repository

git clone <URL_OF_YOUR_REPOSITORY>

Install Docker

Ensure that you have Docker and Docker Compose installed. You can download and install Docker from the official Docker website.

Create .env File

Create a .env file in the config/ directory with the following content, and replace the values with your actual information:

AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
DB_HOST=localhost
DB_PORT=5432
DB_USER=your-username
DB_PASSWORD=your-password
DB_NAME=web_log_db
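These variables have to reach the scraping and loading code at runtime. A minimal sketch of one way to read them, assuming python-dotenv (the project may instead inject them through docker-compose), looks like this:

```python
# Sketch of reading config/.env (assumes python-dotenv; the repo may instead pass
# these variables to the containers through docker-compose).
import os
from dotenv import load_dotenv

load_dotenv("config/.env")  # populate os.environ from the file

def postgres_url():
    # Build a JDBC-style URL that a PySpark job could use when writing to PostgreSQL.
    host = os.environ["DB_HOST"]
    port = os.environ.get("DB_PORT", "5432")
    name = os.environ["DB_NAME"]
    return f"jdbc:postgresql://{host}:{port}/{name}"

DB_PROPERTIES = {
    "user": os.environ.get("DB_USER", ""),
    "password": os.environ.get("DB_PASSWORD", ""),
    "driver": "org.postgresql.Driver",
}
```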

Install Dependencies

Build the Docker image and install dependencies by running:

docker-compose build

Running the Project

Start the Services

Start the services using Docker Compose:

docker-compose up

The PostgreSQL service stores the processed data, and the Airflow service runs Apache Airflow to manage the ETL process.
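As a rough illustration of the transform-and-load step that vnexpress_processing.py performs, the following hypothetical PySpark job reads the scraped records and writes them into the PostgreSQL service over JDBC; the file path, column names, target table, and connection details are assumptions, not the repository's actual code.

```python
# Hypothetical PySpark job: read scraped articles, apply a light transform, load to Postgres.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("vnexpress_processing")
    # Pull in the PostgreSQL JDBC driver (version is an assumption).
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
    .getOrCreate()
)

# Read the raw records produced by the scraping step.
raw = spark.read.json("data/articles.json", multiLine=True)

# Example transform: trim titles, drop rows without a URL, stamp ingestion time.
clean = (
    raw.withColumn("title", F.trim(F.col("title")))
       .filter(F.col("url").isNotNull())
       .withColumn("ingested_at", F.current_timestamp())
)

# Load into the PostgreSQL service started by docker-compose (host/credentials assumed).
clean.write.jdbc(
    url="jdbc:postgresql://localhost:5432/web_log_db",
    table="articles",
    mode="append",
    properties={"user": "your-username", "password": "your-password",
                "driver": "org.postgresql.Driver"},
)
```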

Access Apache Airflow

Open a browser and go to http://localhost:8080 to access the Apache Airflow interface. Log in with the default credentials airflow/airflow.

Monitor DAG

In the Airflow interface, you will see the vnexpress_etl DAG. Activate this DAG to start the ETL process.
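The DAG file itself is not reproduced here; a minimal sketch of what a vnexpress_etl DAG could look like, with assumed task names, schedule, and helper imports, is shown below.

```python
# Hypothetical sketch of airflow/dags/vnexpress_etl_dag.py; task names, schedule,
# and the imported helpers are assumptions, not the repository's actual DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from scraper import scrape_vnexpress          # assumed extract helper
from vnexpress_processing import run_pyspark  # assumed transform/load helper

with DAG(
    dag_id="vnexpress_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="scrape_vnexpress", python_callable=scrape_vnexpress)
    transform_load = PythonOperator(task_id="process_with_pyspark", python_callable=run_pyspark)

    extract >> transform_load
```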

Configuration

- Dockerfile: Configuration to create the Docker image for the project, including PySpark and Apache Airflow.
- docker-compose.yml: Configuration for the services, including PostgreSQL and Airflow.
- vnexpress_etl_dag.py: Defines the ETL process using Airflow.
- vnexpress_processing.py: Processes the collected data using PySpark.
- requirements.txt: List of dependencies required for the project.

Project Structure

- airflow/dags/: Contains Airflow DAGs.
- config/: Contains configuration files and environment variables.
- data/: Directory for data collected from VNExpress.
- pyspark_jobs/: Contains PySpark code for data processing.
- requirements.txt: List of libraries to install.
- Dockerfile: Dockerfile to create the Docker image.
- docker-compose.yml: Docker Compose configuration to manage services.

References

- Apache Airflow Documentation
- PySpark Documentation
- BeautifulSoup Documentation
- Requests Documentation
