# Web Scraping ETL Project

This project scrapes data from websites, processes it with PySpark, and manages the entire ETL (Extract, Transform, Load) process with Apache Airflow.
## Contents

- Requirements
- Environment Setup
- Running the Project
- Configuration
- Project Structure
- References

## Requirements

- Docker and Docker Compose
- Python 3.9

## Environment Setup

### Clone the Repository

```bash
git clone <URL_OF_YOUR_REPOSITORY>
```
### Install Docker

Ensure that you have Docker and Docker Compose installed. You can download and install Docker from the official Docker website.
### Create a .env File

Create a `.env` file in the `config/` directory with the following content, replacing the values with your actual information:

```plaintext
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
DB_HOST=localhost
DB_PORT=5432
DB_USER=your-username
DB_PASSWORD=your-password
DB_NAME=web_log_db
```
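These variables are read by the containers at runtime. As a point of reference only, a Python job inside the stack could pick them up with `os.environ` and assemble the JDBC URL that PySpark uses to reach PostgreSQL; the snippet below is a minimal sketch, not the project's actual configuration code:

```python
import os

# Database settings passed in from config/.env; names match the .env template above.
# Defaults mirror the example values shown in the template.
db_host = os.environ.get("DB_HOST", "localhost")
db_port = os.environ.get("DB_PORT", "5432")
db_user = os.environ.get("DB_USER", "")
db_password = os.environ.get("DB_PASSWORD", "")
db_name = os.environ.get("DB_NAME", "web_log_db")

# JDBC connection URL in the form PySpark's JDBC writer expects for PostgreSQL.
jdbc_url = f"jdbc:postgresql://{db_host}:{db_port}/{db_name}"
print(jdbc_url)
```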
### Install Dependencies

Build the Docker image and install the dependencies by running:

```bash
docker-compose build
```

## Running the Project

### Start the Services

Start the services using Docker Compose:
```bash
docker-compose up
```

- The PostgreSQL service stores the processed data.
- The Airflow service runs Apache Airflow to manage the ETL process.

### Access Apache Airflow

Open a browser and go to http://localhost:8080 to access the Apache Airflow interface. Log in with the default credentials `airflow`/`airflow`.
### Monitor the DAG

In the Airflow interface, you will see the `vnexpress_etl` DAG. Activate (unpause) this DAG to start the ETL process.
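The DAG itself lives in `airflow/dags/vnexpress_etl_dag.py`. For orientation, a minimal Airflow DAG of this shape typically looks like the sketch below; the task names and Python callables here are illustrative assumptions, not the project's actual implementation:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_articles():
    """Placeholder: scrape raw articles and stage them under data/."""
    ...


def transform_with_spark():
    """Placeholder: run the PySpark job that cleans the raw data."""
    ...


def load_to_postgres():
    """Placeholder: write the processed data into the web_log_db database."""
    ...


with DAG(
    dag_id="vnexpress_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_articles)
    transform = PythonOperator(task_id="transform", python_callable=transform_with_spark)
    load = PythonOperator(task_id="load", python_callable=load_to_postgres)

    # Classic linear ETL ordering: extract, then transform, then load.
    extract >> transform >> load
```

Once unpaused, Airflow schedules runs according to the DAG's interval and shows each task's status in the DAG's Graph view.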
## Configuration

- `Dockerfile`: Builds the Docker image for the project, including PySpark and Apache Airflow.
- `docker-compose.yml`: Defines the services, including PostgreSQL and Airflow.
- `vnexpress_etl_dag.py`: Defines the ETL process as an Airflow DAG.
- `vnexpress_processing.py`: Processes the collected data using PySpark (a rough sketch of this step is given after the references below).
- `requirements.txt`: List of Python dependencies required for the project.

## Project Structure

- `airflow/dags/`: Airflow DAGs.
- `config/`: Configuration files and environment variables.
- `data/`: Data collected from VNExpress.
- `pyspark_jobs/`: PySpark code for data processing.
- `requirements.txt`: List of libraries to install.
- `Dockerfile`: Dockerfile to build the Docker image.
- `docker-compose.yml`: Docker Compose configuration to manage the services.

## References

- Apache Airflow Documentation
- PySpark Documentation
- BeautifulSoup Documentation
- Requests Documentation
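For orientation on the transform step: `pyspark_jobs/vnexpress_processing.py` follows the usual Spark pattern of reading the scraped data, cleaning it, and writing the result to PostgreSQL over JDBC. The sketch below only illustrates that pattern; the input path, column names, and table name are assumptions, not the project's actual schema:

```python
import os

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Illustrative sketch only; see pyspark_jobs/vnexpress_processing.py for the real logic.
spark = (
    SparkSession.builder
    .appName("vnexpress_processing")
    .getOrCreate()
)

# Read the raw scraped articles staged by the extract step (path is hypothetical).
raw = spark.read.json("data/vnexpress_raw.json")

# Basic cleaning: drop rows without a title and normalise the publish date.
processed = (
    raw.dropna(subset=["title"])
       .withColumn("published_at", F.to_timestamp("published_at"))
)

# Write the result to the PostgreSQL service defined in docker-compose.yml.
# Requires the PostgreSQL JDBC driver on the Spark classpath.
jdbc_url = (
    f"jdbc:postgresql://{os.environ.get('DB_HOST', 'localhost')}:"
    f"{os.environ.get('DB_PORT', '5432')}/{os.environ.get('DB_NAME', 'web_log_db')}"
)
(
    processed.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "articles")
    .option("user", os.environ.get("DB_USER", ""))
    .option("password", os.environ.get("DB_PASSWORD", ""))
    .option("driver", "org.postgresql.Driver")
    .mode("append")
    .save()
)

spark.stop()
```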