pip install -r requirements.txt
docker-compose up -d
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
uvicorn api:api --reload
streamlit run streamlit_app/app.py
To start, clone the repository to your local machine:
git clone https://github.com/your-username/your-repo.git
cd your-repo
To start all necessary services (Kafka, Cassandra, Spark, and Airflow), use Docker Compose:
docker-compose up -d
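Once the stack is up, it can help to confirm that every container started correctly; the service names come from the `docker-compose.yml` in this repository:

```bash
# Check that all Compose services (Kafka, Cassandra, Spark, Airflow) are running
docker-compose ps
```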
To set up Airflow with the necessary configurations:
- Place the `spark_streaming_dag.py` file in the Airflow DAGs directory.
- Ensure Kafka and Cassandra connections are properly configured in Airflow.
- Add a new Spark connection in Airflow:
  - Go to Admin > Connections.
  - Click on Add a new record.
  - Fill in the details:
    - Connection ID: `spark_default`
    - Connection Type: `Spark`
    - Host: `spark://spark-master:7077`
    - Extra: `{"master": "spark://spark-master:7077", "deploy-mode": "client", "spark-home": "/home/airflow/.local"}`
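If you prefer the command line over the web UI, the same connection can be created with the Airflow CLI. This is a minimal sketch; the `airflow-webserver` service name is an assumption and should be adjusted to match the `docker-compose.yml`:

```bash
# Create the spark_default connection from the Airflow CLI
# (the "airflow-webserver" service name is an assumption; adjust it to your Compose file)
docker-compose exec airflow-webserver airflow connections add spark_default \
    --conn-type spark \
    --conn-host spark://spark-master:7077 \
    --conn-extra '{"master": "spark://spark-master:7077", "deploy-mode": "client", "spark-home": "/home/airflow/.local"}'
```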
To execute the streaming job:
- Go to the Airflow web interface.
- Locate the `spark_streaming_dag` and `binance-streaming-orchestration` DAGs.
- Toggle the DAGs to On to enable scheduling.
- Alternatively, trigger a DAG manually by clicking the play button next to its name.
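The DAGs can also be enabled and triggered from the Airflow CLI; again, the `airflow-webserver` service name is an assumption to be matched against the Compose file:

```bash
# Unpause and trigger the DAG without going through the web UI
docker-compose exec airflow-webserver airflow dags unpause spark_streaming_dag
docker-compose exec airflow-webserver airflow dags trigger spark_streaming_dag
# Repeat for binance-streaming-orchestration if you want both DAGs active
```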
You can also run the Spark job manually with `spark-submit` if needed. Note that you currently need to run the `spark_stream` script yourself (from PyCharm or another IDE) for the records to actually be inserted into Cassandra.
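For reference, a manual submission could look like the sketch below. It assumes the script is `spark_stream.py` at the repository root and that the cluster runs Spark 3.x with Scala 2.12; adjust the file name and package versions to your setup:

```bash
# Submit the streaming job to the standalone Spark master started by Docker Compose.
# The file name and the package versions are assumptions; match them to your environment.
spark-submit \
  --master spark://spark-master:7077 \
  --deploy-mode client \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1,com.datastax.spark:spark-cassandra-connector_2.12:3.4.1 \
  spark_stream.py
```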
- Connectivity Issues: Ensure that all services (Kafka, Spark, Cassandra) are running and accessible. Use tools like `ping` and `telnet` to check connectivity (see the example after this list).
- Schema Mismatches: Verify that the Kafka topic schema matches the schema expected by Spark.
- Resource Constraints: Adjust the Spark executor and driver memory settings based on your environment.
- Logging: Review the logs in Airflow and Spark to identify errors. Use the Airflow UI or access the logs directly from the containers.
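For example, basic connectivity and log checks from the host might look like this; service names and ports are the usual defaults and depend on the `docker-compose.yml`, and `ping`/`nc` must be available inside the images:

```bash
# Is the Cassandra container reachable from the Spark master?
docker-compose exec spark-master ping -c 3 cassandra
# Is the Kafka broker listening on its default port?
docker-compose exec spark-master nc -zv kafka 9092
# Recent Airflow scheduler logs
docker-compose logs --tail=100 airflow-scheduler
```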
- Add Unit Tests: Incorporate unit testing for Spark jobs and data validation.
- Improve Scalability: Optimize Spark configurations for larger datasets.
- Integrate Monitoring Dashboards: Use tools like Grafana or Prometheus for monitoring cluster and job performance.
This project is licensed by DataScientest and the project authors.