# Spark-Socket Streaming

## Architecture

## How to Run

- Set up a Confluent (Kafka) cluster and configuration

  - Go to Confluent Cloud.
  - Create an environment with a default cluster.
  - Create an API key for your cluster and use it as the values of `sasl.username` and `sasl.password` in `config.py`.
  - Get the bootstrap server URL for your Confluent Kafka cluster and use it as the value of `bootstrap.servers` in `config.py`.
  - Create a Schema Registry API key and use it as the value of `basic.auth.user.info` in `config.py`, in the format `<api_key>:<secret_key>`.
  - Get the Schema Registry URL for your Confluent Kafka cluster and use it as the value of `schema_registry` in `config.py`.
  - Create a topic named `customers_review` with an Avro-based schema from here.
  - Replace `api_key` in `config.py` with your OpenAI API key (see the `config.py` sketch after this list).
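The steps above populate `config.py`. For reference, a minimal sketch of what that file is assumed to look like; the exact variable layout in this repo may differ, every value is a placeholder, and the `security.protocol` / `sasl.mechanism` entries are the usual Confluent Cloud defaults rather than values taken from the repo:

```python
# config.py -- a sketch with placeholder values; substitute your own credentials.
config = {
    # Confluent Kafka cluster; the cluster API key/secret map to SASL credentials.
    "bootstrap.servers": "<BOOTSTRAP_SERVER_URL>",
    "security.protocol": "SASL_SSL",   # typical Confluent Cloud setting (assumption)
    "sasl.mechanism": "PLAIN",         # typical Confluent Cloud setting (assumption)
    "sasl.username": "<CLUSTER_API_KEY>",
    "sasl.password": "<CLUSTER_API_SECRET>",
}

schema_registry = {
    # Schema Registry endpoint and its own API key/secret pair.
    "url": "<SCHEMA_REGISTRY_URL>",
    "basic.auth.user.info": "<SR_API_KEY>:<SR_API_SECRET>",
}

# OpenAI credentials used by the streaming job.
api_key = "<OPENAI_API_KEY>"
```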

- Download the data

  - Get the Yelp dataset from here in JSON format.
  - Unzip the dataset and place the unzipped folder, named `yelp_dataset`, in `datasets` (a quick sanity check is sketched below).
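As a quick sanity check, you can confirm the reviews are readable. This assumes the standard Yelp archive layout, where reviews live in `yelp_dataset/yelp_academic_dataset_review.json` as one JSON object per line:

```python
import json

# Read and print the first review record to confirm the dataset is in place.
with open("datasets/yelp_dataset/yelp_academic_dataset_review.json") as f:
    print(json.loads(f.readline()))
```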
- Run the services

```bash
cd src
docker-compose up -d --build
```
- Start the socket server

```bash
docker exec -it spark-master python jobs/streaming-sockets.py
```
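For context, `jobs/streaming-sockets.py` is the producer side of the pipeline. A minimal sketch of what such a socket server typically does, streaming the Yelp reviews line by line over TCP so Spark can consume them; the port and file path here are assumptions, not taken from the repo:

```python
import json
import socket

HOST, PORT = "0.0.0.0", 9999  # assumed port; must match the Spark job's socket source

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((HOST, PORT))
    server.listen(1)
    conn, _ = server.accept()
    # Stream one JSON review per line to the connected Spark client.
    with conn, open("datasets/yelp_dataset/yelp_academic_dataset_review.json") as f:
        for line in f:
            record = json.loads(line)  # validate before sending
            conn.sendall((json.dumps(record) + "\n").encode("utf-8"))
```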
  • Start the Spark Streaming Job
docker exec -it spark-master spark-submit \
--master spark://spark-master:7077 \
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1 \
jobs/spark-streaming.py
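In outline, the streaming job reads the socket stream and forwards each record to the `customers_review` Kafka topic. A minimal sketch under those assumptions; the actual `jobs/spark-streaming.py` presumably also applies the Avro schema and the OpenAI processing implied by `config.py`, and the Confluent SASL options are omitted here for brevity:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("YelpReviewStream").getOrCreate()

# Read line-delimited JSON reviews from the socket server started above.
reviews = (
    spark.readStream.format("socket")
    .option("host", "spark-master")  # assumed host/port matching the socket server
    .option("port", 9999)
    .load()
)

# The Kafka sink expects a string `value` column; each line is forwarded as-is.
# Confluent Cloud additionally needs kafka.security.protocol / kafka.sasl.* options.
query = (
    reviews.selectExpr("CAST(value AS STRING) AS value")
    .writeStream.format("kafka")
    .option("kafka.bootstrap.servers", "<BOOTSTRAP_SERVER_URL>")
    .option("topic", "customers_review")
    .option("checkpointLocation", "/tmp/checkpoints/customers_review")
    .start()
)

query.awaitTermination()
```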