-
Set up Confluent (Kafka) Cluster and Configuration
-
Go to Confluent Cloud
-
Create an environment with default cluster
-
Create an API Key for your Cluster and replace it with
sasl.username
andsasl.password
in config.py -
Get the Bootstrap server URL for your confluent kafka cluster and replace it with
bootstrap.servers
in config.py -
Create an Schema Registry API Key and replace it with
basic.auth.user.info
in format<api_key>:<secret_jey>
in config.py -
Get the Schema Registry URL for your confluent kafka cluster and replace it with
schema_registry
in config.py -
Create an Topic name
customers_review
with schema (AVRO Based) from here -
Replace
api_key
in config.py with OPENAI-API Key.
-
-
Download the Data
- Get the Yelp dataset from here in
JSON
format. - Unzip the dataset and place the unzipped folder named
yelp_dataset
indatasets
- Get the Yelp dataset from here in
-
Run the services
cd src
docker-compose up -d --build
- Start the Socket Server
docker exec -it spark-master python jobs/streaming-sockets.py
- Start the Spark Streaming Job
docker exec -it spark-master spark-submit \
--master spark://spark-master:7077 \
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1 \
jobs/spark-streaming.py