# Spark-Socket Streaming

## Architecture

## How to Run

- Set up a Confluent (Kafka) cluster and configuration

  - Go to Confluent Cloud.
  - Create an environment with a default cluster.
  - Create an API key for your cluster and use it as the values of `sasl.username` and `sasl.password` in `config.py`.
  - Get the bootstrap server URL for your Confluent Kafka cluster and use it as the value of `bootstrap.servers` in `config.py`.
  - Create a Schema Registry API key and use it as the value of `basic.auth.user.info` in `config.py`, in the format `<api_key>:<secret_key>`.
  - Get the Schema Registry URL for your Confluent Kafka cluster and use it as the value of `schema_registry` in `config.py`.
  - Create a topic named `customers_review` with an Avro-based schema from here.
  - Replace `api_key` in `config.py` with your OpenAI API key (see the `config.py` sketch after this list).
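The steps above populate `config.py`. For reference, a minimal sketch of what that file is assumed to look like; the exact variable layout in this repo may differ, every value is a placeholder, and the `security.protocol` / `sasl.mechanism` entries are the usual Confluent Cloud defaults rather than values taken from the repo:

```python
# config.py -- a sketch with placeholder values; substitute your own credentials.
config = {
    # Confluent Kafka cluster; the cluster API key/secret map to SASL credentials.
    "bootstrap.servers": "<BOOTSTRAP_SERVER_URL>",
    "security.protocol": "SASL_SSL",   # typical Confluent Cloud setting (assumption)
    "sasl.mechanism": "PLAIN",         # typical Confluent Cloud setting (assumption)
    "sasl.username": "<CLUSTER_API_KEY>",
    "sasl.password": "<CLUSTER_API_SECRET>",
}

schema_registry = {
    # Schema Registry endpoint and its own API key/secret pair.
    "url": "<SCHEMA_REGISTRY_URL>",
    "basic.auth.user.info": "<SR_API_KEY>:<SR_API_SECRET>",
}

# OpenAI credentials used by the streaming job.
api_key = "<OPENAI_API_KEY>"
```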

- Download the data

  - Get the Yelp dataset from here in JSON format.
  - Unzip the dataset and place the unzipped folder, named `yelp_dataset`, in `datasets` (a quick sanity check is sketched below).
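As a quick sanity check, you can confirm the reviews are readable. This assumes the standard Yelp archive layout, where reviews live in `yelp_dataset/yelp_academic_dataset_review.json` as one JSON object per line:

```python
import json

# Read and print the first review record to confirm the dataset is in place.
with open("datasets/yelp_dataset/yelp_academic_dataset_review.json") as f:
    print(json.loads(f.readline()))
```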
- Run the services

```bash
cd src
docker-compose up -d --build
```
- Start the socket server

```bash
docker exec -it spark-master python jobs/streaming-sockets.py
```
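For context, `jobs/streaming-sockets.py` is the producer side of the pipeline. A minimal sketch of what such a socket server typically does, streaming the Yelp reviews line by line over TCP so Spark can consume them; the port and file path here are assumptions, not taken from the repo:

```python
import json
import socket

HOST, PORT = "0.0.0.0", 9999  # assumed port; must match the Spark job's socket source

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((HOST, PORT))
    server.listen(1)
    conn, _ = server.accept()
    # Stream one JSON review per line to the connected Spark client.
    with conn, open("datasets/yelp_dataset/yelp_academic_dataset_review.json") as f:
        for line in f:
            record = json.loads(line)  # validate before sending
            conn.sendall((json.dumps(record) + "\n").encode("utf-8"))
```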
  • Start the Spark Streaming Job
docker exec -it spark-master spark-submit \
--master spark://spark-master:7077 \
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1 \
jobs/spark-streaming.py
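In outline, the streaming job reads the socket stream and forwards each record to the `customers_review` Kafka topic. A minimal sketch under those assumptions; the actual `jobs/spark-streaming.py` presumably also applies the Avro schema and the OpenAI processing implied by `config.py`, and the Confluent SASL options are omitted here for brevity:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("YelpReviewStream").getOrCreate()

# Read line-delimited JSON reviews from the socket server started above.
reviews = (
    spark.readStream.format("socket")
    .option("host", "spark-master")  # assumed host/port matching the socket server
    .option("port", 9999)
    .load()
)

# The Kafka sink expects a string `value` column; each line is forwarded as-is.
# Confluent Cloud additionally needs kafka.security.protocol / kafka.sasl.* options.
query = (
    reviews.selectExpr("CAST(value AS STRING) AS value")
    .writeStream.format("kafka")
    .option("kafka.bootstrap.servers", "<BOOTSTRAP_SERVER_URL>")
    .option("topic", "customers_review")
    .option("checkpointLocation", "/tmp/checkpoints/customers_review")
    .start()
)

query.awaitTermination()
```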