Spark-Socket Streaming

Architecture

How to Run

Set up Confluent (Kafka) Cluster and Configuration
- Go to Confluent Cloud.
- Create an environment with a default cluster.
- Create an API key for your cluster and set its key and secret as sasl.username and sasl.password in config.py.
- Get the bootstrap server URL for your Confluent Kafka cluster and set it as bootstrap.servers in config.py.
- Create a Schema Registry API key and set it as basic.auth.user.info in config.py, in the format <api_key>:<secret_key>.
- Get the Schema Registry URL for your Confluent Kafka cluster and set it as schema_registry in config.py.
- Create a topic named customers_review with the Avro-based schema from here.
- Set api_key in config.py to your OpenAI API key.
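
The settings listed above could be grouped in config.py roughly as follows. This is a minimal sketch, not the repository's actual file: the nesting under kafka, schema_registry and openai, the url key, and the helper unfilled_keys are assumptions made for illustration. Only the key names sasl.username, sasl.password, bootstrap.servers, basic.auth.user.info, schema_registry and api_key come from the steps above.

```python
# Hypothetical layout for config.py; the real file in the repository may differ.
config = {
    "kafka": {
        "bootstrap.servers": "<bootstrap_server_url>",
        "sasl.username": "<cluster_api_key>",
        "sasl.password": "<cluster_api_secret>",
    },
    "schema_registry": {
        "url": "<schema_registry_url>",  # assumed key name for the registry URL
        "basic.auth.user.info": "<api_key>:<secret_key>",
    },
    "openai": {
        "api_key": "<openai_api_key>",
    },
}


def unfilled_keys(section, prefix=""):
    """Return slash-separated paths of values that still look like <placeholders>."""
    missing = []
    for key, value in section.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested sections, extending the path prefix.
            missing.extend(unfilled_keys(value, prefix=f"{path}/"))
        elif isinstance(value, str) and value.startswith("<") and value.endswith(">"):
            missing.append(path)
    return missing
```

Calling unfilled_keys(config) before starting the services gives a quick list of settings that still hold placeholder text.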
Download the Data
- Download the Yelp dataset in JSON format from here.
- Unzip the dataset and place the unzipped folder, named yelp_dataset, in datasets.
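
To confirm the dataset landed where expected, a short check like the one below can help. It is a sketch under assumptions: it presumes the Yelp JSON files hold one JSON object per line, and the path datasets/yelp_dataset/yelp_academic_dataset_review.json and the helper iter_records are illustrative names, not taken from this repository.

```python
import json
from pathlib import Path


def iter_records(path, limit=None):
    """Yield parsed objects from a newline-delimited JSON file, skipping blank lines."""
    yielded = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if limit is not None and yielded >= limit:
                break
            line = line.strip()
            if not line:
                continue
            yield json.loads(line)
            yielded += 1


if __name__ == "__main__":
    # Assumed location and file name; adjust to match your checkout.
    review_file = Path("datasets/yelp_dataset/yelp_academic_dataset_review.json")
    if review_file.exists():
        for record in iter_records(review_file, limit=3):
            print(sorted(record))  # show the field names of the first few records
    else:
        print(f"Not found: {review_file}")
```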
Run the services

cd src

docker-compose up -d --build

Start the Socket Server

docker exec -it spark-master python jobs/streaming-sockets.py
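
For orientation, a socket server of this kind can be sketched in a few lines. The code below is not the contents of jobs/streaming-sockets.py; it is a minimal illustration that assumes records are sent to a single client as newline-terminated UTF-8 JSON, which is the text format Spark's socket source reads. The function name serve_records, the default host and port, and the ready callback are all hypothetical.

```python
import json
import socket


def serve_records(records, host="127.0.0.1", port=9999, ready=None):
    """Accept one client and send each record as a newline-terminated JSON string."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen(1)
        if ready is not None:
            # Report the bound (host, port); useful when port=0 picks a free port.
            ready(server.getsockname())
        conn, _addr = server.accept()
        with conn:
            for record in records:
                conn.sendall((json.dumps(record) + "\n").encode("utf-8"))
```

One line per record keeps the framing simple: the consumer only needs to split on newlines and parse each line as JSON.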

Start the Spark Streaming Job

docker exec -it spark-master spark-submit \
--master spark://spark-master:7077 \
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1 \
jobs/spark-streaming.py
