This project contains the whole pipeline for recording messages from Slack channels to AWS and Hadoop.
Abstract: Messages and discussions from Slack channels are collected and written to S3 and HDFS in real time. A batch job runs every 6 hours for any querying or processing: data in HDFS is loaded into Hive, and data in S3 is moved to another S3 bucket or to RDS in AWS.
SlackMachine: Bot written in Python using the slack-machine library. It publishes messages from the channels it resides in to both a Kafka topic and a Kinesis stream.
- Create bot for your workspace by adding a Custom bot integration.
- Add bot to channels of your choice
- Download SlackMachine directory
- Create virtual environment with:
virtualenv --python=/path/to/python/version/3.7 venv
- Activate virtual environment with:
source venv/bin/activate
- Change directory to SlackMachine:
cd /path/to/SlackMachine
- Install dependencies:
pip install -r requirements.txt
- Set the environment variable for the Slack API token of the bot created above; the token can be found in the manage custom configurations menu:
export SLACK_API_TOKEN=<your token here>
- Set environment variable for kafka bootstrap server:
export BOOTSTRAP_SERVERS=<your broker address here>
- Run the bot:
slack-machine
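The bot's fan-out to both sinks can be sketched roughly as follows. This is a minimal illustration, not the project's actual plugin: the function names and the record fields are hypothetical, and the two sinks are passed in as plain callables so the snippet runs without a broker. In a real deployment they would presumably wrap kafka-python's `KafkaProducer.send` and the boto3 Kinesis client's `put_record`.

```python
import json
from typing import Callable, Dict, List

# A sink takes one serialized record (bytes) and delivers it somewhere.
Sink = Callable[[bytes], None]


def build_record(screen_name: str, user_id: str, channel: str,
                 time: str, text: str) -> bytes:
    """Serialize one Slack message as UTF-8 JSON.

    The field names are an assumption, chosen to match the columns
    the Spark consumer is described as producing.
    """
    record = {
        "screen_name": screen_name,
        "user_id": user_id,
        "channel": channel,
        "time": time,
        "text": text,
    }
    return json.dumps(record).encode("utf-8")


def publish(record: bytes, sinks: Dict[str, Sink]) -> List[str]:
    """Send the record to every sink and return the names of sinks that failed."""
    failed = []
    for name, send in sinks.items():
        try:
            send(record)
        except Exception:
            # One sink being down should not stop delivery to the other.
            failed.append(name)
    return failed


if __name__ == "__main__":
    # In-memory stand-ins for the Kafka and Kinesis producers.
    kafka_buffer, kinesis_buffer = [], []
    rec = build_record("alice", "U123", "general", "1556553600.000200", "hello")
    print(publish(rec, {"kafka": kafka_buffer.append,
                        "kinesis": kinesis_buffer.append}))
```

Keeping the sinks behind a common callable makes it easy to test the bot logic without network access and to add or drop a destination later.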
Spark Consumer: Spark Streaming consumer written in Scala. It reads records from the Kafka topic and parses them into a DataFrame with the columns screen_name, user_id, channel, time, and text. The data is then written to HDFS in Parquet format, partitioned as Date=[YYYYMMdd]/Hour=[HH].
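To make the partition layout concrete, here is a small Python sketch (not the Scala consumer itself) that derives the Date=[YYYYMMdd]/Hour=[HH] directory for a message. It assumes the time field is a Slack-style epoch-seconds timestamp and that partitions are computed in UTC; both are assumptions, and the base directory is a hypothetical path.

```python
from datetime import datetime, timezone


def partition_path(base_dir: str, slack_ts: str) -> str:
    """Return the partition directory for a message timestamp.

    slack_ts is assumed to be epoch seconds as a string,
    for example "1556553600.000200".
    """
    ts = datetime.fromtimestamp(float(slack_ts), tz=timezone.utc)
    return f"{base_dir}/Date={ts:%Y%m%d}/Hour={ts:%H}"


if __name__ == "__main__":
    # /data/slack is a hypothetical output directory.
    print(partition_path("/data/slack", "1556553600.000200"))
```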
Airflow & Hive: Airflow DAG that calls a Hive script, which creates an external table over the Spark output if it does not already exist and then looks for new partitions to add.
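The Hive script itself is not shown here, so the following is only a sketch of the kind of statements it might run, generated from Python for illustration. The table name, the location, and the all-STRING column types are assumptions; the real script may discover partitions differently (for example with MSCK REPAIR TABLE). Identifiers are backtick-quoted because names such as date and time can be reserved words in Hive.

```python
def create_table_sql(table: str, location: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement over the Parquet output."""
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n"
        "  `screen_name` STRING,\n"
        "  `user_id` STRING,\n"
        "  `channel` STRING,\n"
        "  `time` STRING,\n"
        "  `text` STRING\n"
        ")\n"
        "PARTITIONED BY (`date` STRING, `hour` STRING)\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


def add_partition_sql(table: str, location: str, date: str, hour: str) -> str:
    """Build an idempotent ADD PARTITION statement for one Date/Hour directory."""
    return (
        f"ALTER TABLE `{table}` ADD IF NOT EXISTS "
        f"PARTITION (`date`='{date}', `hour`='{hour}') "
        f"LOCATION '{location}/Date={date}/Hour={hour}'"
    )


if __name__ == "__main__":
    # slack_messages and /data/slack are hypothetical names.
    print(create_table_sql("slack_messages", "/data/slack"))
    print(add_partition_sql("slack_messages", "/data/slack", "20190429", "16"))
```

Because both statements use IF NOT EXISTS, rerunning them on every scheduled DAG run is harmless.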
Lambda: Triggers when data is added to S3. Reads the JSON data, extracts the necessary values, and sends them to an RDS MySQL instance.
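The parsing half of such a Lambda can be sketched as below. This is an illustration under assumptions rather than the project's handler: the column list, the table name, and the helper names are hypothetical, and the download from S3 and the database connection are left out so the snippet runs on its own. The `%s` placeholders follow the parameter style used by common Python MySQL drivers such as PyMySQL.

```python
import json
from urllib.parse import unquote_plus

# Assumed columns, mirroring the fields the Spark consumer extracts.
COLUMNS = ("screen_name", "user_id", "channel", "time", "text")

# slack_messages is a hypothetical table name.
INSERT_SQL = "INSERT INTO slack_messages ({}) VALUES ({})".format(
    ", ".join(f"`{c}`" for c in COLUMNS),
    ", ".join(["%s"] * len(COLUMNS)),
)


def s3_objects(event: dict) -> list:
    """Return (bucket, key) pairs from an S3 event notification."""
    pairs = []
    for rec in event.get("Records", []):
        s3 = rec["s3"]
        # Object keys arrive URL-encoded in S3 events.
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def to_row(raw: bytes) -> tuple:
    """Parse one JSON message and pick out the values to insert."""
    msg = json.loads(raw)
    # Missing fields become None, which a driver would insert as NULL.
    return tuple(msg.get(c) for c in COLUMNS)


if __name__ == "__main__":
    event = {"Records": [{"s3": {"bucket": {"name": "my-bucket"},
                                 "object": {"key": "slack/msg+1.json"}}}]}
    print(s3_objects(event))
    print(INSERT_SQL)
    print(to_row(b'{"screen_name": "alice", "text": "hello"}'))
```

In a full handler, each (bucket, key) pair would be fetched from S3, passed through `to_row`, and executed against the database with `INSERT_SQL` and the resulting tuple as parameters.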
Update (4/29/19):
This program uses a library called slack-machine, which can be installed and set up by following this link:
https://slack-machine.readthedocs.io/en/latest/user/install.html
A quick synopsis:
- Set up a virtual environment for Python.
a. Make a directory for your bot.
b. In that directory, use
virtualenv <name of env>
c. To activate it, use
source <name of env>/bin/activate
- Install packages with pip.
pip install slack-machine kafka-python boto3
- In the folder, create a file called
local_settings.py
where you can store the tokens and plugins for the bot (more on that soon).
- Create a folder called
plugins
and go there. Once there, use
touch __init__.py
to mark the folder as a Python package so that the plugin modules written there can be imported.
From there, your bot logic can be written in a separate file. Please refer to https://slack-machine.readthedocs.io/en/latest/plugins/basics.html to get started on creating plugins.
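As a rough sketch of the settings file mentioned in the steps above, the following reads the token from the environment and lists the plugin classes to load. The plugin path `plugins.recorder.RecorderPlugin` is a hypothetical name for a module and class you would create yourself; check the slack-machine documentation for the settings your installed version supports.

```python
import os

# Read the bot token from the environment instead of hard-coding it.
SLACK_API_TOKEN = os.environ.get("SLACK_API_TOKEN", "")

# Dotted paths of plugin classes to load.
# plugins.recorder.RecorderPlugin is a hypothetical example.
PLUGINS = [
    "plugins.recorder.RecorderPlugin",
]
```

Keeping the token out of the file means the settings can be committed to version control without exposing credentials.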