Implementation of RSS Feed scraper & REST API for Messari Back End role
This project uses Docker, which is necessary to launch and run the system.
This project includes a Postman collection for making requests to the REST endpoints. It saves time, makes basic testing easy, and keeps actions repeatable.
We use a `.env` file to inject environment variables into the Docker containers.

In the top-level directory, create a `.env` file and copy the sample vars below. If this were production, we would have multiple environment files, e.g. `.env.dev` or `.env.prod`:
```sh
touch .env
```

```
REDIS_HOST="redis"
REDIS_PORT=6379
POSTGRES_CONNECTION_STRING="postgresql+psycopg2://postgres:postgres@db:5432"
```
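For reference, a minimal sketch of how the app might read these variables (illustrative only; the fallback defaults shown are assumptions for running outside of Docker):

```python
import os

# Same variable names as the .env sample above; the fallbacks are
# illustrative defaults for local, non-Docker runs.
REDIS_HOST = os.environ.get("REDIS_HOST", "localhost")
REDIS_PORT = int(os.environ.get("REDIS_PORT", "6379"))
POSTGRES_CONNECTION_STRING = os.environ.get(
    "POSTGRES_CONNECTION_STRING",
    "postgresql+psycopg2://postgres:postgres@localhost:5432",
)
```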
This application is configured to run within a handful of Docker containers. Start it by running:

```sh
docker compose up
```
Note: docker-compose's `depends_on` only waits for a dependency container to start, not for it to finish booting, so the Flask API may retry its startup if the DB is not yet ready.
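One way to tolerate this, sketched below with SQLAlchemy (an illustration, not the repository's exact startup code), is to retry a trivial query until Postgres accepts connections:

```python
import time

from sqlalchemy import create_engine, text

def wait_for_db(connection_string: str, retries: int = 10, delay: float = 2.0):
    """Retry a trivial query until Postgres accepts connections."""
    engine = create_engine(connection_string)
    for _ in range(retries):
        try:
            with engine.connect() as conn:
                conn.execute(text("SELECT 1"))
            return engine
        except Exception:
            time.sleep(delay)
    raise RuntimeError("database never became ready")
```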
Note: Copy the environment variables from .vscode/launch.json, as they differ slightly from the ones used for Docker.
This repository uses Python's venv for local development. Create the virtual environment, activate it, and install the requirements:
```sh
python3 -m venv venv
. venv/bin/activate
cd app
pip3 install -r requirements.txt
```
Run the Docker containers for the necessary dependencies; alternatively, you could run Postgres and Redis locally:
```sh
docker compose up redis db
```
Run the Flask app. I like using the VSCode debugger for this, but you can also run it from the CLI:

```sh
flask --app app.py --debug run
```
All code in this repository is written in Python 3.
- REST API - Flask
- Data Storage - PostgreSQL
- ORM - SQLAlchemy/psycopg2
- Caching - Redis
- Streaming - flask-sse
- Web Server - gunicorn + nginx proxy
- Web Scraping - feedparser for RSS & BeautifulSoup (with lxml) for HTML; see the sketch after this list
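To illustrate how those two scraping layers fit together, a hedged sketch (the function and the paragraph-based extraction are assumptions, not the repository's actual code):

```python
import feedparser
import requests
from bs4 import BeautifulSoup

def scrape_feed(feed_url):
    """Parse an RSS feed, then extract text from each entry's HTML page."""
    feed = feedparser.parse(feed_url)
    articles = []
    for entry in feed.entries:
        html = requests.get(entry.link, timeout=10).text
        soup = BeautifulSoup(html, "lxml")
        # Naive extraction: real article bodies typically need per-site selectors.
        body = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))
        articles.append({"title": entry.title, "url": entry.link, "body": body})
    return articles
```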
- Article Routes
- /articles retrieves all articles in the DB (example request after this list)
- /article/<article_id> retrieves a single article by URL
- /articles/pattern retrieves all articles for a pattern_id
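For example, fetching all articles from a local instance (assuming the Flask default port; the response shape is an assumption):

```python
import requests

resp = requests.get("http://localhost:5000/articles")
resp.raise_for_status()
for article in resp.json():  # assumes the endpoint returns a JSON list
    print(article)
```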
- News Source Routes
- /source/create creates a new News Source and parses it (example request after this list)
- /sources retrieves all News Sources in the DB
- /source/<source_id> retrieves a single News Source by ID
- /source/update/ updates a News Source by ID
- /source/delete deletes a News Source by ID
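Creating a News Source might look like the following; the payload field names are hypothetical, so check the route handler for the real schema:

```python
import requests

payload = {"name": "Example Feed", "url": "https://example.com/rss"}  # hypothetical fields
resp = requests.post("http://localhost:5000/source/create", json=payload)
resp.raise_for_status()
print(resp.json())
```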
- Pattern Routes
- /pattern/create creates a new Pattern and looks for new matches (matching sketch after this list)
- /patterns retrieves all patterns in the DB
- /pattern/<pattern_id> retrieves a single pattern by ID
- /pattern/article/ retrieves all articles that match a Pattern
- /pattern/update/ updates a pattern and reprocesses it
- /pattern/delete deletes a pattern and any pattern matches associated
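For intuition, the matching step on Pattern creation could be as simple as a regex scan over stored article text. The sketch below is illustrative; the repository's actual matching logic may differ:

```python
import re

def find_matches(expression, articles):
    """Return the articles whose body matches a Pattern's expression."""
    regex = re.compile(expression, re.IGNORECASE)
    return [a for a in articles if regex.search(a.get("body", ""))]
```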
- ETL Routes
- /feed/run_etl runs the ETL feed; used by a shell script to periodically run the feed (trigger sketch after this list). This implementation is not ideal and was chosen in the interest of simplicity and time. See the notes below for more detail.
- Ideally this feed would be triggered without interacting with the Flask API at all, and instead be managed by a workflow orchestrator like Apache Airflow.
- Benefits of that approach would include more robust retry mechanisms, job tracking, and separation of the ETL feed from the REST API: if the API ever went down, the feed would be unaffected.
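The periodic trigger could look roughly like this in Python (the interval and HTTP method are assumptions; the repository actually uses a shell script for this):

```python
import time

import requests

# Illustrative stand-in for the shell script: hit the ETL route on a fixed interval.
while True:
    try:
        requests.post("http://localhost:5000/feed/run_etl", timeout=30)  # method assumed
    except requests.RequestException:
        pass  # a real runner would log and alert here
    time.sleep(300)  # every 5 minutes; the interval is an assumption
```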
- Streaming Routes
- /stream/subscribe?channel=<pattern_id|news_source_id> creates an SSE subscription channel for the specified Pattern or News Source (client sketch after this list)
- I chose SSE over WebSockets because there's no need for subscribers to communicate with the REST API beyond subscription initiation.
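A minimal subscriber sketch using plain requests (the channel value and frame format assume flask-sse's standard SSE output):

```python
import requests

url = "http://localhost:5000/stream/subscribe"
with requests.get(url, params={"channel": "1"}, stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        # SSE frames carry payloads on lines beginning with "data:".
        if line and line.startswith("data:"):
            print(line[len("data:"):].strip())
```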
- Add CI/CD pipeline
- Add Unit & Integration level testing
- Use workflow orchestration for managing the ETL feed
- Increase performance/scale
- Multi-thread server requests to increase API throughput
- Database replication & sharding
- Horizontally scale web crawling processes & API
- Collect analytics on article processing/querying
- Enhance streaming capabilities to support more combinations of Patterns/News Sources
- Add server-side rendering for raw HTML to ensure we appropriately parse dynamic content
- Support for more types of pattern matching & querying
- Auto-generated documentation via flask-restx/swagger