This demonstration shows how to build a customer analytics dashboard. Sissy-G Toys is a fictitious online retailer for toys and games. The GroundTruth customer analytics application provides marketing, sales and product managers with a one-stop-shop for analytics. The application uses machine learning models for audio transcription, natural language embeddings and sentiment analysis on structured, semi-structured and unstructured data.
Snowpark ML (in public preview) is a Python framework for machine learning workloads with Snowpark. Currently, Snowpark ML provides a model registry (storing ML tracking data and models in Snowflake tables and stages), feature-engineering primitives similar to scikit-learn (e.g. LabelEncoder and OneHotEncoder), and support for training and deploying certain model types, including deployment as user-defined functions (UDFs).
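To make the preprocessing primitives concrete, here is a minimal sketch (not taken from this repo) of label-encoding a column with Snowpark ML; the connection parameters, table name, and column names are placeholder assumptions.

```python
# Minimal Snowpark ML preprocessing sketch; the connection parameters, the
# DEMO.DEMO.CUSTOMERS table, and the STATE column are placeholders.
from snowflake.snowpark import Session
from snowflake.ml.modeling.preprocessing import LabelEncoder

connection_parameters = {
    "account": "<ORG_NAME-ACCOUNT_NAME>",
    "user": "<USER>",
    "password": "<PASSWORD>",
}
session = Session.builder.configs(connection_parameters).create()

df = session.table("DEMO.DEMO.CUSTOMERS")
encoder = LabelEncoder(input_cols=["STATE"], output_cols=["STATE_ENC"])
df_encoded = encoder.fit(df).transform(df)  # transformation runs inside Snowflake
df_encoded.show()
```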
This guide demonstrates how to use Apache Airflow to orchestrate a machine learning pipeline leveraging the Snowpark provider and Snowpark ML for feature engineering and model tracking. While Snowpark ML has its own model support similar to scikit-learn, this code demonstrates a "bring-your-own" model approach: an open-source scikit-learn model is tracked with the Snowpark ML model registry and served in an Airflow task rather than in a Snowpark user-defined function (UDF).
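The pattern looks roughly like the following sketch. The registry API shown is from the Snowpark ML public preview and may differ between releases; the task name, model name, and credentials are illustrative, not this repo's actual code.

```python
# Hedged "bring-your-own model" sketch: train with open-source scikit-learn
# in an Airflow task, then log the model to the Snowpark ML model registry.
# The registry API is preview-era and the names below are illustrative.
from airflow.decorators import task

@task()
def train_sentiment_model(X_train: list, y_train: list) -> str:
    from sklearn.ensemble import RandomForestClassifier
    from snowflake.snowpark import Session
    from snowflake.ml.registry import model_registry

    model = RandomForestClassifier().fit(X_train, y_train)

    connection_parameters = {
        "account": "<ORG_NAME-ACCOUNT_NAME>",  # placeholder credentials
        "user": "<USER>",
        "password": "<PASSWORD>",
    }
    session = Session.builder.configs(connection_parameters).create()
    registry = model_registry.ModelRegistry(
        session=session, database_name="DEMO", schema_name="DEMO"
    )
    registry.log_model(
        model=model, model_name="sentiment_classifier", model_version="1"
    )
    return "sentiment_classifier"
```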
This demo also shows the use of the Snowflake XCOM backend, which supports security and governance by serializing all task input/output to Snowflake tables and stages while storing only a URI pointer to the data in the Airflow XCOM table.
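As a rough illustration, switching XCOM backends is standard Airflow configuration; the backend class path below comes from the astronomer-providers Snowflake package and the table/stage variable names are assumptions, so both may differ between releases.

```bash
# Hedged example of enabling the Snowflake XCOM backend via environment
# variables; the class path and table/stage settings are illustrative.
AIRFLOW__CORE__XCOM_BACKEND=astronomer.providers.snowflake.xcom_backends.snowflake.SnowflakeXComBackend
AIRFLOW__CORE__XCOM_SNOWFLAKE_TABLE='DEMO.DEMO.XCOM_TABLE'
AIRFLOW__CORE__XCOM_SNOWFLAKE_STAGE='DEMO.DEMO.XCOM_STAGE'
```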
This workflow includes:
- sourcing structured, unstructured, and semi-structured data from different systems
- extract, transform, and load with the Snowpark Python provider for Airflow
- ingest with Astronomer's Python SDK for Airflow
- audio file transcription with OpenAI Whisper
- natural language embeddings with OpenAI Embeddings and the Weaviate provider for Airflow
- vector search with Weaviate (see the first sketch below)
- sentiment classification with LightGBM (see the second sketch below)
- ML model management with Snowpark ML

All of the above are presented in a Streamlit application.
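First, a hypothetical sketch of a vector search against the local Weaviate instance using the v3 Python client; the CustomerComment class, its properties, and the query text are illustrative rather than the demo's actual schema.

```python
# Hypothetical vector search against the local Weaviate instance (v3 Python
# client); the class and property names are illustrative.
import weaviate

client = weaviate.Client("http://localhost:8081")
result = (
    client.query
    .get("CustomerComment", ["text", "sentiment"])
    .with_near_text({"concepts": ["delayed shipping"]})  # vectorized by the OpenAI module
    .with_limit(5)
    .do()
)
print(result)
```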
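Second, a minimal, self-contained sketch of sentiment classification with LightGBM; the random arrays stand in for the OpenAI embeddings produced earlier in the pipeline and are not the demo's training data.

```python
# Self-contained LightGBM sentiment-classification sketch; random vectors
# stand in for OpenAI embeddings (1536 is the ada-002 embedding dimension).
import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 1536))   # stand-in embedding vectors
y = rng.integers(0, 2, size=100)   # 0 = negative, 1 = positive sentiment

clf = lgb.LGBMClassifier(n_estimators=50)
clf.fit(X, y)
print(clf.predict_proba(X[:3]))    # class probabilities for three samples
```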
Prerequisites:

- Docker Desktop or a similar Docker service running locally, with the docker CLI installed.
- Astronomer account or Trial Account (optional)
- Snowflake account or Trial Account
- OpenAI account or Trial Account
- Install Astronomer's Astro CLI. The Astro CLI is an Apache 2.0-licensed, open-source tool for building Airflow instances, and it is the fastest and easiest way to get up and running with Airflow. Open a terminal window and run:
For MacOS:

```bash
brew install astro
```

For Linux:

```bash
curl -sSL install.astronomer.io | sudo bash -s
```
- Clone this repository:

```bash
git clone https://github.com/astronomer/airflow-snowparkml-demo
cd airflow-snowparkml-demo
```
- Open the `.env` file in an editor and update the following variables with your account information. This demo assumes the use of a new Snowflake trial account with admin privileges. A database named 'DEMO' and a schema named 'DEMO' will be created in the DAG. Running this demo without admin privileges, or with an existing database/schema, will require further updates to the `.env` file.
- AIRFLOW_CONN_SNOWFLAKE_DEFAULT
  - login
  - password
  - account **
- OPENAI_APIKEY

** The Snowflake `account` field of the connection should use the new `ORG_NAME-ACCOUNT_NAME` format as per Snowflake Account Identifier policies. The ORG and ACCOUNT names can be found in the confirmation email or in the Snowflake login link (e.g. https://xxxxxxx-yyy11111.snowflakecomputing.com/console/login). Do not specify a `region` when using this format for accounts.
NOTE: Database and Schema names should be CAPITALIZED due to a bug in Snowpark ML.
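For reference, Airflow (2.3+) can read JSON-serialized connections from AIRFLOW_CONN_* environment variables; the entry below is a hypothetical example with placeholder values, and the actual `.env` in this repo may use additional fields.

```bash
# Hypothetical .env entries with placeholder values.
AIRFLOW_CONN_SNOWFLAKE_DEFAULT='{"conn_type": "snowflake", "login": "<USER>", "password": "<PASSWORD>", "extra": {"account": "<ORG_NAME-ACCOUNT_NAME>"}}'
OPENAI_APIKEY='<YOUR_OPENAI_API_KEY>'
```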
- Start Apache Airflow:

```bash
astro dev restart
```
A browser window should open to http://localhost:8080. Log in with username `admin` and password `admin`.
- Run the Customer Analytics demo DAG:

```bash
astro dev run dags unpause customer_analytics
astro dev run dags trigger customer_analytics
```
Follow the status of the DAG run in the Airflow UI (username: `admin`, password: `admin`).
- After the DAG completes, look at the customer analytics dashboard in Streamlit. Streamlit has been installed alongside the Airflow UI in the webserver container.

Connect to the webserver container with the Astro CLI:

```bash
astro dev bash -w
```

Start Streamlit:

```bash
cd include/streamlit/src
python -m streamlit run ./streamlit_app.py
```

Open the Streamlit application in a browser.
Other service UIs are available at the following:
- Airflow: http://localhost:8080 (username: `admin`, password: `admin`)
- Weaviate: https://console.weaviate.io/ (enter localhost:8081 in the "Self-hosted Weaviate" field)
See README.md for an example of using Snowpark ML libraries for training in a UDF.