This is an SQL extension to the mlinspect framework that transpiles Python library functions to SQL so they can be executed within a database system.
Prerequisite: Python 3.8
- Clone this repository
- Set up the environment
python -m venv venv
source venv/bin/activate
- If you want to use the visualisation functions we provide, install Graphviz, which cannot be installed via pip
Linux:
apt-get install graphviz
macOS:
brew install graphviz
- Install pip dependencies
pip install -e .[dev]
- To ensure everything works, you can run the tests (without Graphviz, the visualisation test will fail)
python setup.py test
We prepared two examples: the first demonstrates the execution of machine learning pipelines only, and the second demonstrates a full end-to-end machine learning pipeline that compares the performance of different backends.
To run the latter, you need a PostgreSQL database system running in the background (on port 5432) with a user luca with password password that is allowed to copy from CSV files and has access to the respective database. For installation instructions, see https://www.postgresql.org/download/linux/ubuntu/
# After installing:
sudo -i -u postgres
psql
create user luca;
alter role luca with password 'password';
grant pg_read_server_files to luca;
create database healthcare_benchmark;
grant all privileges on database healthcare_benchmark to luca;
To also run the benchmarks in Umbra, you need an Umbra server running at port 5433.
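Before starting the end-to-end example, it can help to confirm that something is listening on the two ports named above. The following is a minimal sketch using only the Python standard library. The helper name `is_port_open` and the host `localhost` are assumptions made for illustration and are not part of this repository. An open port only shows that some server accepts TCP connections there, not that it is PostgreSQL or Umbra.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Ports taken from the setup text: 5432 for PostgreSQL, 5433 for Umbra.
    # The host "localhost" is an assumption; adjust it for a remote server.
    for name, port in (("PostgreSQL", 5432), ("Umbra", 5433)):
        state = "reachable" if is_port_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```

If a port is reported as not reachable, check that the corresponding server is started and configured for that port before running the benchmarks.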
For more information on which functions are supported when execution is outsourced to a DBMS, please see here.
mlinspect makes it easy to analyze your pipeline and automatically check for common issues.
from mlinspect import PipelineInspector
from mlinspect.inspections import MaterializeFirstOutputRows
from mlinspect.checks import NoBiasIntroducedFor
IPYNB_PATH = ...
inspector_result = PipelineInspector\
.on_pipeline_from_ipynb_file(IPYNB_PATH)\
.add_required_inspection(MaterializeFirstOutputRows(5))\
.add_check(NoBiasIntroducedFor(['race']))\
.execute()
extracted_dag = inspector_result.dag
dag_node_to_inspection_results = inspector_result.dag_node_to_inspection_results
check_to_check_results = inspector_result.check_to_check_results
With execution outsourced to a Database Management System (DBMS):
from mlinspect.to_sql.dbms_connectors.postgresql_connector import PostgresqlConnector
from mlinspect import PipelineInspector
from mlinspect.inspections import MaterializeFirstOutputRows
from mlinspect.checks import NoBiasIntroducedFor
dbms_connector = PostgresqlConnector(...)
IPYNB_PATH = ...
inspector_result = PipelineInspector\
.on_pipeline_from_ipynb_file(IPYNB_PATH)\
.add_required_inspection(MaterializeFirstOutputRows(5))\
.add_check(NoBiasIntroducedFor(['race']))\
.execute_in_sql(dbms_connector=dbms_connector, mode="VIEW", materialize=True)
extracted_dag = inspector_result.dag
dag_node_to_inspection_results = inspector_result.dag_node_to_inspection_results
check_to_check_results = inspector_result.check_to_check_results
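Once a run has finished, the result attributes can be looked at entry by entry. The sketch below assumes only that `check_to_check_results` and `dag_node_to_inspection_results` behave like dictionaries. The helper `summarize_results` is hypothetical, and the example data is a stand-in, not real mlinspect output.

```python
from typing import Any, List, Mapping


def summarize_results(results: Mapping[Any, Any]) -> List[str]:
    """Render each key/value pair of a result mapping as one 'key -> value' line."""
    return [f"{key} -> {value}" for key, value in results.items()]


# Stand-in data for illustration only. With a real run, pass
# inspector_result.check_to_check_results (assumed to be dict-like) instead.
example = {"NoBiasIntroducedFor(['race'])": "<check result>"}
for line in summarize_results(example):
    print(line)
```

The same helper can be applied to `dag_node_to_inspection_results` under the same assumption, which gives a quick textual overview of a run before digging into individual results.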