feat(images): Add MLflow #2635

Merged
merged 1 commit on May 9, 2024
9 changes: 9 additions & 0 deletions generated.tf


82 changes: 82 additions & 0 deletions images/mlflow/README.md
@@ -0,0 +1,82 @@
<!--monopod:start-->
# mlflow
| | |
| - | - |
| **OCI Reference** | `cgr.dev/chainguard/mlflow` |


* [View Image in Chainguard Academy](https://edu.chainguard.dev/chainguard/chainguard-images/reference/mlflow/overview/)
* [View Image Catalog](https://console.enforce.dev/images/catalog) for a full list of available tags.
* [Contact Chainguard](https://www.chainguard.dev/chainguard-images) for enterprise support, SLAs, and access to older tags.

---
<!--monopod:end-->

<!--overview:start-->
A minimal, [Wolfi](https://github.com/wolfi-dev)-based image for MLflow, an open source platform for the machine learning lifecycle.

<!--overview:end-->

<!--getting:start-->
## Download this Image
The image is available on `cgr.dev`:

```
docker pull cgr.dev/chainguard/mlflow:latest
```
<!--getting:end-->

<!--body:start-->
### MLflow Usage

MLflow's default entrypoint is Python, enabling us to run experiments directly:

```bash
docker run -it cgr.dev/chainguard/mlflow:latest <your experiment>.py
```
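For illustration, a minimal experiment script might look like the sketch below (the filename and logged values are hypothetical; mount the script into the container, e.g. with `-v "$PWD":"$PWD" -w "$PWD"`, so Python can reach it):

```python
# experiment.py - a minimal, hypothetical MLflow experiment
import mlflow

with mlflow.start_run():
    # Log one parameter and one metric; with no tracking server
    # configured, results are written to a local ./mlruns directory
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("accuracy", 0.9)
```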

Alternatively, we can override the entrypoint to use the MLflow CLI directly:

```bash
docker run -it --entrypoint mlflow cgr.dev/chainguard/mlflow:latest <options>
```
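For instance, to print the CLI version (any other `mlflow` subcommand or option can be substituted):

```bash
docker run -it --rm --entrypoint mlflow cgr.dev/chainguard/mlflow:latest --version
```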

### MLflow Tracking Usage

MLflow Tracking provides a UI that lets you track 'runs' (executions of data science code) through visualizations of their metrics, parameters, and artifacts.

To start the UI, open a terminal and run:

```bash
docker run -it -p 5000:5000 --entrypoint mlflow cgr.dev/chainguard/mlflow:latest ui --host 0.0.0.0
```

While the UI defaults to port 5000, you can serve it on a different port by passing `-p <PORT>` to the `ui` command. Make sure Docker publishes the same port.
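For example, to serve the UI on port 7777 instead (a sketch mirroring the command above; the port number is arbitrary):

```bash
docker run -it -p 7777:7777 --entrypoint mlflow cgr.dev/chainguard/mlflow:latest ui --host 0.0.0.0 -p 7777
```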

You should now be able to access the UI at [localhost:5000](http://localhost:5000).

The Tracking API can now be leveraged to record metrics, parameters, and artifacts:

```python
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LinearRegression

# Set the MLflow tracking URI
mlflow.set_tracking_uri("http://localhost:5000")

# Start an experiment
mlflow.set_experiment("my_experiment")

# A trivially trained model, standing in for your own training code
model = LinearRegression().fit([[0], [1]], [0, 1])

with mlflow.start_run():
    # Log parameters, metrics, and artifacts
    mlflow.log_param("param1", 0.01)
    mlflow.log_metric("metric1", 0.95)
    mlflow.log_artifact("path/to/artifact")  # replace with a path that exists locally
    # Log the trained model
    mlflow.sklearn.log_model(model, "model")
```

Ensure that the tracking URI correctly reflects where the MLflow server is running.
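If you would rather not hard-code the URI, MLflow also reads it from the `MLFLOW_TRACKING_URI` environment variable:

```bash
export MLFLOW_TRACKING_URI=http://localhost:5000
```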

For additional documentation covering MLflow Tracking, see the [official docs](https://mlflow.org/docs/latest/tracking.html).

<!--body:end-->
55 changes: 55 additions & 0 deletions images/mlflow/TESTING.md
@@ -0,0 +1,55 @@
# Testing MLflow

Start off by pulling down the image:

```bash
docker pull cgr.dev/chainguard/mlflow:latest
```

Now we'll run a quick test to ensure MLflow is detected by Python:

```bash
docker run -it --rm cgr.dev/chainguard/mlflow:latest -m mlflow
```

Because MLflow is installed inside a virtual environment, this also verifies that the image is using the virtual environment's Python rather than the system installation.
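As an optional extra check — a sketch that assumes the entrypoint is the virtual environment's `python3`, as configured in `config/main.tf` — you can print the interpreter path and MLflow version directly:

```bash
docker run -it --rm cgr.dev/chainguard/mlflow:latest -c "import sys, mlflow; print(sys.executable, mlflow.__version__)"
```

The reported interpreter path should point into `/usr/share/mlflow/bin/`.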

Now we can start MLflow Tracker:

```bash
docker run -it --rm -w $(pwd) -v $(pwd):$(pwd) -p 5000:5000 --entrypoint mlflow --name mlflow cgr.dev/chainguard/mlflow:latest ui --host 0.0.0.0
```

By default, this will start on port 5000. We can override this by running the following:

```bash
docker run -it --rm -w $(pwd) -v $(pwd):$(pwd) -p <PORT>:<PORT> --entrypoint mlflow --name mlflow cgr.dev/chainguard/mlflow:latest ui --host 0.0.0.0 -p <PORT>
```

The logs are not very verbose; the key line to look for is `Listening on: 0.0.0.0:<PORT>`.
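Since the container is named `mlflow`, you can also confirm this from another terminal (a minimal sketch):

```bash
docker logs mlflow 2>&1 | grep -i "listening"
```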

Now let's do a quick health check:

```bash
curl -vsL localhost:5000/health
```

The status code should be 200. If all is well, you should be able to access the UI at [localhost:5000](http://localhost:5000).

Now we can test basic functionality of MLflow Tracker. Save the following snippet as `test.py` in your current directory (which is mounted into the container):

```python
import mlflow

with mlflow.start_run():
    for epoch in range(0, 3):
        mlflow.log_metric(key="quality", value=2 * epoch, step=epoch)
```

And then execute it:

```bash
docker exec mlflow python ./test.py
```

This will create a run with a random name that should now be viewable in MLflow's UI.
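Optionally — a sketch that assumes the run landed in the default experiment, which has ID `0` — the Tracking REST API can confirm the run without opening the UI:

```bash
curl -s -X POST localhost:5000/api/2.0/mlflow/runs/search \
  -H "Content-Type: application/json" \
  -d '{"experiment_ids": ["0"]}'
```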
39 changes: 39 additions & 0 deletions images/mlflow/config/main.tf
@@ -0,0 +1,39 @@
terraform {
  required_providers {
    apko = { source = "chainguard-dev/apko" }
  }
}

variable "extra_packages" {
  description = "Additional packages to install."
  type        = list(string)
  default     = ["mlflow"]
}

variable "environment" {
  default = {}
}

module "accts" {
  source = "../../../tflib/accts"
  run-as = 65532
  uid    = 65532
  gid    = 65532
  name   = "nonroot"
}

output "config" {
  value = jsonencode({
    contents = {
      packages = var.extra_packages
    }
    accounts = module.accts.block
    environment = merge({
      "PATH" : "/usr/share/mlflow/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
    }, var.environment)
    entrypoint = {
      command = "/usr/share/mlflow/bin/python3"
    }
    work-dir = "/home/nonroot"
  })
}
13 changes: 13 additions & 0 deletions images/mlflow/generated.tf


38 changes: 38 additions & 0 deletions images/mlflow/main.tf
@@ -0,0 +1,38 @@
terraform {
  required_providers {
    oci = { source = "chainguard-dev/oci" }
  }
}

variable "target_repository" {
  description = "The docker repo into which the image and attestations should be published."
}

module "config" {
  source = "./config"
}

module "latest" {
  source            = "../../tflib/publisher"
  name              = basename(path.module)
  target_repository = var.target_repository
  config            = module.config.config
  build-dev         = true
}

module "test" {
  source = "./tests"
  digest = module.latest.image_ref
}

resource "oci_tag" "latest" {
  depends_on = [module.test]
  digest_ref = module.latest.image_ref
  tag        = "latest"
}

resource "oci_tag" "latest-dev" {
  depends_on = [module.test]
  digest_ref = module.latest.dev_ref
  tag        = "latest-dev"
}
13 changes: 13 additions & 0 deletions images/mlflow/metadata.yaml
@@ -0,0 +1,13 @@
name: mlflow
image: cgr.dev/chainguard/mlflow
logo: https://storage.googleapis.com/chainguard-academy/logos/mlflow.svg
endoflife: ""
console_summary: ""
short_description: |
  A minimal, [Wolfi](https://github.com/wolfi-dev)-based image for MLflow, an open source platform for the machine learning lifecycle.
compatibility_notes: ""
readme_file: README.md
upstream_url: https://mlflow.org/
keywords:
  - ai
  - python
43 changes: 43 additions & 0 deletions images/mlflow/tests/check-mlflow.sh
@@ -0,0 +1,43 @@
#!/usr/bin/env bash

set -o errexit -o nounset -o errtrace -o pipefail -x

# Random port is needed in multi-image test environments
PORT=$(shuf -i 1024-65535 -n 1)
CONTAINER_NAME="mlflow-${PORT}"

# Start MLflow Tracker
docker run \
  -d --rm \
  -v ./tmp/tests:/tmp/tests \
  -p "${PORT}":"${PORT}" \
  --name "${CONTAINER_NAME}" \
  --entrypoint mlflow \
  "${IMAGE_NAME}" \
  ui --host 0.0.0.0 -p "${PORT}"

# Dump logs and stop the container when the script exits
trap 'docker logs "${CONTAINER_NAME}" && docker stop "${CONTAINER_NAME}"' EXIT

# Check MLflow Tracker availability
check_ui_status() {
  local request_retries=10
  local retry_delay=5

  # Install curl
  apk add curl

  # Check availability
  for ((i = 1; i <= request_retries; i++)); do
    if [ "$(docker run --network container:"${CONTAINER_NAME}" cgr.dev/chainguard/curl:latest -o /dev/null -s -w "%{http_code}" "http://localhost:${PORT}/health")" -eq 200 ]; then
      return 0
    fi
    sleep "${retry_delay}"
  done

  echo "FAILED: Did not receive 200 HTTP response from Tracker after ${request_retries} attempts."
  exit 1
}

# Run tests
check_ui_status
56 changes: 56 additions & 0 deletions images/mlflow/tests/linear_regression.py
@@ -0,0 +1,56 @@
from pprint import pprint

import numpy as np
from sklearn.linear_model import LinearRegression

import mlflow
from mlflow.tracking import MlflowClient


def yield_artifacts(run_id, path=None):
    """Yield all artifacts in the specified run"""
    client = MlflowClient()
    for item in client.list_artifacts(run_id, path):
        if item.is_dir:
            yield from yield_artifacts(run_id, item.path)
        else:
            yield item.path


def fetch_logged_data(run_id):
    """Fetch params, metrics, tags, and artifacts in the specified run"""
    client = MlflowClient()
    data = client.get_run(run_id).data
    # Exclude system tags: https://www.mlflow.org/docs/latest/tracking.html#system-tags
    tags = {k: v for k, v in data.tags.items() if not k.startswith("mlflow.")}
    artifacts = list(yield_artifacts(run_id))
    return {
        "params": data.params,
        "metrics": data.metrics,
        "tags": tags,
        "artifacts": artifacts,
    }


def main():
    # enable autologging
    mlflow.sklearn.autolog()

    # prepare training data
    X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
    y = np.dot(X, np.array([1, 2])) + 3

    # train a model
    model = LinearRegression()
    model.fit(X, y)
    run_id = mlflow.last_active_run().info.run_id
    print(f"Logged data and model in run {run_id}")

    # show logged data
    for key, data in fetch_logged_data(run_id).items():
        print(f"\n---------- logged {key} ----------")
        pprint(data)


if __name__ == "__main__":
    main()