Feature/Example for training KFP v1 #2118

Closed · wants to merge 26 commits

Commits (26)
03fa850
Adds a first draft of a kfpv1-metricscollector
Feb 9, 2023
8918473
Use PodName as input
votti Feb 10, 2023
fd53d85
Adds example for tuning a kfp v1 pipeline with Katib
votti Feb 15, 2023
e9a0051
Adds python < 3.11 compatibility
votti Feb 15, 2023
17123d6
Add histogram equalization before rescaling
votti Feb 15, 2023
4f19db8
Update copyright date
votti Mar 16, 2023
9f83b0f
Update python version
votti Mar 16, 2023
61e77ea
Publish the docker image in kubeflowkatib
votti Mar 16, 2023
88c20c3
Fix suggested typo fixes
votti Jun 21, 2023
904d07d
Move KFP V1 metrics collector docker files to v1 subfolder
votti Jun 21, 2023
31655dd
Support loading of folder of metrics collector files
votti Jun 21, 2023
c458541
Move kfpv1 metricscollector in v1 subfolder
votti Jun 21, 2023
cee9970
Remove duplicated notebook section
votti Jun 21, 2023
f7e697b
Add dependencies for KFPv1 e2e testing
Jul 18, 2023
36ed372
TMP: changes to run tests locally
Jul 18, 2023
15c4a4b
Add missing ClusterRole update
Jul 18, 2023
741059f
Remove accidentally included `self`
Jul 18, 2023
7d33b7b
Rename parameter to more meaningful name
Jul 18, 2023
35df815
Extend example notebook with simple example for e2e tests
Jul 20, 2023
0504085
Revert "TMP: changes to run tests locally"
Jul 20, 2023
4cddd3e
Adds spec of a simple kfp1+katib experiment spec
Jul 20, 2023
6a0bdd3
Update psutil version to fix Docker build error
Jul 21, 2023
182b787
Move kubeflow installation after katib
Sep 12, 2023
9fc7c02
Parametrize kubeflow version
Sep 12, 2023
579546c
Add `namespace` parameter
Sep 12, 2023
582a6a7
Add kfpv1 e2e test
Sep 12, 2023
45 changes: 45 additions & 0 deletions .github/workflows/e2e-test-kfpv1.yaml
@@ -0,0 +1,45 @@
name: E2E Test with kubeflow pipelines v1

on:
  pull_request:
    paths-ignore:
      - "pkg/new-ui/v1beta1/frontend/**"

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

env:
  GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

jobs:
  e2e:
    runs-on: ubuntu-20.04
    timeout-minutes: 120
    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - name: Setup Test Env
        uses: ./.github/workflows/template-setup-e2e-test
        with:
          kubernetes-version: ${{ matrix.kubernetes-version }}
          python-version: "3.10"

      - name: Run e2e test with ${{ matrix.experiments }} experiments
        uses: ./.github/workflows/template-e2e-test
        with:
          experiments: ${{ matrix.experiments }}
          training-operator: true
          # Comma Delimited
          trial-images: kfpv1-metrics-collector
          install-kfp: 1.8.1
          experiment-namespace: kubeflow

    strategy:
      fail-fast: false
      matrix:
        kubernetes-version: ["v1.23.13", "v1.24.7", "v1.25.3"]
        # Comma Delimited
        experiments:
          - "katib-kfp-example-e2e-v1"
2 changes: 2 additions & 0 deletions .github/workflows/publish-core-images.yaml
@@ -32,3 +32,5 @@ jobs:
         dockerfile: cmd/metricscollector/v1beta1/file-metricscollector/Dockerfile
       - component-name: tfevent-metrics-collector
         dockerfile: cmd/metricscollector/v1beta1/tfevent-metricscollector/Dockerfile
+      - component-name: kfpv1-metrics-collector
+        dockerfile: cmd/metricscollector/v1beta1/kfp-metricscollector/v1/Dockerfile
13 changes: 11 additions & 2 deletions .github/workflows/template-e2e-test/action.yaml
@@ -21,6 +21,15 @@ inputs:
     required: false
     description: mysql or postgres
     default: mysql
+  install-kfp:
+    required: false
+    description: whether Kubeflow Pipelines is required as a dependency; if so, provide the version as a string (e.g. 1.8.1)
+    default: false
+  experiment-namespace:
+    required: false
+    description: namespace to execute the test experiment in
+    default: default

runs:
  using: composite

@@ -31,8 +40,8 @@ runs:
   - name: Setup Katib
     shell: bash
-    run: ./test/e2e/v1beta1/scripts/gh-actions/setup-katib.sh ${{ inputs.katib-ui }} ${{ inputs.training-operator }} ${{ inputs.database-type }}
+    run: ./test/e2e/v1beta1/scripts/gh-actions/setup-katib.sh ${{ inputs.katib-ui }} ${{ inputs.training-operator }} ${{ inputs.database-type }} ${{ inputs.install-kfp }}

   - name: Run E2E Experiment
     shell: bash
-    run: ./test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.sh ${{ inputs.experiments }}
+    run: ./test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.sh ${{ inputs.experiments }} ${{ inputs.experiment-namespace }}
24 changes: 24 additions & 0 deletions cmd/metricscollector/v1beta1/kfp-metricscollector/v1/Dockerfile
@@ -0,0 +1,24 @@
FROM python:3.10-slim
Member:

Please let's use the same structure as for other metrics collectors:
cmd/metricscollector/v1beta1/kfp-metricscollector/Dockerfile

Member:

@andreyvelich My concern is where we will put the Dockerfile for KFP v2. So I would suggest we put the Dockerfile for KFP v1 here. WDYT?

Member:

Oh, I see. Do we really need to support KFP v1 if, eventually, every Kubeflow user should migrate to KFP v2?

Member:

Because KFP v1 and KFP v2 aren't compatible, I think migrating from v1 to v2 is hard in production, so I guess users will need a lot of time to update the version.

Hence, supporting KFP v1 in Katib would be useful. WDYT?

Member:

I see. In any case, I still have a question (#2118 (review)): why do we need a separate metrics collector for KFP if we just need to read the logs from the metrics file?

Member (@andreyvelich, Jul 26, 2023):

> Is there any reason for restricting the metrics file configuration to one line?

@zijianjoy The Katib metrics collector parses the metrics file line by line and expects the metric name and value to be located on a single line.

@votti From the log line I can see that metrics are written to the /tmp/argo/outputs/artifacts/mlpipeline-metrics.tgz file, isn't it?

> Btw: if you're wondering why I don't just use the StdOut collector and additionally print the metrics to the log: this is because that also broke the Argo command.

@votti Yeah, this could be an issue, since we override the start command to make sure we redirect StdOut to the /var/log/katib/metrics.log file, so the Katib metrics collector can parse this file. Otherwise, the metrics collector can't parse the StdOut. The main difference between the StdOut and File metrics collectors is that StdOut tails /var/log/katib/metrics.log and prints the logs.
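For illustration, here is a minimal sketch of the format mismatch being discussed — the exact formats are assumptions based on the Katib and KFP v1 documentation, not taken from this PR:

# Katib's file metrics collector scans a plain-text log line by line,
# expecting metric name/value pairs such as:
#
#   accuracy=0.94
#   loss=0.21
#
# A KFP v1 component instead writes its metrics as a single JSON artifact
# (the mlpipeline-metrics output), roughly like this:
import json

kfp_v1_metrics = {
    "metrics": [
        {"name": "accuracy", "numberValue": 0.94, "format": "RAW"},
        {"name": "loss", "numberValue": 0.21, "format": "RAW"},
    ]
}
with open("/tmp/outputs/mlpipeline_metrics/data", "w") as f:
    json.dump(kfp_v1_metrics, f)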

Author (votti):

Re: metrics file:
One of the complexities Kubeflow Pipelines manages is handling output artifacts (usually compressing them and saving them to S3 storage). This is what seems to be broken when using the file collector: something appears to go wrong while compressing and copying the file to /tmp/argo/outputs/artifacts/mlpipeline-metrics.tgz.

After finding some time to look into it, I think the reason is very similar to the StdOut collector: the collector modifies the Argo CMD/ARGS in a way that I think causes these issues.

From the pod definition — unmodified (e.g. when using the custom KFP metrics collector):

...
      _outputs = train_e2e(**_parsed_args)
      
    Args:

      --input-nr
      /tmp/inputs/input_nr/data
      --lr
      0.0005293023468535503
      --optimizer
      Adam
      --loss
      categorical_crossentropy
      --epochs
      3
      --batch-size
      36
      --mlpipeline-metrics
      /tmp/outputs/mlpipeline_metrics/data

When using the file collector as the metrics collector:

...
      _outputs = train_e2e(**_parsed_args)
       --input-nr /tmp/inputs/input_nr/data --lr 0.00021802007326291811 --optimizer Adam --loss categorical_crossentropy --epochs 3 --batch-size 53 --mlpipeline-metrics /tmp/outputs/mlpipeline_metrics/data && echo completed > /tmp/outputs/mlpipeline_metrics/$$$$.pid

I think this could be solved by following this proposal: #2181
Until this is fixed, I think having a custom metrics collector that does not modify the command is a necessary workaround.

Member (@andreyvelich, Aug 29, 2023):

@votti I think this could also be solved with this feature, couldn't it: #577?
Basically, we can use the Katib SDK to implement an API for pushing metrics to the Katib DB instead of using pull-based metrics collectors, which require changing the entrypoint.

Users would be required to report metrics in their objective training function.

For example:

import kubeflow.katib as katib

client = katib.KatibClient()
client.report(metrics={"accuracy": 0.9, "loss": 0.01})

We might need to make additional changes to the Katib controller to verify that metrics were reported by the user.

Author (votti):

Re: push-based metrics collection: that sounds like a good potential solution!
So can the KatibClient automatically infer which trial these metrics are associated with?

Member (@andreyvelich, Oct 24, 2023):

@votti Currently, users can get the Trial name using the ${trialSpec.Name} template in their Trial pod's environment variables. Then, they can call the KatibClient API with the appropriate Trial name to insert metrics into the Katib DB.
I think we should always add a TRIAL_NAME env var to the Trial pod, since it is useful for many use cases (e.g. exporting the trained model to S3, saving Trial metrics to the DB, etc.).
WDYT @tenzen-y @johnugeorge @votti?
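A rough sketch of how that push-based flow might look — assuming a TRIAL_NAME env var injected via the ${trialSpec.Name} template and a report-style API on KatibClient as proposed in #577; neither exists in this PR:

import os

import kubeflow.katib as katib

# Hypothetical: assumes Katib injects the Trial name into the pod environment.
trial_name = os.environ["TRIAL_NAME"]

client = katib.KatibClient()
# Hypothetical API from the #577 proposal, not a released KatibClient method.
client.report(trial_name=trial_name, metrics={"accuracy": 0.9, "loss": 0.01})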


ARG TARGETARCH
ENV TARGET_DIR /opt/katib
ENV METRICS_COLLECTOR_DIR cmd/metricscollector/v1beta1/kfp-metricscollector/v1
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/metricscollector/v1beta1/kfp-metricscollector/v1:${TARGET_DIR}/pkg/metricscollector/v1beta1/common/

ADD ./pkg/ ${TARGET_DIR}/pkg/
ADD ./${METRICS_COLLECTOR_DIR}/ ${TARGET_DIR}/${METRICS_COLLECTOR_DIR}/

WORKDIR ${TARGET_DIR}/${METRICS_COLLECTOR_DIR}

RUN if [ "${TARGETARCH}" = "arm64" ]; then \
    apt-get -y update && \
    apt-get -y install gfortran libpcre3 libpcre3-dev && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*; \
    fi

RUN pip install --no-cache-dir -r requirements.txt
RUN chgrp -R 0 ${TARGET_DIR} \
    && chmod -R g+rwX ${TARGET_DIR}

ENTRYPOINT ["python", "main.py"]
101 changes: 101 additions & 0 deletions cmd/metricscollector/v1beta1/kfp-metricscollector/v1/main.py
@@ -0,0 +1,101 @@
# Copyright 2023 The Kubeflow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import os
from logging import INFO, StreamHandler, getLogger

import api_pb2
import const
import grpc
from metrics_loader import MetricsCollector
from pns import WaitMainProcesses

timeout_in_seconds = 60


def parse_options():
    parser = argparse.ArgumentParser(
        description="KFP V1 MetricsCollector", add_help=True
    )

    # TODO (andreyvelich): Add early stopping flags.
    parser.add_argument("-s-db", "--db_manager_server_addr", type=str, default="")
    parser.add_argument("-t", "--pod_name", type=str, default="")
    parser.add_argument(
        "-path",
        "--metrics_file_dir",
        type=str,
        default=const.DEFAULT_METRICS_FILE_KFPV1_DIR,
    )
    parser.add_argument("-m", "--metric_names", type=str, default="")
    parser.add_argument("-o-type", "--objective_type", type=str, default="")
    parser.add_argument("-f", "--metric_filters", type=str, default="")
    parser.add_argument(
        "-p", "--poll_interval", type=int, default=const.DEFAULT_POLL_INTERVAL
    )
    parser.add_argument(
        "-timeout", "--timeout", type=int, default=const.DEFAULT_TIMEOUT
    )
    parser.add_argument(
        "-w", "--wait_all_processes", type=str, default=const.DEFAULT_WAIT_ALL_PROCESSES
    )
    opt = parser.parse_args()
    return opt


if __name__ == "__main__":
    logger = getLogger(__name__)
    handler = StreamHandler()
    handler.setLevel(INFO)
    logger.setLevel(INFO)
    logger.addHandler(handler)
    logger.propagate = False
    opt = parse_options()
    wait_all_processes = opt.wait_all_processes.lower() == "true"
    db_manager_server = opt.db_manager_server_addr.split(":")
    trial_name = "-".join(opt.pod_name.split("-")[:-1])
    if len(db_manager_server) != 2:
        raise Exception(
            "Invalid Katib DB manager service address: %s" % opt.db_manager_server_addr
        )

    WaitMainProcesses(
        pool_interval=opt.poll_interval,
        timout=opt.timeout,
        wait_all=wait_all_processes,
        completed_marked_dir=None,
Member:

Why do we set completed_marked_dir to None? Can we set opt.metrics_file_dir instead?

Author (votti):

The documentation on this is a bit sparse, but if I understand the code right, this would require the Kubeflow pipeline to write a file <pid>.pid containing a TRAINING_COMPLETED marker into this directory, which it does not do:

if completed_marked_dir:
    mark_file = os.path.join(completed_marked_dir, "{}.pid".format(pid))
    # Check if file contains "completed" marker
    with open(mark_file) as file_obj:
        contents = file_obj.read()
        if contents.strip() != const.TRAINING_COMPLETED:
            raise Exception(
                "Unable to find marker: {} in file: {} with contents: {} for pid: {}".format(
                    const.TRAINING_COMPLETED, mark_file, contents, pid))
# Add main pid to finished pids set

So I think None is correct here.

Member:

Thanks for the explanation. Let me check.

    )

    mc = MetricsCollector(opt.metric_names.split(";"))
    observation_log = mc.parse_file(opt.metrics_file_dir)

    channel = grpc.beta.implementations.insecure_channel(
        db_manager_server[0], int(db_manager_server[1])
    )

    with api_pb2.beta_create_DBManager_stub(channel) as client:
        logger.info(
            "In "
            + trial_name
            + " "
            + str(len(observation_log.metric_logs))
            + " metrics will be reported."
        )
        client.ReportObservationLog(
            api_pb2.ReportObservationLogRequest(
                trial_name=trial_name, observation_log=observation_log
            ),
            timeout=timeout_in_seconds,
        )
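A small worked example of the trial_name derivation above — the pod name is made up; the split/join strips the trailing pod suffix to recover the Trial name:

pod_name = "kfp-example-trial-x7k2p"  # hypothetical Trial pod name
trial_name = "-".join(pod_name.split("-")[:-1])
print(trial_name)  # -> "kfp-example-trial"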
5 changes: 5 additions & 0 deletions cmd/metricscollector/v1beta1/kfp-metricscollector/v1/requirements.txt
@@ -0,0 +1,5 @@
psutil==5.9.4
rfc3339>=6.2
grpcio==1.41.1
googleapis-common-protos==1.6.0
protobuf==3.20.0
14 changes: 11 additions & 3 deletions examples/v1beta1/kubeflow-pipelines/README.md
@@ -3,6 +3,10 @@
The following examples show how to use Katib with
[Kubeflow Pipelines](https://github.com/kubeflow/pipelines).

Two different aspects are illustrated here:

A) How to orchestrate Katib experiments from Kubeflow Pipelines using the Katib Kubeflow component (examples 1 & 2)
B) How to use Katib to tune the parameters of a Kubeflow pipeline (example 3)

You can find the Katib Component source code for the Kubeflow Pipelines
[here](https://github.com/kubeflow/pipelines/tree/master/components/kubeflow/katib-launcher).

@@ -13,6 +17,8 @@ You have to install the following Python SDK to run these examples:
- [`kfp`](https://pypi.org/project/kfp/) >= 1.8.12
- [`kubeflow-katib`](https://pypi.org/project/kubeflow-katib/) >= 0.13.0

To run parameter tuning over Kubeflow pipelines, Katib additionally needs to be set up to work with Argo Workflows tasks. The setup is described in the example notebook (3).

## Multi-User Pipelines Setup

The Notebooks examples run Pipelines in multi-user mode and your Kubeflow Notebook
@@ -25,10 +31,12 @@ to give the Kubeflow Notebook access to run Kubeflow Pipelines.

The following Pipelines are deployed from Kubeflow Notebook:

- - [Kubeflow E2E MNIST](kubeflow-e2e-mnist.ipynb)
+ 1) [Kubeflow E2E MNIST](kubeflow-e2e-mnist.ipynb)
+ 2) [Katib Experiment with Early Stopping](early-stopping.ipynb)
- - [Katib Experiment with Early Stopping](early-stopping.ipynb)
+ 3) [Tune parameters of a `MNIST` Kubeflow pipeline with Katib](kubeflow-kfpv1-opt-mnist.ipynb)

- The following Pipelines have to be compiled and uploaded to the Kubeflow Pipelines UI:
+ The following Pipelines have to be compiled and uploaded to the Kubeflow Pipelines UI for examples 1 & 2:

- [MPIJob Horovod](mpi-job-horovod.py)