-
Notifications
You must be signed in to change notification settings - Fork 456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/Example for training KFP v1 #2118
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@votti Thanks for creating this PR, and sorry for the late response.
I left a few review comments for the first pass.
currently the docker image with the custom metrics collector lives in my personal docker. Maybe it would be better to add it to the kubeflow organisation? What would the steps be to achieve this?
Yes, we should build and push the image to our container registry.
You can add the image name and dockerfile path to the following:
katib/.github/workflows/publish-core-images.yaml
Lines 19 to 34 in b6afce7
strategy: | |
fail-fast: false | |
matrix: | |
include: | |
- component-name: katib-controller | |
dockerfile: cmd/katib-controller/v1beta1/Dockerfile | |
- component-name: katib-db-manager | |
dockerfile: cmd/db-manager/v1beta1/Dockerfile | |
- component-name: katib-ui | |
dockerfile: cmd/new-ui/v1beta1/Dockerfile | |
- component-name: cert-generator | |
dockerfile: cmd/cert-generator/v1beta1/Dockerfile | |
- component-name: file-metrics-collector | |
dockerfile: cmd/metricscollector/v1beta1/file-metricscollector/Dockerfile | |
- component-name: tfevent-metrics-collector | |
dockerfile: cmd/metricscollector/v1beta1/tfevent-metricscollector/Dockerfile |
Moreover, can you add a building script to https://github.com/kubeflow/katib/blob/master/scripts/v1beta1/build.sh and https://github.com/kubeflow/katib/blob/master/scripts/v1beta1/push.sh?
pkg/metricscollector/v1beta1/kfpv1-metricscollector/metrics_loader.py
Outdated
Show resolved
Hide resolved
@tenzen-y Thanks a lot for your comments! |
pkg/metricscollector/v1beta1/kfpv1-metricscollector/metrics_loader.py
Outdated
Show resolved
Hide resolved
pkg/metricscollector/v1beta1/kfpv1-metricscollector/metrics_loader.py
Outdated
Show resolved
Hide resolved
pkg/metricscollector/v1beta1/kfpv1-metricscollector/metrics_loader.py
Outdated
Show resolved
Hide resolved
@votti Thanks for your diligent work. We must add an E2E test for this metrics collector. However, I think we can work on it on follow-up PRs. @kubeflow/wg-automl-leads Can you approve CI? |
Hi Kubeflow team, First sorry to not finding time to work on this earlier. "We must add an E2E test for this metrics collector. However, I think we can work on it on follow-up PRs. As far as I know nobody is actively using this (yet), so I think there would be time for adding an E2E together with this PR making it easier to maintain long-term. |
@votti Thanks for the updates! I'll check this PR again later.
Maybe, we need to update
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates! I left a few comments.
Also, once adding tests are done, can you send a ping to me?
pool_interval=opt.poll_interval, | ||
timout=opt.timeout, | ||
wait_all=wait_all_processes, | ||
completed_marked_dir=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we set None
to completed/marked_dir
? Can we set opt.metrics_file_dir
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation on this is a bit sparse, but if I understand the code right, this would require the Kubeflow pipeline to write a file <pid>.pid
with some TRAINING_COMPLETED
text into this directory which it does not do:
katib/pkg/metricscollector/v1beta1/common/pns.py
Lines 95 to 104 in f740889
if completed_marked_dir: | |
mark_file = os.path.join(completed_marked_dir, "{}.pid".format(pid)) | |
# Check if file contains "completed" marker | |
with open(mark_file) as file_obj: | |
contents = file_obj.read() | |
if contents.strip() != const.TRAINING_COMPLETED: | |
raise Exception( | |
"Unable to find marker: {} in file: {} with contents: {} for pid: {}".format( | |
const.TRAINING_COMPLETED, mark_file, contents, pid)) | |
# Add main pid to finished pids set |
So I think None
is correct here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation. Let me check.
pkg/metricscollector/v1beta1/kfp-metricscollector/v1/metrics_loader.py
Outdated
Show resolved
Hide resolved
Closesly modelled after the tfevent-metricscollector. Currently not yet working, as there are issues that the arguments from the `injector_webhoook` are somehow not passed. Addresses: kubeflow#2019
The TrialName can be parse from the pod name. This seems currently a good way to get the trial name. For more discussion see: kubeflow#2109
This example illustrates how a full kfp pipeline can be tuned using Katib. It is based on a metrics collector to collect kubeflow pipeline metrics (kubeflow#2019). This is used as a Custom Collector. Addresses: kubeflow#1914, kubeflow#2019
Before the notebook only worked with Python 3.11. Now it is also tested with 3.10 Also the experiment/run name is extended with a timestamp for easier reruns.
Otherwise the image was binarized, leading to an artifically bad performance.
And remove an old comment
Co-authored-by: axel7083 <[email protected]>
As per suggestion
As suggested in the PR review, the generic case where multiple KFP pipeline metrics files would be present in the output folder is supported. Note that in the current KFP v1 implementation always only one data file is present.
As per suggestion this should make it easier to handle the v2 metrics collector in the future as well
This installs Kubeflow pipelines (KFP) if selected to do so in order to run e2e tests where Katib and KFP interact.
This commit should be removed later
These permissions are required such that the katib-controller can launch argo workflows.
25ac27f
to
7d33b7b
Compare
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: votti The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Sorry for not finding more time to work on this. Status e2e tests: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this great contribution @votti!
Please can you let me know why do we need separate Metrics Collector for KFP ?
Why we can't use our default Metrics Collector in a File mode when you can specify file from which we should parse the metrics ?
In that case, Metrics Collector doesn't parse Stdout, and read data from the file.
@@ -0,0 +1,24 @@ | |||
FROM python:3.10-slim |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please let's use the same structure as for other metrics collectors:
cmd/metricscollector/v1beta1/kfp-metricscollector/Dockerfile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich My concern is where we will put Dockerfile for KFP v2. So I would suggest we put Dockerfile for the KFP v1 on here.
wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see. Do we really need to support KFP v1 if, eventually, every Kubeflow users should migrate to KFP v2 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because KFP v1 and KFP v2 aren't compatible, I think migrating v1 to v2 is harder in production.
So I guess users need a lot of time to update the version.
Hence, supporting KFP v1 in Katib would be useful. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, in any case I still have a question ( #2118 (review)) why do we need separate Metrics Collector for KFP if we need to just read the logs from the metrics file ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reason for restricting metrics file configuration in one line?
@zijianjoy Katib metrics collector parses metrics file line by line and expects metrics name and value to be located in a single line.
@votti from the log line I can see that metrics are written to /tmp/argo/outputs/artifacts/mlpipeline-metrics.tgz
file, isn't it ?
Btw: If you wondering why I dont just use the Stdout collector and in addition print the metrics to the log: this is because this also broke the argo command:
@votti Yeah, this could be an issue since we override the start command to make sure we redirect StdOut to /var/log/katib/metrics.log
file, so Katib Metrics Collector can parse this file. Otherwise, Metrics Collector can't parse the StdOut. The main differences between StdOut and File metrics collector is that StdOut tails /var/log/katib/metrics.log
file and prints logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@metrics file:
One of the complexities kubeflow pipeline
manages is to handle output artifacts (usually compressing them and storing them and saving them to an s3 storage). This is what seems to be broken when using the filecollector
, as something while compressing and copying the file to /tmp/argo/outputs/artifacts/mlpipeline-metrics.tgz
seems to go wrong.
After finding some time to look into it, I think the reason is very similar to the stdout
collector:
The collector modifies the argo CMD/ARG in a way that I think causes these issues:
From the pod definition: Unmodified (eg when using the kubeflow custom metrics collector):
...
_outputs = train_e2e(**_parsed_args)
Args:
--input-nr
/tmp/inputs/input_nr/data
--lr
0.0005293023468535503
--optimizer
Adam
--loss
categorical_crossentropy
--epochs
3
--batch-size
36
--mlpipeline-metrics
/tmp/outputs/mlpipeline_metrics/data
When using the filecollector
as metrics collector:
...
_outputs = train_e2e(**_parsed_args)
--input-nr /tmp/inputs/input_nr/data --lr 0.00021802007326291811 --optimizer Adam --loss categorical_crossentropy --epochs 3 --batch-size 53 --mlpipeline-metrics /tmp/outputs/mlpipeline_metrics/data && echo completed > /tmp/outputs/mlpipeline_metrics/$$$$.pid
I think this could be solved by following this proposal: #2181
Until this is fixed, I think having a custom metrics collector that does not modify the command is a necessary workaround.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@votti I think this also could be solved with this feature, isn't: #577 ?
Basically, we can use Katib SDK to implement API for pushing metrics for Katib DB instead of using pull-based metrics collectors which require to change entrypoint.
User will require to report metrics in their Objective Training function.
For example:
import kubeflow.katib as katib
client = katib.KatibClient()
client.report(metrics={"accuracy": 0.9, "loss": 0.01"})
We might need to do additional changes to Katib controller to verify that metrics were reported by user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Push based metrics collection: that sounds like a good potential solution!
So the KatibClient
can infer automatically which trial
these metrics
are associated with?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@votti Currently, user can get the Trial name using ${trialSpec.Name}
template in their Trial's Pod environment vars. Then, user can run KatibClient
API with appropriate Trial Name to insert metrics to the Katib DB.
I think, we should always add TRIAL_NAME
env to the Trial pod since it is useful for many use-cases (e.g. for exporting trained model to S3, saving Trial metrics to DB, etc.)
WDYT @tenzen-y @johnugeorge @votti ?
This adds a dummy e2e example that can be used to test the main functionality.
This reverts commit 36ed372.
This could be used for e2e testing
Otherwise the patching of the `katib-controller` cluster role would not work.
This enables the user to set th version of the KFP version which should be useful to use this script to install KFP v1 and v2 without additional parameters.
This is required for kubeflow pipelines as I found no easy way to install kubeflow pipelines into the `default` workspace that was previously the hardcoded one. Now the namespace can be passed as a parameter.
2254867
to
cc90afd
Compare
This action should now run the kubeflow pipeline v1 e2e example. This required the extension of the `template-e2e-test` to include parameters to a) install kfp b) select the `kubeflow` namespace (instead of default) to run the tests with.
cc90afd
to
582a6a7
Compare
I tried now to implement an |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This pull request has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it. |
What this PR does / why we need it:
Currently there is no example how to do parameter tuning over whole Kubeflow pipelines (#1914).
This example shows a how parameter tuning with Katib can be done on a Kubeflow pipeline based on the pipeline v1 SDK.
It adds and uses a new Metrics Collector (
kfpv1-metricscollector
, #2019) that is used as a custom metrics collector in the pipeline.The
kfpv1-metricscollector
image is also published to thekatib
docker environment.Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #1914, #2019
Checklist: