Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

if experiment name is too long the suggestion service can't start #2454

Open
garymm opened this issue Nov 12, 2024 · 17 comments · May be fixed by #2468
Open

if experiment name is too long the suggestion service can't start #2454

garymm opened this issue Nov 12, 2024 · 17 comments · May be fixed by #2468
Assignees
Labels
good first issue Good for newcomers help wanted Extra attention is needed kind/bug

Comments

@garymm
Copy link

garymm commented Nov 12, 2024

What happened?

I provided an experiment name that was 57 characters long.
It got stuck waiting for trials to be created because the suggestion service couldn't be started because the name was more than 63 characters long.

What did you expect to happen?

Katib to pick a valid name for the service.

Environment

Kubernetes version:

$ kubectl version
Client Version: v1.29.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.3

Katib controller version:

$ kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"
docker.io/kubeflowkatib/katib-controller:v0.17.0

Katib Python SDK version:

$ pip show kubeflow-katib
Name: kubeflow-katib
Version: 0.17.0

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍

@garymm
Copy link
Author

garymm commented Nov 12, 2024

I may be interested in contributing a fix if it would be welcome and some guidance could be provided as to how to go about it.

@andreyvelich
Copy link
Member

Thank you for creating this @garymm!
Yes, it's a great idea, we can just add the Experiment validation here: https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/experiment/validator/validator.go#L81

/assign @garymm

Feel free to reach out if you have any questions.

@andreyvelich
Copy link
Member

/remove-label lifecycle/needs-triage

@garymm
Copy link
Author

garymm commented Nov 15, 2024

What exactly should the validaton be though? I think the better fix is to name the experiment service in a way that is guaranteed to be legal. Where might that happen in the code?

@andreyvelich
Copy link
Member

andreyvelich commented Nov 30, 2024

@garymm Our longest algorithm name is: bayesianoptimization and the Suggestion's Deployment name is <Experiment_name>-<Algorithm_Name>.
So I suggest that we validate that Experiment name is no longer than 40 symbols just to be safe.

You just simple need to add this additional check here: https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/experiment/validator/validator.go#L85

E.g. || len(instance.name) > 40.

@andreyvelich
Copy link
Member

/good-first-issue

Copy link

@andreyvelich:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@google-oss-prow google-oss-prow bot added good first issue Good for newcomers help wanted Extra attention is needed labels Nov 30, 2024
@AydanPirani
Copy link

Hello! @andreyvelich has a PR been made to fix this issue? If not, I would be open to helping out as well :)

@garymm
Copy link
Author

garymm commented Dec 9, 2024

There are no PRs open that reference this issue.

@andreyvelich
Copy link
Member

@AydanPirani Yes, please feel free to submit a PR
/assign @AydanPirani

@AydanPirani
Copy link

Sorry for the delay, will get this up and going ASAP :)

@AydanPirani
Copy link

Hi @garymm (cc @andreyvelich)! I'm trying to reproduce this locally, and I had a few questions:

  1. What command did you run to get Katib version? I get the following when I run the command you included:
kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"

docker.io/kubeflowkatib/katib-controller:v0.17.0
docker.io/kubeflowkatib/katib-controller:latest
zsh: no such file or directory: docker.io/kubeflowkatib/katib-controller:v0.17.0`

  1. Is there anything specific you did to run the experiment? A link to the yaml file would be nice, so I can also see the same results on my end :)

Thanks!

@garymm
Copy link
Author

garymm commented Dec 17, 2024

I ran:

kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"

Just create an experiment with a long name, such that experiment name + algorithm name > 63 characters.

@andreyvelich
Copy link
Member

@garymm Is right, this command prints the version of Katib controller which is equal to Katib version (e.g. v0.17.0).
@AydanPirani After you make appropriate changes to the validation webhook, you need to build Katib Controller, and re-use this image in your deployment.
That should help you to test your code changes in the local Kind/Minikube cluster.

@AydanPirani
Copy link

AydanPirani commented Dec 18, 2024

@andreyvelich Got it, thanks! Also, can you please point me to the docs page re. deployment? I'm looking at this, but I don't see much about local Docker builds...

For clarity - I got the Docker build to work, I now have an image katib-master (that follows the master branch), how do I run this locally?

Thanks again for the help, I appreciate it!

@andreyvelich
Copy link
Member

@AydanPirani You can find some info here: https://github.com/kubeflow/katib/blob/master/CONTRIBUTING.md#build-from-source-code, but we don't really explain how to update images.
You just need to take your local image and use it in the kustomize installs: https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/installs/katib-standalone/kustomization.yaml#L21-L23

After that, you can follow similar steps as for Training Operator to deploy Kind cluster locally and Katib using the standalone Kustomize overlay: https://github.com/kubeflow/training-operator/blob/master/CONTRIBUTING.md#run-a-kubernetes-cluster.

@AydanPirani
Copy link

AydanPirani commented Dec 21, 2024

Hi all,

Still working on a repro from my end.

@garymm Does the following YAML file lead to the same error mentioned above? Just want to confirm that we both see the same thing!

Update: I have a local repro, getting the PR up now.

@AydanPirani AydanPirani linked a pull request Dec 23, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers help wanted Extra attention is needed kind/bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants