Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add kserve raw reconciler #250

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

VedantMahabaleshwarkar
Copy link
Contributor

@VedantMahabaleshwarkar VedantMahabaleshwarkar commented Aug 5, 2024

Description

https://issues.redhat.com/browse/RHOAIENG-10293

How Has This Been Tested?

spec:
  components:
    codeflare:
      managementState: Removed
    kserve:
      defaultDeploymentMode: RawDeployment
      devFlags:
        manifests:
          - contextDir: config
            sourcePath: overlays/odh
            uri: 'https://github.com/opendatahub-io/kserve/tarball/master'
          - contextDir: config
            sourcePath: ''
            uri: 'https://github.com/VedantMahabaleshwarkar/odh-model-controller/tarball/devflags'
      managementState: Managed
      serving:
        ingressGateway:
          certificate:
            type: OpenshiftDefaultIngress
        managementState: Removed
        name: knative-serving
    modelregistry:
      managementState: Removed
    trustyai:
      managementState: Removed
    ray:
      managementState: Removed
    kueue:
      managementState: Removed
    workbenches:
      managementState: Removed
    dashboard:
      managementState: Managed
    modelmeshserving:
      managementState: Removed
    datasciencepipelines:
      managementState: Removed
    trainingoperator:
      managementState: Removed

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Signed-off-by: Vedant Mahabaleshwarkar <[email protected]>
Copy link
Contributor

openshift-ci bot commented Aug 5, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link
Contributor

openshift-ci bot commented Aug 5, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: VedantMahabaleshwarkar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [VedantMahabaleshwarkar]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment on lines -41 to -42
NewKServeMetricsServiceReconciler(client),
NewKServeMetricsServiceMonitorReconciler(client),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this break existing metrics functionality for serverless?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I got my answer: #250 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I remember is that these reconcilers were left for doing clean-up of some resources that we no longer need (see the old createDesiredResource that just returns nil as the "wanted" resource).

By removing these reconcilers, the clean-up is what will no longer happen for version upgrades. I'm not sure if not doing the clean-up will break something, but I'm quite sure we still want the clean-up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old metrics reconcilers for kserve serverless were present pre 2.8 (EUS release)
Kserve Raw support will happen after 2.13 at minimum (2.12 is a EUS if i'm not wrong)
I can keep an "empty reconciliation" loop around, but the old resources should long by the time kserve raw is GA. This was my reasoning for assuming that it's safe to delete the old reconcilers for serverless

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the upgrade path for any install that still has those reconcilers should look like
2.7 or lower -> 2.8 -> 2.12 (at minimum) -> kserve raw GA.
If i'm mistaken in this assumption I'll re-add the cleanup.

}

if !createRoute {
log.Info("InferenceService does not have '" + constants.LabelEnableRoute + "' annotation set to 'True'. Skipping route creation")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either set to false or does not exist

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't get this, could you elaborate?

}

func (r *KserveRawRouteReconciler) Cleanup(_ context.Context, _ logr.Logger, _ string) error {
// NOOP - resources are deleted together with ISVCs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check existing route that are not created by the controller, e.g. no owner reference?

NoResourceRemoval
client client.Client
serviceMonitorHandler resources.ServiceMonitorHandler
deltaProcessor processors.DeltaProcessor
}

func NewKServeMetricsServiceMonitorReconciler(client client.Client) *KserveMetricsServiceMonitorReconciler {
return &KserveMetricsServiceMonitorReconciler{
func NewRawKServeMetricsServiceMonitorReconciler(client client.Client) *KserveRawMetricsServiceMonitorReconciler {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe keep the same but skips reconciliation for serverless mode?

@@ -17,6 +17,7 @@ package reconcilers

import (
"context"
"github.com/hashicorp/go-multierror"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to the next group of imports.

func (r *KserveRawInferenceServiceReconciler) Reconcile(ctx context.Context, log logr.Logger, isvc *kservev1beta1.InferenceService) error {
var reconcileErrors *multierror.Error
for _, reconciler := range r.subResourceReconcilers {
reconcileErrors = multierror.Append(reconcileErrors, reconciler.Reconcile(ctx, log, isvc))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also will need to implement OnDeletionOfKserveInferenceService and CleanupNamespaceIfNoKserveIsvcExists.

Copy link
Contributor Author

@VedantMahabaleshwarkar VedantMahabaleshwarkar Aug 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep those changes are made already :D, they will be a part of the next set of commits

Comment on lines -41 to -42
NewKServeMetricsServiceReconciler(client),
NewKServeMetricsServiceMonitorReconciler(client),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I remember is that these reconcilers were left for doing clean-up of some resources that we no longer need (see the old createDesiredResource that just returns nil as the "wanted" resource).

By removing these reconcilers, the clean-up is what will no longer happen for version upgrades. I'm not sure if not doing the clean-up will break something, but I'm quite sure we still want the clean-up.

if err != nil {
return nil, err
}
metricsService := &v1.Service{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Raw, isn't it better to inject the serving.kserve.io/enable-prometheus-scraping: "true" as documented in https://kserve.github.io/website/master/modelserving/observability/prometheus_metrics/ ?

I'm not sure we should be doing our own way, given the push for better API support in ODH.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this puts the onus of enabling metrics for a inferenceservice on the creator of the inferenceservice, rather than the platform. Even if we add logic to inject this in the UI flow, if a user is using yamls then they would have to specifically remember to enable metrics. In ODH/RHOAI, I'm guessing we always want metrics and the user should not have to worry about remembering to enable them.
But I can be convinced otherwise :)
cc @danielezonca @terrytangyuan WDYT

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we could enable it by default in the ConfigMap: https://github.com/opendatahub-io/kserve/blob/ae30b971320f45bc98dd362b6b8873d7c1a48873/config/configmap/inferenceservice.yaml#L447. But doing that would also affect Serverless deployments, so not sure if it will impact something.

What I don't know is if this is specific to Serverless, because docs only refer to the Serverless use-case.

In case it works for Raw, I said to inject meaning we could use a mutating webhook in odh-model-controller to inject the annotation on the ISVC. I wasn't meaning that the UI should do it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@israel-hdez revisiting this comment, we actively moved away from the kserve native annotations for metrics i.e serving.kserve.io/enable-prometheus-scraping: "true" for Serverless, and we have instead opted for prometheus.io/port and prometheus.io/path annotations on the ServingRuntime instead.
So going back to serving.kserve.io/enable-prometheus-scraping: "true" for RawDeployment would IMO be more confusing. Creating a SeviceMonitor and Service for metrics will allow us to be consistent w.r.t metrics within RHOAI.

}
}

func (r *KserveRawRouteReconciler) Reconcile(ctx context.Context, log logr.Logger, isvc *kservev1beta1.InferenceService) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For exposing, KServe supports creating Ingress resources, but we have it turned off. OpenShift can follow-up and create Routes.

Given the push for better supporting the API, rather than doing our own thing, we should try to align better with upstream.

By creating Routes on our side, KServe is no longer aware. So, it won't expose what's the endpoint in status.url of the InferenceService. This will just increase complexity on dashboard code. Also, it will be more complex for users because the status.url field will be useless for Raw, but will be useful for Serverless.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah nice, I realized the thing about the status.url myself but didn't realize an option existed to do this in a way such that Kserve is aware. I'll incorporate this.

Signed-off-by: Vedant Mahabaleshwarkar <[email protected]>
@VedantMahabaleshwarkar
Copy link
Contributor Author

/retest-required

Copy link
Contributor

openshift-ci bot commented Aug 19, 2024

@VedantMahabaleshwarkar: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/images 051e5f8 link true /test images

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@VedantMahabaleshwarkar
Copy link
Contributor Author

not sure why the images job is failing, I think it's likely a infra issue. Local builds are succeeding

@spolti
Copy link
Member

spolti commented Aug 21, 2024

/hold

@openshift-merge-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mwaykole
Copy link
Member

mwaykole commented Aug 30, 2024

hi @VedantMahabaleshwarkar could u pleas point me where is the unit test for this part of code ?

@VedantMahabaleshwarkar
Copy link
Contributor Author

@mwaykole about unit tests:

  • For Kserve RawDeployment Route : According to the conversation here , the route creation will be done by enabling the ingress creation in the inferenceservice-config ConfigMap. So here odh-model-controller is no longer involved. No code, no unit tests
  • For Kserve Raw Route Auth : this PR is not currently targetting that,
  • For Kserve RawDeployment Metrics : the existing test for metrics (controllers/kserve_inferenceservice_controller_metrics_test.go) was modified to add test cases for RawDeployment, the changes are in this PR.

Hope that helps

)

const (
inferenceServiceLabelName = "serving.kserve.io/inferenceservice"
)

var _ SubResourceReconciler = (*KserveMetricsServiceReconciler)(nil)
var _ SubResourceReconciler = (*KserveRawMetricsServiceReconciler)(nil)
Copy link
Member

@mwaykole mwaykole Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we will support Kserve Raw Metrics as well , and it should work same as server less ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants