
[sdk] enable_caching breaks when using CreatePVC: must specify FingerPrint #10188

Open
TobiasGoerke opened this issue Oct 31, 2023 · 17 comments
TobiasGoerke commented Oct 31, 2023

Environment

  • KFP version:
    2.0.3 (manifests v1.8 release)
  • KFP SDK version:
kfp                      2.4.0
kfp-kubernetes           1.0.0
kfp-pipeline-spec        0.2.2
kfp-server-api           2.0.3

Steps to reproduce

Given the following example:

from kfp import dsl
from kfp import kubernetes
from kfp import Client


@dsl.component
def test_step():
    print("Hello world")


@dsl.pipeline
def test_pipeline():
    kubernetes.CreatePVC(
        access_modes=["ReadWriteOnce"],
        size="10Mi",
        storage_class_name="default",
    )
    test_step()


# client must point at your KFP endpoint, e.g. Client(host="http://...")
client = Client()
client.create_run_from_pipeline_func(test_pipeline, arguments={}, enable_caching=False)

The pipeline will fail. Note enable_caching: the issue occurs only when it is set to False.

We will see an error in the created PVC step:

F1031 14:29:54.216337 27 main.go:76] KFP driver: driver.Container(pipelineName=test-pipeline, runID=02ad61d6-8b9b-47a7-b626-0d65f3838b42, task="createpvc", component="comp-createpvc", dagExecutionID=9094, componentSpec) failed: failed to create PVC and publish execution createpvc: failed to create cache entrty for create pvc: failed to create task: rpc error: code = InvalidArgument desc = Failed to create a new task due to validation error: Invalid input error: Invalid task: must specify FingerPrint
time="2023-10-31T14:29:54.940Z" level=info msg="sub-process exited" argo=true error="<nil>"
time="2023-10-31T14:29:54.940Z" level=error msg="cannot save parameter /tmp/outputs/pod-spec-patch" argo=true error="open /tmp/outputs/pod-spec-patch: no such file or directory"
time="2023-10-31T14:29:54.940Z" level=error msg="cannot save parameter /tmp/outputs/cached-decision" argo=true error="open /tmp/outputs/cached-decision: no such file or directory"
time="2023-10-31T14:29:54.940Z" level=error msg="cannot save parameter /tmp/outputs/condition" argo=true error="open /tmp/outputs/condition: no such file or directory"
Error: exit status 1

Impacted by this bug? Give it a 👍.

@zijianjoy (Collaborator)

@TobiasGoerke what is the version of your KFP runtime? Maybe there is a bug when resolving cache key in the PVC creation operation. cc @chensun to learn more.

@TobiasGoerke (Contributor, Author)

> @TobiasGoerke what is the version of your KFP runtime? Maybe there is a bug when resolving cache key in the PVC creation operation. cc @chensun to learn more.

I'm on manifests/v1.8-branch, i.e. 2.0.3.

@yingding

I am facing exactly the same issue, with the same output, on KFP backend 2.0.3 with the Kubeflow 1.8.0 manifests deployment.
The PVC is created, but the component reports the error below in its logs and exits with an error.

F1117 21:35:33.015147      22 main.go:76] KFP driver: driver.Container(pipelineName=my-pipeline, runID=cd147529-1b6c-454b-b3e1-b2858ff98222, task="createpvc", component="comp-createpvc", dagExecutionID=29, componentSpec) failed: failed to create PVC and publish execution createpvc: failed to create cache entrty for create pvc: failed to create task: rpc error: code = InvalidArgument desc = Failed to create a new task due to validation error: Invalid input error: Invalid task: must specify FingerPrint
time="2023-11-17T21:35:33.321Z" level=info msg="sub-process exited" argo=true error="<nil>"
time="2023-11-17T21:35:33.322Z" level=error msg="cannot save parameter /tmp/outputs/pod-spec-patch" argo=true error="open /tmp/outputs/pod-spec-patch: no such file or directory"
time="2023-11-17T21:35:33.322Z" level=error msg="cannot save parameter /tmp/outputs/cached-decision" argo=true error="open /tmp/outputs/cached-decision: no such file or directory"
time="2023-11-17T21:35:33.322Z" level=error msg="cannot save parameter /tmp/outputs/condition" argo=true error="open /tmp/outputs/condition: no such file or directory"
Error: exit status 1

@yingding

Just to add some additional info: after hitting this issue, the KFP backend stopped working at all in my case.
I had to restart all the deployments (kubectl -n kubeflow rollout restart deployments) to be able to run v2 pipelines again.

@yingding commented Jan 3, 2024

With api-server 2.0.5 and enable_caching=False, this issue still exists.

  • KFP Backend API-SERVER version:
    2.0.5 (manifests v1.8 release modified)
  • KFP SDK version:
kfp                      2.4.0
kfp-kubernetes           1.0.0
kfp-pipeline-spec        0.2.2
kfp-server-api           2.0.5

@kabartay

> With the api-server 2.0.5 with enable_caching=False, this issue still exists.

@yingding so is it finally working fine for you?

@yingding

@kabartay Unfortunately, this issue still exists, even with

  • KFP Backend API-SERVER version:
    2.0.5 (manifests v1.8 release modified)
  • KFP SDK version:
kfp                           2.6.0
kfp-kubernetes                1.1.0
kfp-pipeline-spec             0.3.0
kfp-server-api                2.0.5

Hopefully, it can be resolved in the next KFP backend API server release.


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Mar 30, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@AnnKatrinBecker

/reopen

It seems this issue has not been resolved yet.


@AnnKatrinBecker: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

> /reopen
>
> Seems this issue has not been resolved, yet.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@HumairAK (Collaborator)

/reopen

@google-oss-prow google-oss-prow bot reopened this May 14, 2024

@HumairAK: Reopened this issue.

In response to this:

> /reopen

@github-actions github-actions bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label May 15, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Jul 14, 2024
@HumairAK (Collaborator)

/remove-lifecycle stale

@google-oss-prow google-oss-prow bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Jul 14, 2024
@haiminh2001 commented Aug 27, 2024

Hi, what is the status of this issue? Has anyone solved it or found a workaround?

@hbelmiro (Contributor)

/assign

@hbelmiro hbelmiro removed their assignment Sep 26, 2024
Status: Needs triage · 8 participants