[BUG] (Kubeflow) PyTorchPlugin sets Replicas to 0 causing infinite loop #5417
The issue you're experiencing with the PyTorchPlugin setting replicas to 0, causing an infinite loop, is likely related to the configuration or deployment of the PyTorch operator in your Flyte setup. According to this discussion, ensuring the PyTorch operator is correctly deployed in your Kubernetes cluster is crucial. Additionally, it is worth checking for any recent changes or refactoring in the plugin, as mentioned in another discussion. Would you like more details on how to verify and potentially resolve this issue?
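For instance, a quick generic check that the operator is actually deployed (a sketch using the labels and CRD name that appear later in this thread):

  # Confirm the training operator pod is running in the kubeflow namespace
  kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator
  # Confirm the PyTorchJob CRD is installed
  kubectl get crd pytorchjobs.kubeflow.org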
@Future-Outlier, can you confirm which version of Flyte and flytekit you were running?
Flyte: master branch with single binary
@Future-Outlier, can you try this out with Flyte 1.10.7? You can start single-binary via flytectl and point to a release:
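(For reference, a minimal sketch of starting the single binary via flytectl against a pinned image; the registry path and tag below are assumptions, not the exact command used in this thread:)

  # Hypothetical invocation; replace the tag with the one for the release under test
  flytectl demo start --image cr.flyte.org/flyteorg/flyte-sandbox-bundled:latest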
@Future-Outlier, sorry for the late reply. We require the prefix …
No problem, doing it now; I'll share the result in about an hour.
Hi @eapolinario, it works. Setup process:
001-plugins.yaml: |
tasks:
task-plugins:
default-for-task-types:
container: container
container_array: k8s-array
sidecar: sidecar
pytorch: pytorch
enabled-plugins:
- container
- sidecar
- k8s-array
- agent-service
- pytorch
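After editing the plugin config, the single binary has to be restarted to pick it up; a minimal sketch, assuming the default sandbox deployment name and namespace:

  # Deployment name and namespace assumed from a default flytectl sandbox install
  kubectl -n flyte rollout restart deployment/flyte-sandbox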
@Future-Outlier, which version of the Kubeflow training operator are you running? Also, can you paste the pod definition here?
Hi @eapolinario:

(dev) future@outlier ~ % kubectl describe pod training-operator-984cfd546-2jn65 -n kubeflow
Name: training-operator-984cfd546-2jn65
Name: training-operator-984cfd546-2jn65
Namespace: kubeflow
Priority: 0
Service Account: training-operator
Node: 481e7e029920/172.17.0.2
Start Time: Wed, 12 Jun 2024 09:39:45 +0800
Labels: control-plane=kubeflow-training-operator
pod-template-hash=984cfd546
Annotations: sidecar.istio.io/inject: false
Status: Running
IP: 10.42.0.10
IPs:
IP: 10.42.0.10
Controlled By: ReplicaSet/training-operator-984cfd546
Containers:
training-operator:
Container ID: containerd://1f46d23264f737a7f07c26d087b91ca890a5862fd9fcf2d0b8a95c479c5db343
Image: kubeflow/training-operator:v1-855e096
Image ID: docker.io/kubeflow/training-operator@sha256:725f0adb8910336625566b391bba35391d712c0ffff6a4be02863cebceaa7cf8
Port: 8080/TCP
Host Port: 0/TCP
Command:
/manager
State: Running
Started: Wed, 12 Jun 2024 09:39:56 +0800
Ready: True
Restart Count: 0
Liveness: http-get http://:8081/healthz delay=15s timeout=3s period=20s #success=1 #failure=3
Readiness: http-get http://:8081/readyz delay=10s timeout=3s period=15s #success=1 #failure=3
Environment:
MY_POD_NAMESPACE: kubeflow (v1:metadata.namespace)
MY_POD_NAME: training-operator-984cfd546-2jn65 (v1:metadata.name)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tjwpd (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-tjwpd:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 43s default-scheduler Successfully assigned kubeflow/training-operator-984cfd546-2jn65 to 481e7e029920
Normal Pulling 43s kubelet Pulling image "kubeflow/training-operator:v1-855e096"
Normal Pulled 32s kubelet Successfully pulled image "kubeflow/training-operator:v1-855e096" in 10.99020538s
Normal Created 32s kubelet Created container training-operator
Normal Started 32s kubelet Started container training-operator
@Future-Outlier, thanks for being so thorough. As a final step, can you paste the PyTorchJob CR object created as part of the pytorch job, and also the task pods? I just want to make sure the values are reflected there.
The PyTorchJob CR object:

(dev) future@outlier ~ % kubectl get crd pytorchjobs.kubeflow.org
NAME                       CREATED AT
pytorchjobs.kubeflow.org   2024-06-13T01:59:15Z

Name: f591aa743583746998b7-fg3djuyi-0
Namespace: flytesnacks-development
Labels: domain=development
execution-id=f591aa743583746998b7
interruptible=false
node-id=pytorchexamplemnistpytorchjob
project=flytesnacks
shard-key=22
task-name=pytorch-example-mnist-pytorch-job
workflow-name=flytegen-pytorch-example-mnist-pytorch-job
Annotations: cluster-autoscaler.kubernetes.io/safe-to-evict: false
API Version: kubeflow.org/v1
Kind: PyTorchJob
Metadata:
Creation Timestamp: 2024-06-13T02:04:25Z
Generation: 1
Owner References:
API Version: flyte.lyft.com/v1alpha1
Block Owner Deletion: true
Controller: true
Kind: flyteworkflow
Name: f591aa743583746998b7
UID: 4d6c99c5-81c3-4ef3-9617-d3435fe06bc3
Resource Version: 1038
UID: ca0f76c8-b3d1-4f85-8f0f-8b9dff9d99d2
Spec:
Pytorch Replica Specs:
Master:
Replicas: 1
Restart Policy: Never
Template:
Metadata:
Annotations:
cluster-autoscaler.kubernetes.io/safe-to-evict: false
Labels:
Domain: development
Execution - Id: f591aa743583746998b7
Interruptible: false
Node - Id: pytorchexamplemnistpytorchjob
Project: flytesnacks
Shard - Key: 22
Task - Name: pytorch-example-mnist-pytorch-job
Workflow - Name: flytegen-pytorch-example-mnist-pytorch-job
Spec:
Affinity:
Containers:
Args:
pyflyte-fast-execute
--additional-distribution
s3://my-s3-bucket/flytesnacks/development/ITAMN37CAV3JQW7JGLGIN66WMI======/script_mode.tar.gz
--dest-dir
.
--
pyflyte-execute
--inputs
s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/inputs.pb
--output-prefix
s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/0
--raw-output-data-prefix
s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0
--checkpoint-path
s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0/_flytecheckpoints
--prev-checkpoint
""
--resolver
flytekit.core.python_auto_container.default_task_resolver
--
task-module
pytorch_example
task-name
mnist_pytorch_job
Env:
Name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
Value: flytesnacks:development:.flytegen.pytorch_example.mnist_pytorch_job
Name: FLYTE_INTERNAL_EXECUTION_ID
Value: f591aa743583746998b7
Name: FLYTE_INTERNAL_EXECUTION_PROJECT
Value: flytesnacks
Name: FLYTE_INTERNAL_EXECUTION_DOMAIN
Value: development
Name: FLYTE_ATTEMPT_NUMBER
Value: 0
Name: FLYTE_INTERNAL_TASK_PROJECT
Value: flytesnacks
Name: FLYTE_INTERNAL_TASK_DOMAIN
Value: development
Name: FLYTE_INTERNAL_TASK_NAME
Value: pytorch_example.mnist_pytorch_job
Name: FLYTE_INTERNAL_TASK_VERSION
Value: TPtCnpd9zLfeKcUJ5IeFDw
Name: FLYTE_INTERNAL_PROJECT
Value: flytesnacks
Name: FLYTE_INTERNAL_DOMAIN
Value: development
Name: FLYTE_INTERNAL_NAME
Value: pytorch_example.mnist_pytorch_job
Name: FLYTE_INTERNAL_VERSION
Value: TPtCnpd9zLfeKcUJ5IeFDw
Name: FLYTE_AWS_SECRET_ACCESS_KEY
Value: miniostorage
Name: FLYTE_AWS_ENDPOINT
Value: http://flyte-sandbox-minio.flyte:9000
Name: FLYTE_AWS_ACCESS_KEY_ID
Value: minio
Image: localhost:30000/torch-0611:latest
Name: pytorch
Resources:
Limits:
Cpu: 500m
Memory: 500Mi
Requests:
Cpu: 500m
Memory: 500Mi
Termination Message Policy: FallbackToLogsOnError
Restart Policy: Never
Worker:
Replicas: 2
Restart Policy: Never
Template:
Metadata:
Annotations:
cluster-autoscaler.kubernetes.io/safe-to-evict: false
Labels:
Domain: development
Execution - Id: f591aa743583746998b7
Interruptible: false
Node - Id: pytorchexamplemnistpytorchjob
Project: flytesnacks
Shard - Key: 22
Task - Name: pytorch-example-mnist-pytorch-job
Workflow - Name: flytegen-pytorch-example-mnist-pytorch-job
Spec:
Affinity:
Containers:
Args:
pyflyte-fast-execute
--additional-distribution
s3://my-s3-bucket/flytesnacks/development/ITAMN37CAV3JQW7JGLGIN66WMI======/script_mode.tar.gz
--dest-dir
.
--
pyflyte-execute
--inputs
s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/inputs.pb
--output-prefix
s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/0
--raw-output-data-prefix
s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0
--checkpoint-path
s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0/_flytecheckpoints
--prev-checkpoint
""
--resolver
flytekit.core.python_auto_container.default_task_resolver
--
task-module
pytorch_example
task-name
mnist_pytorch_job
Env:
Name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
Value: flytesnacks:development:.flytegen.pytorch_example.mnist_pytorch_job
Name: FLYTE_INTERNAL_EXECUTION_ID
Value: f591aa743583746998b7
Name: FLYTE_INTERNAL_EXECUTION_PROJECT
Value: flytesnacks
Name: FLYTE_INTERNAL_EXECUTION_DOMAIN
Value: development
Name: FLYTE_ATTEMPT_NUMBER
Value: 0
Name: FLYTE_INTERNAL_TASK_PROJECT
Value: flytesnacks
Name: FLYTE_INTERNAL_TASK_DOMAIN
Value: development
Name: FLYTE_INTERNAL_TASK_NAME
Value: pytorch_example.mnist_pytorch_job
Name: FLYTE_INTERNAL_TASK_VERSION
Value: TPtCnpd9zLfeKcUJ5IeFDw
Name: FLYTE_INTERNAL_PROJECT
Value: flytesnacks
Name: FLYTE_INTERNAL_DOMAIN
Value: development
Name: FLYTE_INTERNAL_NAME
Value: pytorch_example.mnist_pytorch_job
Name: FLYTE_INTERNAL_VERSION
Value: TPtCnpd9zLfeKcUJ5IeFDw
Name: FLYTE_AWS_ENDPOINT
Value: http://flyte-sandbox-minio.flyte:9000
Name: FLYTE_AWS_ACCESS_KEY_ID
Value: minio
Name: FLYTE_AWS_SECRET_ACCESS_KEY
Value: miniostorage
Image: localhost:30000/torch-0611:latest
Name: pytorch
Resources:
Limits:
Cpu: 500m
Memory: 500Mi
Requests:
Cpu: 500m
Memory: 500Mi
Termination Message Policy: FallbackToLogsOnError
Restart Policy: Never
Run Policy:
Suspend: false
Status:
Completion Time: 2024-06-13T02:04:41Z
Conditions:
Last Transition Time: 2024-06-13T02:04:25Z
Last Update Time: 2024-06-13T02:04:25Z
Message: PyTorchJob f591aa743583746998b7-fg3djuyi-0 is created.
Reason: PyTorchJobCreated
Status: True
Type: Created
Last Transition Time: 2024-06-13T02:04:27Z
Last Update Time: 2024-06-13T02:04:27Z
Message: PyTorchJob f591aa743583746998b7-fg3djuyi-0 is running.
Reason: PyTorchJobRunning
Status: False
Type: Running
Last Transition Time: 2024-06-13T02:04:41Z
Last Update Time: 2024-06-13T02:04:41Z
Message: PyTorchJob f591aa743583746998b7-fg3djuyi-0 is successfully completed.
Reason: PyTorchJobSucceeded
Status: True
Type: Succeeded
Replica Statuses:
Master:
Selector: training.kubeflow.org/job-name=f591aa743583746998b7-fg3djuyi-0,training.kubeflow.org/operator-name=pytorchjob-controller,training.kubeflow.org/replica-type=master
Succeeded: 1
Worker:
Succeeded: 2
Start Time: 2024-06-13T02:04:26Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreatePod 16m pytorchjob-controller Created pod: f591aa743583746998b7-fg3djuyi-0-master-0
Normal SuccessfulCreateService 16m pytorchjob-controller Created service: f591aa743583746998b7-fg3djuyi-0-master-0
Normal SuccessfulCreatePod 16m pytorchjob-controller Created pod: f591aa743583746998b7-fg3djuyi-0-worker-0
Warning SettedPodTemplateRestartPolicy 16m (x3 over 16m) pytorchjob-controller Restart policy in pod template will be overwritten by restart policy in replica spec
Normal SuccessfulCreatePod 16m pytorchjob-controller Created pod: f591aa743583746998b7-fg3djuyi-0-worker-1
Normal SuccessfulCreateService 16m pytorchjob-controller Created service: f591aa743583746998b7-fg3djuyi-0-worker-0
Normal SuccessfulCreateService 16m pytorchjob-controller Created service: f591aa743583746998b7-fg3djuyi-0-worker-1
Normal ExitedWithCode 16m (x3 over 16m) pytorchjob-controller Pod: flytesnacks-development.f591aa743583746998b7-fg3djuyi-0-master-0 exited with code 0
Normal ExitedWithCode 16m (x2 over 16m) pytorchjob-controller Pod: flytesnacks-development.f591aa743583746998b7-fg3djuyi-0-worker-0 exited with code 0
Normal PyTorchJobSucceeded 16m pytorchjob-controller PyTorchJob f591aa743583746998b7-fg3djuyi-0 is successfully completed.
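A quick way to spot-check the replica counts on the CR directly (the field path follows the PyTorchJob schema shown above):

  # Prints 2 on this healthy setup; the bug report below shows it ending up as 0
  kubectl -n flytesnacks-development get pytorchjob f591aa743583746998b7-fg3djuyi-0 \
    -o jsonpath='{.spec.pytorchReplicaSpecs.Worker.replicas}'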
The PyTorch job task pods. Master pod:

(dev) future@outlier ~ % kubectl describe pod f591aa743583746998b7-fg3djuyi-0-master-0 -n flytesnacks-development
Name: f591aa743583746998b7-fg3djuyi-0-master-0
Namespace: flytesnacks-development
Priority: 0
Service Account: default
Node: 1002541774d8/172.17.0.2
Start Time: Thu, 13 Jun 2024 10:04:25 +0800
Labels: domain=development
execution-id=f591aa743583746998b7
interruptible=false
node-id=pytorchexamplemnistpytorchjob
project=flytesnacks
shard-key=22
task-name=pytorch-example-mnist-pytorch-job
training.kubeflow.org/job-name=f591aa743583746998b7-fg3djuyi-0
training.kubeflow.org/job-role=master
training.kubeflow.org/operator-name=pytorchjob-controller
training.kubeflow.org/replica-index=0
training.kubeflow.org/replica-type=master
workflow-name=flytegen-pytorch-example-mnist-pytorch-job
Annotations: cluster-autoscaler.kubernetes.io/safe-to-evict: false
Status: Succeeded
IP: 10.42.0.14
IPs:
IP: 10.42.0.14
Controlled By: PyTorchJob/f591aa743583746998b7-fg3djuyi-0
Containers:
pytorch:
Container ID: containerd://53ff705a7daf13acdd4a29828d4eac82ff3ee64d298fa5d6807643dfd0768ffa
Image: localhost:30000/torch-0611:latest
Image ID: localhost:30000/torch-0611@sha256:f3d76504e47fa1950347721ec159083870494c1602fbfeb9ed86dacf5a6a4d83
Port: 23456/TCP
Host Port: 0/TCP
Args:
pyflyte-fast-execute
--additional-distribution
s3://my-s3-bucket/flytesnacks/development/ITAMN37CAV3JQW7JGLGIN66WMI======/script_mode.tar.gz
--dest-dir
.
--
pyflyte-execute
--inputs
s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/inputs.pb
--output-prefix
s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/0
--raw-output-data-prefix
s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0
--checkpoint-path
s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0/_flytecheckpoints
--prev-checkpoint
""
--resolver
flytekit.core.python_auto_container.default_task_resolver
--
task-module
pytorch_example
task-name
mnist_pytorch_job
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 13 Jun 2024 10:04:26 +0800
Finished: Thu, 13 Jun 2024 10:04:38 +0800
Ready: False
Restart Count: 0
Limits:
cpu: 500m
memory: 500Mi
Requests:
cpu: 500m
memory: 500Mi
Environment:
FLYTE_INTERNAL_EXECUTION_WORKFLOW: flytesnacks:development:.flytegen.pytorch_example.mnist_pytorch_job
FLYTE_INTERNAL_EXECUTION_ID: f591aa743583746998b7
FLYTE_INTERNAL_EXECUTION_PROJECT: flytesnacks
FLYTE_INTERNAL_EXECUTION_DOMAIN: development
FLYTE_ATTEMPT_NUMBER: 0
FLYTE_INTERNAL_TASK_PROJECT: flytesnacks
FLYTE_INTERNAL_TASK_DOMAIN: development
FLYTE_INTERNAL_TASK_NAME: pytorch_example.mnist_pytorch_job
FLYTE_INTERNAL_TASK_VERSION: TPtCnpd9zLfeKcUJ5IeFDw
FLYTE_INTERNAL_PROJECT: flytesnacks
FLYTE_INTERNAL_DOMAIN: development
FLYTE_INTERNAL_NAME: pytorch_example.mnist_pytorch_job
FLYTE_INTERNAL_VERSION: TPtCnpd9zLfeKcUJ5IeFDw
FLYTE_AWS_SECRET_ACCESS_KEY: miniostorage
FLYTE_AWS_ENDPOINT: http://flyte-sandbox-minio.flyte:9000
FLYTE_AWS_ACCESS_KEY_ID: minio
PYTHONUNBUFFERED: 1
MASTER_PORT: 23456
PET_MASTER_PORT: 23456
MASTER_ADDR: f591aa743583746998b7-fg3djuyi-0-master-0
PET_MASTER_ADDR: f591aa743583746998b7-fg3djuyi-0-master-0
WORLD_SIZE: 3
RANK: 0
PET_NPROC_PER_NODE: auto
PET_NODE_RANK: 0
PET_NNODES: 3
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-46gf5 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-46gf5:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 5m23s default-scheduler Successfully assigned flytesnacks-development/f591aa743583746998b7-fg3djuyi-0-master-0 to 1002541774d8
Normal Pulling 5m23s kubelet Pulling image "localhost:30000/torch-0611:latest"
Normal Pulled 5m23s kubelet Successfully pulled image "localhost:30000/torch-0611:latest" in 8.59375ms
Normal Created 5m23s kubelet Created container pytorch
Normal Started 5m23s kubelet Started container pytorch

Worker pod:

(dev) future@outlier ~ % kubectl describe pod f591aa743583746998b7-fg3djuyi-0-worker-0 -n flytesnacks-development
Name: f591aa743583746998b7-fg3djuyi-0-worker-0
Namespace: flytesnacks-development
Priority: 0
Service Account: default
Node: 1002541774d8/172.17.0.2
Start Time: Thu, 13 Jun 2024 10:04:25 +0800
Labels: domain=development
execution-id=f591aa743583746998b7
interruptible=false
node-id=pytorchexamplemnistpytorchjob
project=flytesnacks
shard-key=22
task-name=pytorch-example-mnist-pytorch-job
training.kubeflow.org/job-name=f591aa743583746998b7-fg3djuyi-0
training.kubeflow.org/operator-name=pytorchjob-controller
training.kubeflow.org/replica-index=0
training.kubeflow.org/replica-type=worker
workflow-name=flytegen-pytorch-example-mnist-pytorch-job
Annotations: cluster-autoscaler.kubernetes.io/safe-to-evict: false
Status: Succeeded
IP: 10.42.0.15
IPs:
IP: 10.42.0.15
Controlled By: PyTorchJob/f591aa743583746998b7-fg3djuyi-0
Init Containers:
init-pytorch:
Container ID: containerd://8f5cdc0d84ded4e3a22a62a85ef98981e970d2f0ef67feea4afb5e240eabb044
Image: alpine:3.10
Image ID: docker.io/library/alpine@sha256:451eee8bedcb2f029756dc3e9d73bab0e7943c1ac55cff3a4861c52a0fdd3e98
Port: <none>
Host Port: <none>
Command:
sh
-c
err=1;for i in $(seq 100); do if nslookup f591aa743583746998b7-fg3djuyi-0-master-0; then err=0 && break; fi;echo waiting for master; sleep 2; done; exit $err
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 13 Jun 2024 10:04:33 +0800
Finished: Thu, 13 Jun 2024 10:04:33 +0800
Ready: True
Restart Count: 0
Limits:
cpu: 100m
memory: 20Mi
Requests:
cpu: 50m
memory: 10Mi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lw7ww (ro)
Containers:
pytorch:
Container ID: containerd://182ef52d6b7b456586f9a2d37685bd202fe76ff7e094ec50242047f9604b741b
Image: localhost:30000/torch-0611:latest
Image ID: localhost:30000/torch-0611@sha256:f3d76504e47fa1950347721ec159083870494c1602fbfeb9ed86dacf5a6a4d83
Port: 23456/TCP
Host Port: 0/TCP
Args:
pyflyte-fast-execute
--additional-distribution
s3://my-s3-bucket/flytesnacks/development/ITAMN37CAV3JQW7JGLGIN66WMI======/script_mode.tar.gz
--dest-dir
.
--
pyflyte-execute
--inputs
s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/inputs.pb
--output-prefix
s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/0
--raw-output-data-prefix
s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0
--checkpoint-path
s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0/_flytecheckpoints
--prev-checkpoint
""
--resolver
flytekit.core.python_auto_container.default_task_resolver
--
task-module
pytorch_example
task-name
mnist_pytorch_job
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 13 Jun 2024 10:04:34 +0800
Finished: Thu, 13 Jun 2024 10:04:41 +0800
Ready: False
Restart Count: 0
Limits:
cpu: 500m
memory: 500Mi
Requests:
cpu: 500m
memory: 500Mi
Environment:
FLYTE_INTERNAL_EXECUTION_WORKFLOW: flytesnacks:development:.flytegen.pytorch_example.mnist_pytorch_job
FLYTE_INTERNAL_EXECUTION_ID: f591aa743583746998b7
FLYTE_INTERNAL_EXECUTION_PROJECT: flytesnacks
FLYTE_INTERNAL_EXECUTION_DOMAIN: development
FLYTE_ATTEMPT_NUMBER: 0
FLYTE_INTERNAL_TASK_PROJECT: flytesnacks
FLYTE_INTERNAL_TASK_DOMAIN: development
FLYTE_INTERNAL_TASK_NAME: pytorch_example.mnist_pytorch_job
FLYTE_INTERNAL_TASK_VERSION: TPtCnpd9zLfeKcUJ5IeFDw
FLYTE_INTERNAL_PROJECT: flytesnacks
FLYTE_INTERNAL_DOMAIN: development
FLYTE_INTERNAL_NAME: pytorch_example.mnist_pytorch_job
FLYTE_INTERNAL_VERSION: TPtCnpd9zLfeKcUJ5IeFDw
FLYTE_AWS_ENDPOINT: http://flyte-sandbox-minio.flyte:9000
FLYTE_AWS_ACCESS_KEY_ID: minio
FLYTE_AWS_SECRET_ACCESS_KEY: miniostorage
PYTHONUNBUFFERED: 1
MASTER_PORT: 23456
PET_MASTER_PORT: 23456
MASTER_ADDR: f591aa743583746998b7-fg3djuyi-0-master-0
PET_MASTER_ADDR: f591aa743583746998b7-fg3djuyi-0-master-0
WORLD_SIZE: 3
RANK: 1
PET_NPROC_PER_NODE: auto
PET_NODE_RANK: 1
PET_NNODES: 3
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lw7ww (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-lw7ww:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6m26s default-scheduler Successfully assigned flytesnacks-development/f591aa743583746998b7-fg3djuyi-0-worker-0 to 1002541774d8
Normal Pulling 6m26s kubelet Pulling image "alpine:3.10"
Normal Pulled 6m19s kubelet Successfully pulled image "alpine:3.10" in 6.911665252s
Normal Created 6m19s kubelet Created container init-pytorch
Normal Started 6m19s kubelet Started container init-pytorch
Normal Pulling 6m18s kubelet Pulling image "localhost:30000/torch-0611:latest"
Normal Pulled 6m18s kubelet Successfully pulled image "localhost:30000/torch-0611:latest" in 28.368833ms
Normal Created 6m18s kubelet Created container pytorch
Normal Started 6m18s kubelet Started container pytorch
Hi, thank you very much for looking into this. The version mismatch was the issue: I had used flytekit 1.12.0 to deploy the test workflow, while my Flyte backend was old, at v1.1.32. After switching to flytekit 1.2.12, the workers were set to the right number.
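For anyone hitting this, comparing client and backend versions before registering workflows is a quick sanity check (standard flytekit/flytectl entry points):

  # Client-side flytekit version
  pip show flytekit
  # flytectl version and, when configured against a cluster, the control-plane version
  flytectl version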
Nice, I can close the issue now! |
Original issue description:

Flyte version: v1.10.7
Kubeflow training operator version: the three most recent releases (1.6, 1.7, and the 1.8 pre-release)
I am running the MNIST example workflow, which uses the Kubeflow PyTorch operator (https://github.com/flyteorg/flytesnacks/blob/master/examples/kfpytorch_plugin/kfpytorch_plugin/pytorch_mnist.py).
I was able to run it without problems on Flyte v1.5: the Kubeflow PyTorch operator started one master node and two worker nodes in the Flyte project's namespace.
However, on v1.10.7, the worker nodes immediately enter a delete-and-recreate loop when the workflow starts. The difference I noticed is in the PyTorchJob CR that Flyte creates: the replica count configured for the Worker nodes. On Flyte v1.5 it was correctly set to 2, but on v1.10.7 it is set to 0, leading to the loop:
...
Worker:
replicas: 0
...
I have tried Kubeflow operator versions 1.6, 1.7, and the latest pre-release 1.8, but there isn't any change, so I believe the problem is in the plugin.
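For completeness, the operator version actually deployed can be read off the deployment image (deployment name taken from the pod dump earlier in the thread):

  # Shows the training-operator image tag currently running
  kubectl -n kubeflow get deployment training-operator \
    -o jsonpath='{.spec.template.spec.containers[0].image}'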