
[BUG] (Kubeflow) PyTorchPlugin sets Replicas to 0 causing infinite loop #5417

Closed

pwistnok opened this issue May 23, 2024 · 16 comments

@pwistnok
pwistnok commented May 23, 2024

Flyte version: v1.10.7
Kubeflow training operator version: the three most recent releases (see below)

I am running the MNIST example workflow, which uses the Kubeflow PyTorch operator: https://github.com/flyteorg/flytesnacks/blob/master/examples/kfpytorch_plugin/kfpytorch_plugin/pytorch_mnist.py
I was able to run it without problems on Flyte v1.5: the Kubeflow PyTorch operator started one master node and two worker nodes in the Flyte project's namespace.
On v1.10.7, however, the worker nodes enter a delete-and-recreate loop as soon as the workflow starts. The difference is in the PyTorchJob CR that Flyte creates: the replica count configured for the Worker nodes. On Flyte v1.5 it was correctly set to 2, but on v1.10.7 it is set to 0, leading to the loop.
...
Worker:
replicas: 0
...
I have tried versions 1.6, 1.7, and the latest 1.8 prerelease of the Kubeflow operator, but nothing changed, so I believe the problem is in the plugin.
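A quick way to confirm what Flyte wrote into the CR (a minimal sketch; the job name is a placeholder, and the namespace follows the usual <project>-<domain> convention):

# List the PyTorchJob CRs created for the execution.
kubectl -n flytesnacks-development get pytorchjobs

# Print just the Worker replica count from the CR spec; on the affected
# versions this shows 0 instead of the requested 2.
kubectl -n flytesnacks-development get pytorchjob <job-name> \
  -o jsonpath='{.spec.pytorchReplicaSpecs.Worker.replicas}'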

runllm bot commented May 23, 2024

The issue you're experiencing with the PyTorchPlugin setting replicas to 0, causing an infinite loop, is likely related to the configuration or deployment of the PyTorch operator in your Flyte setup. According to this discussion, ensuring the PyTorch operator is correctly deployed in your Kubernetes cluster is crucial. Additionally, there might be a need to check for any recent changes or refactoring in the plugin, as mentioned in another discussion.

Would you like more details on how to verify and potentially resolve this issue?


eapolinario self-assigned this May 23, 2024
@Future-Outlier
Member

I've tried this example, and it works well.

from flytekit import Resources, task
from flytekitplugins.kfpytorch import PyTorch, Worker

cpu_request = "500m"
mem_request = "500Mi"
gpu_request = "0"
mem_limit = "500Mi"
gpu_limit = "0"


@task(
    task_config=PyTorch(worker=Worker(replicas=2)),
    retries=2,
    # cache=True,
    # cache_version="0.1",
    requests=Resources(cpu=cpu_request, mem=mem_request, gpu=gpu_request),
    limits=Resources(mem=mem_limit, gpu=gpu_limit),
)
def mnist_pytorch_job() -> str:
    return "Hi"
The Dockerfile used for the task image:

FROM python:3.9-slim-buster
USER root
WORKDIR /root
ENV PYTHONPATH /root
RUN apt-get update && apt-get install -y build-essential git
RUN pip install flytekitplugins-kfpytorch
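
For reference, a minimal sketch of how the image might be built and the task run against the sandbox (the registry address and file name are assumptions, based on the image reference and task-module args in the CR dump later in this thread):

# Build and push to the sandbox's local registry.
docker build -t localhost:30000/torch-0611:latest .
docker push localhost:30000/torch-0611:latest

# Run the task remotely with the custom image.
pyflyte run --remote --image localhost:30000/torch-0611:latest \
  pytorch_example.py mnist_pytorch_job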

@eapolinario
Contributor

@Future-Outlier , can you confirm which version of Flyte and flytekit you were running?

@Future-Outlier
Member

> @Future-Outlier , can you confirm which version of Flyte and flytekit you were running?

Flyte: master branch with single binary
flytekit: 1.12.0
kubeflow training operator: kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"

@eapolinario
Contributor

@Future-Outlier , can you try this out with Flyte 1.10.7? You can start single-binary via flytectl and point to a release: flytectl demo start --version 1.10.7.

@Future-Outlier
Member

> @Future-Outlier , can you try this out with Flyte 1.10.7? You can start single-binary via flytectl and point to a release: flytectl demo start --version 1.10.7.

I can help, but this version fails
(screenshot of the failure attached)

@eapolinario
Contributor

@Future-Outlier , sorry for the late reply. We require the prefix v for the version. Can you try v1.10.7?

@Future-Outlier
Member

> @Future-Outlier , sorry for the late reply. We require the prefix v for the version. Can you try v1.10.7?

No problem, doing it now; I'll report the result in about an hour.

@Future-Outlier
Member

Hi, @eapolinario , it works.
(screenshot of the successful run attached)

Setup Process

  1. flytectl demo start --version v1.10.7 --disable-agent
  2. Edit the flyte-sandbox-config config map:
  001-plugins.yaml: |
    tasks:
      task-plugins:
        default-for-task-types:
          container: container
          container_array: k8s-array
          sidecar: sidecar
          pytorch: pytorch
        enabled-plugins:
        - container
        - sidecar
        - k8s-array
        - agent-service
        - pytorch
  3. Restart the flyte-sandbox deployment (see the commands sketched below).
  4. Build a Docker image for the PyTorch job (Dockerfile listed above).
  5. Specify the image built in step 4 and run the task on the remote cluster.
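The edit and restart in steps 2 and 3 can be done with standard kubectl commands, e.g. (a sketch, assuming the default sandbox namespace flyte):

# Step 2: open the sandbox config map for editing.
kubectl -n flyte edit configmap flyte-sandbox-config

# Step 3: restart the single-binary deployment so it picks up the new plugin list.
kubectl -n flyte rollout restart deployment/flyte-sandbox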

@eapolinario
Contributor

@Future-Outlier , which version of the kubeflow training operator are you running? Also, can you paste the pod definition here?

@Future-Outlier
Member

> @Future-Outlier , which version of the kubeflow training operator are you running? Also, can you paste the pod definition here?

Hi, @eapolinario
kubeflow training operator version: v1.10.7
pod definition:

(dev) future@outlier ~ % kubectl describe pod training-operator-984cfd546-2jn65 -n kubeflow
Name:             training-operator-984cfd546-2jn65
Namespace:        kubeflow
Priority:         0
Service Account:  training-operator
Node:             481e7e029920/172.17.0.2
Start Time:       Wed, 12 Jun 2024 09:39:45 +0800
Labels:           control-plane=kubeflow-training-operator
                  pod-template-hash=984cfd546
Annotations:      sidecar.istio.io/inject: false
Status:           Running
IP:               10.42.0.10
IPs:
  IP:           10.42.0.10
Controlled By:  ReplicaSet/training-operator-984cfd546
Containers:
  training-operator:
    Container ID:  containerd://1f46d23264f737a7f07c26d087b91ca890a5862fd9fcf2d0b8a95c479c5db343
    Image:         kubeflow/training-operator:v1-855e096
    Image ID:      docker.io/kubeflow/training-operator@sha256:725f0adb8910336625566b391bba35391d712c0ffff6a4be02863cebceaa7cf8
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /manager
    State:          Running
      Started:      Wed, 12 Jun 2024 09:39:56 +0800
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:8081/healthz delay=15s timeout=3s period=20s #success=1 #failure=3
    Readiness:      http-get http://:8081/readyz delay=10s timeout=3s period=15s #success=1 #failure=3
    Environment:
      MY_POD_NAMESPACE:  kubeflow (v1:metadata.namespace)
      MY_POD_NAME:       training-operator-984cfd546-2jn65 (v1:metadata.name)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tjwpd (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-tjwpd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  43s   default-scheduler  Successfully assigned kubeflow/training-operator-984cfd546-2jn65 to 481e7e029920
  Normal  Pulling    43s   kubelet            Pulling image "kubeflow/training-operator:v1-855e096"
  Normal  Pulled     32s   kubelet            Successfully pulled image "kubeflow/training-operator:v1-855e096" in 10.99020538s
  Normal  Created    32s   kubelet            Created container training-operator
  Normal  Started    32s   kubelet            Started container training-operator

@eapolinario
Contributor

eapolinario commented Jun 12, 2024

@Future-Outlier , thanks for being so thorough. As a final step, can you paste the pytorchjob CR object created as part of the pytorch job and also the task pod? I just want to make sure the values are reflected there.

@Future-Outlier
Member

@eapolinario

pytorchjob CR object created

(dev) future@outlier ~ % kubectl get crd pytorchjobs.kubeflow.org
NAME                       CREATED AT
pytorchjobs.kubeflow.org   2024-06-13T01:59:15Z
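
The CR itself (presumably the output of kubectl describe pytorchjob f591aa743583746998b7-fg3djuyi-0 -n flytesnacks-development):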
Name:         f591aa743583746998b7-fg3djuyi-0
Namespace:    flytesnacks-development
Labels:       domain=development
              execution-id=f591aa743583746998b7
              interruptible=false
              node-id=pytorchexamplemnistpytorchjob
              project=flytesnacks
              shard-key=22
              task-name=pytorch-example-mnist-pytorch-job
              workflow-name=flytegen-pytorch-example-mnist-pytorch-job
Annotations:  cluster-autoscaler.kubernetes.io/safe-to-evict: false
API Version:  kubeflow.org/v1
Kind:         PyTorchJob
Metadata:
  Creation Timestamp:  2024-06-13T02:04:25Z
  Generation:          1
  Owner References:
    API Version:           flyte.lyft.com/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  flyteworkflow
    Name:                  f591aa743583746998b7
    UID:                   4d6c99c5-81c3-4ef3-9617-d3435fe06bc3
  Resource Version:        1038
  UID:                     ca0f76c8-b3d1-4f85-8f0f-8b9dff9d99d2
Spec:
  Pytorch Replica Specs:
    Master:
      Replicas:        1
      Restart Policy:  Never
      Template:
        Metadata:
          Annotations:
            cluster-autoscaler.kubernetes.io/safe-to-evict:  false
          Labels:
            Domain:           development
            Execution - Id:   f591aa743583746998b7
            Interruptible:    false
            Node - Id:        pytorchexamplemnistpytorchjob
            Project:          flytesnacks
            Shard - Key:      22
            Task - Name:      pytorch-example-mnist-pytorch-job
            Workflow - Name:  flytegen-pytorch-example-mnist-pytorch-job
        Spec:
          Affinity:
          Containers:
            Args:
              pyflyte-fast-execute
              --additional-distribution
              s3://my-s3-bucket/flytesnacks/development/ITAMN37CAV3JQW7JGLGIN66WMI======/script_mode.tar.gz
              --dest-dir
              .
              --
              pyflyte-execute
              --inputs
              s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/inputs.pb
              --output-prefix
              s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/0
              --raw-output-data-prefix
              s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0
              --checkpoint-path
              s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0/_flytecheckpoints
              --prev-checkpoint
              ""
              --resolver
              flytekit.core.python_auto_container.default_task_resolver
              --
              task-module
              pytorch_example
              task-name
              mnist_pytorch_job
            Env:
              Name:   FLYTE_INTERNAL_EXECUTION_WORKFLOW
              Value:  flytesnacks:development:.flytegen.pytorch_example.mnist_pytorch_job
              Name:   FLYTE_INTERNAL_EXECUTION_ID
              Value:  f591aa743583746998b7
              Name:   FLYTE_INTERNAL_EXECUTION_PROJECT
              Value:  flytesnacks
              Name:   FLYTE_INTERNAL_EXECUTION_DOMAIN
              Value:  development
              Name:   FLYTE_ATTEMPT_NUMBER
              Value:  0
              Name:   FLYTE_INTERNAL_TASK_PROJECT
              Value:  flytesnacks
              Name:   FLYTE_INTERNAL_TASK_DOMAIN
              Value:  development
              Name:   FLYTE_INTERNAL_TASK_NAME
              Value:  pytorch_example.mnist_pytorch_job
              Name:   FLYTE_INTERNAL_TASK_VERSION
              Value:  TPtCnpd9zLfeKcUJ5IeFDw
              Name:   FLYTE_INTERNAL_PROJECT
              Value:  flytesnacks
              Name:   FLYTE_INTERNAL_DOMAIN
              Value:  development
              Name:   FLYTE_INTERNAL_NAME
              Value:  pytorch_example.mnist_pytorch_job
              Name:   FLYTE_INTERNAL_VERSION
              Value:  TPtCnpd9zLfeKcUJ5IeFDw
              Name:   FLYTE_AWS_SECRET_ACCESS_KEY
              Value:  miniostorage
              Name:   FLYTE_AWS_ENDPOINT
              Value:  http://flyte-sandbox-minio.flyte:9000
              Name:   FLYTE_AWS_ACCESS_KEY_ID
              Value:  minio
            Image:    localhost:30000/torch-0611:latest
            Name:     pytorch
            Resources:
              Limits:
                Cpu:     500m
                Memory:  500Mi
              Requests:
                Cpu:                     500m
                Memory:                  500Mi
            Termination Message Policy:  FallbackToLogsOnError
          Restart Policy:                Never
    Worker:
      Replicas:        2
      Restart Policy:  Never
      Template:
        Metadata:
          Annotations:
            cluster-autoscaler.kubernetes.io/safe-to-evict:  false
          Labels:
            Domain:           development
            Execution - Id:   f591aa743583746998b7
            Interruptible:    false
            Node - Id:        pytorchexamplemnistpytorchjob
            Project:          flytesnacks
            Shard - Key:      22
            Task - Name:      pytorch-example-mnist-pytorch-job
            Workflow - Name:  flytegen-pytorch-example-mnist-pytorch-job
        Spec:
          Affinity:
          Containers:
            Args:
              pyflyte-fast-execute
              --additional-distribution
              s3://my-s3-bucket/flytesnacks/development/ITAMN37CAV3JQW7JGLGIN66WMI======/script_mode.tar.gz
              --dest-dir
              .
              --
              pyflyte-execute
              --inputs
              s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/inputs.pb
              --output-prefix
              s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/0
              --raw-output-data-prefix
              s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0
              --checkpoint-path
              s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0/_flytecheckpoints
              --prev-checkpoint
              ""
              --resolver
              flytekit.core.python_auto_container.default_task_resolver
              --
              task-module
              pytorch_example
              task-name
              mnist_pytorch_job
            Env:
              Name:   FLYTE_INTERNAL_EXECUTION_WORKFLOW
              Value:  flytesnacks:development:.flytegen.pytorch_example.mnist_pytorch_job
              Name:   FLYTE_INTERNAL_EXECUTION_ID
              Value:  f591aa743583746998b7
              Name:   FLYTE_INTERNAL_EXECUTION_PROJECT
              Value:  flytesnacks
              Name:   FLYTE_INTERNAL_EXECUTION_DOMAIN
              Value:  development
              Name:   FLYTE_ATTEMPT_NUMBER
              Value:  0
              Name:   FLYTE_INTERNAL_TASK_PROJECT
              Value:  flytesnacks
              Name:   FLYTE_INTERNAL_TASK_DOMAIN
              Value:  development
              Name:   FLYTE_INTERNAL_TASK_NAME
              Value:  pytorch_example.mnist_pytorch_job
              Name:   FLYTE_INTERNAL_TASK_VERSION
              Value:  TPtCnpd9zLfeKcUJ5IeFDw
              Name:   FLYTE_INTERNAL_PROJECT
              Value:  flytesnacks
              Name:   FLYTE_INTERNAL_DOMAIN
              Value:  development
              Name:   FLYTE_INTERNAL_NAME
              Value:  pytorch_example.mnist_pytorch_job
              Name:   FLYTE_INTERNAL_VERSION
              Value:  TPtCnpd9zLfeKcUJ5IeFDw
              Name:   FLYTE_AWS_ENDPOINT
              Value:  http://flyte-sandbox-minio.flyte:9000
              Name:   FLYTE_AWS_ACCESS_KEY_ID
              Value:  minio
              Name:   FLYTE_AWS_SECRET_ACCESS_KEY
              Value:  miniostorage
            Image:    localhost:30000/torch-0611:latest
            Name:     pytorch
            Resources:
              Limits:
                Cpu:     500m
                Memory:  500Mi
              Requests:
                Cpu:                     500m
                Memory:                  500Mi
            Termination Message Policy:  FallbackToLogsOnError
          Restart Policy:                Never
  Run Policy:
    Suspend:  false
Status:
  Completion Time:  2024-06-13T02:04:41Z
  Conditions:
    Last Transition Time:  2024-06-13T02:04:25Z
    Last Update Time:      2024-06-13T02:04:25Z
    Message:               PyTorchJob f591aa743583746998b7-fg3djuyi-0 is created.
    Reason:                PyTorchJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2024-06-13T02:04:27Z
    Last Update Time:      2024-06-13T02:04:27Z
    Message:               PyTorchJob f591aa743583746998b7-fg3djuyi-0 is running.
    Reason:                PyTorchJobRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2024-06-13T02:04:41Z
    Last Update Time:      2024-06-13T02:04:41Z
    Message:               PyTorchJob f591aa743583746998b7-fg3djuyi-0 is successfully completed.
    Reason:                PyTorchJobSucceeded
    Status:                True
    Type:                  Succeeded
  Replica Statuses:
    Master:
      Selector:   training.kubeflow.org/job-name=f591aa743583746998b7-fg3djuyi-0,training.kubeflow.org/operator-name=pytorchjob-controller,training.kubeflow.org/replica-type=master
      Succeeded:  1
    Worker:
      Succeeded:  2
  Start Time:     2024-06-13T02:04:26Z
Events:
  Type     Reason                          Age                From                   Message
  ----     ------                          ----               ----                   -------
  Normal   SuccessfulCreatePod             16m                pytorchjob-controller  Created pod: f591aa743583746998b7-fg3djuyi-0-master-0
  Normal   SuccessfulCreateService         16m                pytorchjob-controller  Created service: f591aa743583746998b7-fg3djuyi-0-master-0
  Normal   SuccessfulCreatePod             16m                pytorchjob-controller  Created pod: f591aa743583746998b7-fg3djuyi-0-worker-0
  Warning  SettedPodTemplateRestartPolicy  16m (x3 over 16m)  pytorchjob-controller  Restart policy in pod template will be overwritten by restart policy in replica spec
  Normal   SuccessfulCreatePod             16m                pytorchjob-controller  Created pod: f591aa743583746998b7-fg3djuyi-0-worker-1
  Normal   SuccessfulCreateService         16m                pytorchjob-controller  Created service: f591aa743583746998b7-fg3djuyi-0-worker-0
  Normal   SuccessfulCreateService         16m                pytorchjob-controller  Created service: f591aa743583746998b7-fg3djuyi-0-worker-1
  Normal   ExitedWithCode                  16m (x3 over 16m)  pytorchjob-controller  Pod: flytesnacks-development.f591aa743583746998b7-fg3djuyi-0-master-0 exited with code 0
  Normal   ExitedWithCode                  16m (x2 over 16m)  pytorchjob-controller  Pod: flytesnacks-development.f591aa743583746998b7-fg3djuyi-0-worker-0 exited with code 0
  Normal   PyTorchJobSucceeded             16m                pytorchjob-controller  PyTorchJob f591aa743583746998b7-fg3djuyi-0 is successfully completed.

pytorch job task pod

master pod

(dev) future@outlier ~ % kubectl describe pod f591aa743583746998b7-fg3djuyi-0-master-0 -n flytesnacks-development
Name:             f591aa743583746998b7-fg3djuyi-0-master-0
Namespace:        flytesnacks-development
Priority:         0
Service Account:  default
Node:             1002541774d8/172.17.0.2
Start Time:       Thu, 13 Jun 2024 10:04:25 +0800
Labels:           domain=development
                  execution-id=f591aa743583746998b7
                  interruptible=false
                  node-id=pytorchexamplemnistpytorchjob
                  project=flytesnacks
                  shard-key=22
                  task-name=pytorch-example-mnist-pytorch-job
                  training.kubeflow.org/job-name=f591aa743583746998b7-fg3djuyi-0
                  training.kubeflow.org/job-role=master
                  training.kubeflow.org/operator-name=pytorchjob-controller
                  training.kubeflow.org/replica-index=0
                  training.kubeflow.org/replica-type=master
                  workflow-name=flytegen-pytorch-example-mnist-pytorch-job
Annotations:      cluster-autoscaler.kubernetes.io/safe-to-evict: false
Status:           Succeeded
IP:               10.42.0.14
IPs:
  IP:           10.42.0.14
Controlled By:  PyTorchJob/f591aa743583746998b7-fg3djuyi-0
Containers:
  pytorch:
    Container ID:  containerd://53ff705a7daf13acdd4a29828d4eac82ff3ee64d298fa5d6807643dfd0768ffa
    Image:         localhost:30000/torch-0611:latest
    Image ID:      localhost:30000/torch-0611@sha256:f3d76504e47fa1950347721ec159083870494c1602fbfeb9ed86dacf5a6a4d83
    Port:          23456/TCP
    Host Port:     0/TCP
    Args:
      pyflyte-fast-execute
      --additional-distribution
      s3://my-s3-bucket/flytesnacks/development/ITAMN37CAV3JQW7JGLGIN66WMI======/script_mode.tar.gz
      --dest-dir
      .
      --
      pyflyte-execute
      --inputs
      s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/inputs.pb
      --output-prefix
      s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/0
      --raw-output-data-prefix
      s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0
      --checkpoint-path
      s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0/_flytecheckpoints
      --prev-checkpoint
      ""
      --resolver
      flytekit.core.python_auto_container.default_task_resolver
      --
      task-module
      pytorch_example
      task-name
      mnist_pytorch_job
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 13 Jun 2024 10:04:26 +0800
      Finished:     Thu, 13 Jun 2024 10:04:38 +0800
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  500Mi
    Requests:
      cpu:     500m
      memory:  500Mi
    Environment:
      FLYTE_INTERNAL_EXECUTION_WORKFLOW:  flytesnacks:development:.flytegen.pytorch_example.mnist_pytorch_job
      FLYTE_INTERNAL_EXECUTION_ID:        f591aa743583746998b7
      FLYTE_INTERNAL_EXECUTION_PROJECT:   flytesnacks
      FLYTE_INTERNAL_EXECUTION_DOMAIN:    development
      FLYTE_ATTEMPT_NUMBER:               0
      FLYTE_INTERNAL_TASK_PROJECT:        flytesnacks
      FLYTE_INTERNAL_TASK_DOMAIN:         development
      FLYTE_INTERNAL_TASK_NAME:           pytorch_example.mnist_pytorch_job
      FLYTE_INTERNAL_TASK_VERSION:        TPtCnpd9zLfeKcUJ5IeFDw
      FLYTE_INTERNAL_PROJECT:             flytesnacks
      FLYTE_INTERNAL_DOMAIN:              development
      FLYTE_INTERNAL_NAME:                pytorch_example.mnist_pytorch_job
      FLYTE_INTERNAL_VERSION:             TPtCnpd9zLfeKcUJ5IeFDw
      FLYTE_AWS_SECRET_ACCESS_KEY:        miniostorage
      FLYTE_AWS_ENDPOINT:                 http://flyte-sandbox-minio.flyte:9000
      FLYTE_AWS_ACCESS_KEY_ID:            minio
      PYTHONUNBUFFERED:                   1
      MASTER_PORT:                        23456
      PET_MASTER_PORT:                    23456
      MASTER_ADDR:                        f591aa743583746998b7-fg3djuyi-0-master-0
      PET_MASTER_ADDR:                    f591aa743583746998b7-fg3djuyi-0-master-0
      WORLD_SIZE:                         3
      RANK:                               0
      PET_NPROC_PER_NODE:                 auto
      PET_NODE_RANK:                      0
      PET_NNODES:                         3
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-46gf5 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-46gf5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  5m23s  default-scheduler  Successfully assigned flytesnacks-development/f591aa743583746998b7-fg3djuyi-0-master-0 to 1002541774d8
  Normal  Pulling    5m23s  kubelet            Pulling image "localhost:30000/torch-0611:latest"
  Normal  Pulled     5m23s  kubelet            Successfully pulled image "localhost:30000/torch-0611:latest" in 8.59375ms
  Normal  Created    5m23s  kubelet            Created container pytorch
  Normal  Started    5m23s  kubelet            Started container pytorch

worker pod

(dev) future@outlier ~ % kubectl describe pod f591aa743583746998b7-fg3djuyi-0-worker-0  -n flytesnacks-development
Name:             f591aa743583746998b7-fg3djuyi-0-worker-0
Namespace:        flytesnacks-development
Priority:         0
Service Account:  default
Node:             1002541774d8/172.17.0.2
Start Time:       Thu, 13 Jun 2024 10:04:25 +0800
Labels:           domain=development
                  execution-id=f591aa743583746998b7
                  interruptible=false
                  node-id=pytorchexamplemnistpytorchjob
                  project=flytesnacks
                  shard-key=22
                  task-name=pytorch-example-mnist-pytorch-job
                  training.kubeflow.org/job-name=f591aa743583746998b7-fg3djuyi-0
                  training.kubeflow.org/operator-name=pytorchjob-controller
                  training.kubeflow.org/replica-index=0
                  training.kubeflow.org/replica-type=worker
                  workflow-name=flytegen-pytorch-example-mnist-pytorch-job
Annotations:      cluster-autoscaler.kubernetes.io/safe-to-evict: false
Status:           Succeeded
IP:               10.42.0.15
IPs:
  IP:           10.42.0.15
Controlled By:  PyTorchJob/f591aa743583746998b7-fg3djuyi-0
Init Containers:
  init-pytorch:
    Container ID:  containerd://8f5cdc0d84ded4e3a22a62a85ef98981e970d2f0ef67feea4afb5e240eabb044
    Image:         alpine:3.10
    Image ID:      docker.io/library/alpine@sha256:451eee8bedcb2f029756dc3e9d73bab0e7943c1ac55cff3a4861c52a0fdd3e98
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      err=1;for i in $(seq 100); do if nslookup f591aa743583746998b7-fg3djuyi-0-master-0; then err=0 && break; fi;echo waiting for master; sleep 2; done; exit $err
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 13 Jun 2024 10:04:33 +0800
      Finished:     Thu, 13 Jun 2024 10:04:33 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  20Mi
    Requests:
      cpu:        50m
      memory:     10Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lw7ww (ro)
Containers:
  pytorch:
    Container ID:  containerd://182ef52d6b7b456586f9a2d37685bd202fe76ff7e094ec50242047f9604b741b
    Image:         localhost:30000/torch-0611:latest
    Image ID:      localhost:30000/torch-0611@sha256:f3d76504e47fa1950347721ec159083870494c1602fbfeb9ed86dacf5a6a4d83
    Port:          23456/TCP
    Host Port:     0/TCP
    Args:
      pyflyte-fast-execute
      --additional-distribution
      s3://my-s3-bucket/flytesnacks/development/ITAMN37CAV3JQW7JGLGIN66WMI======/script_mode.tar.gz
      --dest-dir
      .
      --
      pyflyte-execute
      --inputs
      s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/inputs.pb
      --output-prefix
      s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f591aa743583746998b7/pytorchexamplemnistpytorchjob/data/0
      --raw-output-data-prefix
      s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0
      --checkpoint-path
      s3://my-s3-bucket/data/bd/f591aa743583746998b7-fg3djuyi-0/_flytecheckpoints
      --prev-checkpoint
      ""
      --resolver
      flytekit.core.python_auto_container.default_task_resolver
      --
      task-module
      pytorch_example
      task-name
      mnist_pytorch_job
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 13 Jun 2024 10:04:34 +0800
      Finished:     Thu, 13 Jun 2024 10:04:41 +0800
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  500Mi
    Requests:
      cpu:     500m
      memory:  500Mi
    Environment:
      FLYTE_INTERNAL_EXECUTION_WORKFLOW:  flytesnacks:development:.flytegen.pytorch_example.mnist_pytorch_job
      FLYTE_INTERNAL_EXECUTION_ID:        f591aa743583746998b7
      FLYTE_INTERNAL_EXECUTION_PROJECT:   flytesnacks
      FLYTE_INTERNAL_EXECUTION_DOMAIN:    development
      FLYTE_ATTEMPT_NUMBER:               0
      FLYTE_INTERNAL_TASK_PROJECT:        flytesnacks
      FLYTE_INTERNAL_TASK_DOMAIN:         development
      FLYTE_INTERNAL_TASK_NAME:           pytorch_example.mnist_pytorch_job
      FLYTE_INTERNAL_TASK_VERSION:        TPtCnpd9zLfeKcUJ5IeFDw
      FLYTE_INTERNAL_PROJECT:             flytesnacks
      FLYTE_INTERNAL_DOMAIN:              development
      FLYTE_INTERNAL_NAME:                pytorch_example.mnist_pytorch_job
      FLYTE_INTERNAL_VERSION:             TPtCnpd9zLfeKcUJ5IeFDw
      FLYTE_AWS_ENDPOINT:                 http://flyte-sandbox-minio.flyte:9000
      FLYTE_AWS_ACCESS_KEY_ID:            minio
      FLYTE_AWS_SECRET_ACCESS_KEY:        miniostorage
      PYTHONUNBUFFERED:                   1
      MASTER_PORT:                        23456
      PET_MASTER_PORT:                    23456
      MASTER_ADDR:                        f591aa743583746998b7-fg3djuyi-0-master-0
      PET_MASTER_ADDR:                    f591aa743583746998b7-fg3djuyi-0-master-0
      WORLD_SIZE:                         3
      RANK:                               1
      PET_NPROC_PER_NODE:                 auto
      PET_NODE_RANK:                      1
      PET_NNODES:                         3
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lw7ww (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-lw7ww:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  6m26s  default-scheduler  Successfully assigned flytesnacks-development/f591aa743583746998b7-fg3djuyi-0-worker-0 to 1002541774d8
  Normal  Pulling    6m26s  kubelet            Pulling image "alpine:3.10"
  Normal  Pulled     6m19s  kubelet            Successfully pulled image "alpine:3.10" in 6.911665252s
  Normal  Created    6m19s  kubelet            Created container init-pytorch
  Normal  Started    6m19s  kubelet            Started container init-pytorch
  Normal  Pulling    6m18s  kubelet            Pulling image "localhost:30000/torch-0611:latest"
  Normal  Pulled     6m18s  kubelet            Successfully pulled image "localhost:30000/torch-0611:latest" in 28.368833ms
  Normal  Created    6m18s  kubelet            Created container pytorch
  Normal  Started    6m18s  kubelet            Started container pytorch

@pwistnok
Author

Hi, thank you very much for looking into this. The version was indeed the issue: I used flytekit 1.12.0 to deploy the test workflow, but my Flyte backend was old, at v1.1.32. With flytekit 1.2.12, the workers were set to the right number.
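
For anyone else landing here: the fix is pinning flytekit (and the kfpytorch plugin) to a release compatible with the backend. A minimal sketch, assuming the plugin is versioned in lockstep with flytekit (exact pins depend on your deployment):

# Pin flytekit and the kfpytorch plugin to a backend-compatible release;
# the versions below are illustrative, taken from the comment above.
pip install "flytekit==1.2.12" "flytekitplugins-kfpytorch==1.2.12"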

@Future-Outlier
Member

Nice, I can close the issue now!
