
Unable to provision 7g.40gb slice on A100 40GB #285

Open
bharathappali opened this issue Nov 27, 2024 · 4 comments
@bharathappali

I was trying to create dynamic slices with InstaSlice on an OpenShift cluster that has a node with 4 A100 GPUs. I found that InstaSlice creates a MIG slice for any profile smaller than 7g.40gb, but it is unable to create a MIG slice for 7g.40gb.

I have tried the same workload with the 7g.40gb and 4g.20gb slices; the details are below.

InstaSlice image built from the release-4.19 branch:

[abharath@abharath-thinkpadt14sgen2i instaslice-operator]$ git branch
  main
* release-4.19

Node allocatable resources:

Allocatable:
  cpu:                                             127500m
  ephemeral-storage:                               430324950326
  hugepages-1Gi:                                   0
  hugepages-2Mi:                                   0
  instaslice.redhat.com/accelerator-memory-quota:  160Gi
  instaslice.redhat.com/mig-1g.10gb:               16
  instaslice.redhat.com/mig-1g.5gb:                28
  instaslice.redhat.com/mig-1g.5gb+me:             28
  instaslice.redhat.com/mig-2g.10gb:               12
  instaslice.redhat.com/mig-3g.20gb:               8
  instaslice.redhat.com/mig-4g.20gb:               4
  instaslice.redhat.com/mig-7g.40gb:               4
  memory:                                          1055311156Ki
  nvidia.com/gpu:                                  0
  nvidia.com/mig-3g.20gb:                          0
  nvidia.com/mig-4g.20gb:                          0
  pods:                                            250

Instaslice controller logs:

{"level":"info","ts":"2024-11-27T09:22:26.337021649Z","caller":"controller/instaslice_controller.go:443","msg":"no suitable node found in cluster for ","controller":"InstaSlice-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"human-eval-deployment-job-j84fc","namespace":"kruize-gpu-rec-apply"},"namespace":"kruize-gpu-rec-apply","name":"human-eval-deployment-job-j84fc","reconcileID":"4991cd22-fe6f-4832-b820-a856ea5f01da","pod":"human-eval-deployment-job-j84fc"}
{"level":"info","ts":"2024-11-27T09:22:36.337786468Z","caller":"controller/capacity.go:48","msg":"cpu request obtained ","controller":"InstaSlice-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"human-eval-deployment-job-j84fc","namespace":"kruize-gpu-rec-apply"},"namespace":"kruize-gpu-rec-apply","name":"human-eval-deployment-job-j84fc","reconcileID":"a1cecafe-2a40-414b-8b22-38ad0698d2ea","pod":"human-eval-deployment-job-j84fc","value":2}
{"level":"info","ts":"2024-11-27T09:22:36.337871857Z","caller":"controller/capacity.go:56","msg":"memory request obtained ","controller":"InstaSlice-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"human-eval-deployment-job-j84fc","namespace":"kruize-gpu-rec-apply"},"namespace":"kruize-gpu-rec-apply","name":"human-eval-deployment-job-j84fc","reconcileID":"a1cecafe-2a40-414b-8b22-38ad0698d2ea","pod":"human-eval-deployment-job-j84fc","value":4294967296}
{"level":"info","ts":"2024-11-27T09:22:36.337903101Z","caller":"controller/instaslice_controller.go:443","msg":"no suitable node found in cluster for ","controller":"InstaSlice-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"human-eval-deployment-job-j84fc","namespace":"kruize-gpu-rec-apply"},"namespace":"kruize-gpu-rec-apply","name":"human-eval-deployment-job-j84fc","reconcileID":"a1cecafe-2a40-414b-8b22-38ad0698d2ea","pod":"human-eval-deployment-job-j84fc"}
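For context, here is a hedged sketch of one possible explanation for the "no suitable node found" result, assuming the controller performs a per-GPU free-slice check (the function and names below are illustrative, not InstaSlice's actual code): an A100 40GB exposes 7 compute slices, and the 7g.40gb profile consumes all 7, so it only fits on a GPU with no other MIG slices allocated, while smaller profiles can share a partially used GPU.

```python
# Illustrative sketch (not InstaSlice code): MIG compute-slice accounting on
# an A100 40GB. Each profile consumes a fixed number of the GPU's 7 slices.
PROFILE_SLICES = {"1g.5gb": 1, "2g.10gb": 2, "3g.20gb": 3, "4g.20gb": 4, "7g.40gb": 7}

def fits_on_gpu(profile: str, used_slices: int, total_slices: int = 7) -> bool:
    """Return True if the profile's slice count fits in the GPU's free slices."""
    return PROFILE_SLICES[profile] <= total_slices - used_slices

# On a GPU that already hosts even a single 1g.5gb slice:
print(fits_on_gpu("4g.20gb", used_slices=1))  # True:  4 <= 6 free slices
print(fits_on_gpu("7g.40gb", used_slices=1))  # False: needs all 7 slices free
```

Under this assumption, the node allocatable count of `instaslice.redhat.com/mig-7g.40gb: 4` would reflect advertised capacity (one per GPU), but placement would still fail if every GPU already carries at least one other slice.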

Workload YAML:

apiVersion: v1
kind: Namespace
metadata:
  name: kruize-gpu-rec-apply
---
kind: Job
apiVersion: batch/v1
metadata:
  name: human-eval-deployment-job
  namespace: kruize-gpu-rec-apply
spec:
  template:
    spec:
      containers:
        - name: human-eval-benchmark
          image: 'quay.io/kruizehub/human-eval-deployment:latest'
          env:
            - name: num_prompts
              value: '20000' 
          resources:
            requests:
              cpu: 2
              memory: 4Gi
              nvidia.com/mig-7g.40gb: 1
            limits:
              cpu: 2
              memory: 4Gi
              nvidia.com/mig-7g.40gb: 1 
          volumeMounts:
            - name: cache-volume
              mountPath: /.cache/huggingface
          imagePullPolicy: IfNotPresent
      restartPolicy: Never
      volumes:
        - name: cache-volume
          persistentVolumeClaim:
            claimName: cache-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cache-pvc
  namespace: kruize-gpu-rec-apply
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi

Workload status:

[abharath@abharath-thinkpadt14sgen2i nerc]$ oc get pods -n kruize-gpu-rec-apply
NAME                              READY   STATUS            RESTARTS   AGE
human-eval-deployment-job-j84fc   0/1     SchedulingGated   0          5s

Pod describe output:

[abharath@abharath-thinkpadt14sgen2i nerc]$ oc describe pod human-eval-deployment-job-j84fc -n kruize-gpu-rec-apply
Name:             human-eval-deployment-job-j84fc
Namespace:        kruize-gpu-rec-apply
Priority:         0
Service Account:  default
Node:             <none>
Labels:           batch.kubernetes.io/controller-uid=6663a528-6986-4fb6-8567-c7bb5a913ddf
                  batch.kubernetes.io/job-name=human-eval-deployment-job
                  controller-uid=6663a528-6986-4fb6-8567-c7bb5a913ddf
                  job-name=human-eval-deployment-job
Annotations:      openshift.io/scc: restricted-v2
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Pending
SeccompProfile:   RuntimeDefault
IP:               
IPs:              <none>
Controlled By:    Job/human-eval-deployment-job
Containers:
  human-eval-benchmark:
    Image:      quay.io/kruizehub/human-eval-deployment:latest
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:                                             2
      instaslice.redhat.com/accelerator-memory-quota:  40Gi
      instaslice.redhat.com/mig-7g.40gb:               1
      memory:                                          4Gi
    Requests:
      cpu:                                             2
      instaslice.redhat.com/accelerator-memory-quota:  40Gi
      instaslice.redhat.com/mig-7g.40gb:               1
      memory:                                          4Gi
    Environment Variables from:
      982d3f10-36c6-4bca-b8a0-6b596a353e83  ConfigMap  Optional: false
    Environment:
      num_prompts:  20000
    Mounts:
      /.cache/huggingface from cache-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-smdnc (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  cache-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  cache-pvc
    ReadOnly:   false
  kube-api-access-smdnc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

Note: it works if I change nvidia.com/mig-7g.40gb: 1 in the requests and limits to nvidia.com/mig-4g.20gb: 1.

Controller logs from the 4g.20gb attempt:

{"level":"info","ts":"2024-11-27T09:23:28.514660632Z","caller":"controller/instaslice_controller.go:203","msg":"finalizer deleted for failed for ","controller":"InstaSlice-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"human-eval-deployment-job-j84fc","namespace":"kruize-gpu-rec-apply"},"namespace":"kruize-gpu-rec-apply","name":"human-eval-deployment-job-j84fc","reconcileID":"27a11b3d-f8af-4fb2-9c2e-28d9bff3f2f8","pod":"human-eval-deployment-job-j84fc"}
{"level":"info","ts":"2024-11-27T09:24:15.329605712Z","caller":"controller/capacity.go:48","msg":"cpu request obtained ","controller":"InstaSlice-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"human-eval-deployment-job-xs4tz","namespace":"kruize-gpu-rec-apply"},"namespace":"kruize-gpu-rec-apply","name":"human-eval-deployment-job-xs4tz","reconcileID":"6a951761-a3e8-44ae-b15e-cff383da295d","pod":"human-eval-deployment-job-xs4tz","value":2}
{"level":"info","ts":"2024-11-27T09:24:15.329719412Z","caller":"controller/capacity.go:56","msg":"memory request obtained ","controller":"InstaSlice-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"human-eval-deployment-job-xs4tz","namespace":"kruize-gpu-rec-apply"},"namespace":"kruize-gpu-rec-apply","name":"human-eval-deployment-job-xs4tz","reconcileID":"6a951761-a3e8-44ae-b15e-cff383da295d","pod":"human-eval-deployment-job-xs4tz","value":4294967296}
{"level":"info","ts":"2024-11-27T09:24:15.329761712Z","caller":"controller/instaslice_controller.go:427","msg":"allocation obtained for ","controller":"InstaSlice-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"human-eval-deployment-job-xs4tz","namespace":"kruize-gpu-rec-apply"},"namespace":"kruize-gpu-rec-apply","name":"human-eval-deployment-job-xs4tz","reconcileID":"6a951761-a3e8-44ae-b15e-cff383da295d","pod":"human-eval-deployment-job-xs4tz"}
{"level":"error","ts":"2024-11-27T09:24:15.594604284Z","caller":"controller/instaslice_controller.go:763","msg":"error ungating pod","controller":"InstaSlice-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"human-eval-deployment-job-xs4tz","namespace":"kruize-gpu-rec-apply"},"namespace":"kruize-gpu-rec-apply","name":"human-eval-deployment-job-xs4tz","reconcileID":"349e816d-b02e-4ba0-ba89-dc438e619177","error":"Operation cannot be fulfilled on pods \"human-eval-deployment-job-xs4tz\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/openshift/instaslice-operator/internal/controller.(*InstasliceReconciler).addNodeSelectorAndUngatePod\n\t/workspace/internal/controller/instaslice_controller.go:763\ngithub.com/openshift/instaslice-operator/internal/controller.(*InstasliceReconciler).Reconcile\n\t/workspace/internal/controller/instaslice_controller.go:400\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:116\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:303\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:224"}
{"level":"info","ts":"2024-11-27T09:24:15.59478036Z","caller":"controller/controller.go:314","msg":"Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler","controller":"InstaSlice-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"human-eval-deployment-job-xs4tz","namespace":"kruize-gpu-rec-apply"},"namespace":"kruize-gpu-rec-apply","name":"human-eval-deployment-job-xs4tz","reconcileID":"349e816d-b02e-4ba0-ba89-dc438e619177"}
{"level":"error","ts":"2024-11-27T09:24:15.594807119Z","caller":"controller/controller.go:316","msg":"Reconciler error","controller":"InstaSlice-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"human-eval-deployment-job-xs4tz","namespace":"kruize-gpu-rec-apply"},"namespace":"kruize-gpu-rec-apply","name":"human-eval-deployment-job-xs4tz","reconcileID":"349e816d-b02e-4ba0-ba89-dc438e619177","error":"Operation cannot be fulfilled on pods \"human-eval-deployment-job-xs4tz\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:224"}

DaemonSet logs after applying 4g.20gb:

{"level":"info","ts":"2024-11-27T09:24:15.345122243Z","caller":"controller/instaslice_daemonset.go:162","msg":"creating allocation for ","controller":"InstaSliceDaemonSet","controllerGroup":"inference.redhat.com","controllerKind":"Instaslice","Instaslice":{"name":"wrk-5","namespace":"instaslice-system"},"namespace":"instaslice-system","name":"wrk-5","reconcileID":"ff2dd559-f963-41fd-9a6b-c8ab7761ebc5","pod":"human-eval-deployment-job-xs4tz"}
{"level":"info","ts":"2024-11-27T09:24:15.345971011Z","caller":"controller/instaslice_daemonset.go:221","msg":"The profile id is","controller":"InstaSliceDaemonSet","controllerGroup":"inference.redhat.com","controllerKind":"Instaslice","Instaslice":{"name":"wrk-5","namespace":"instaslice-system"},"namespace":"instaslice-system","name":"wrk-5","reconcileID":"ff2dd559-f963-41fd-9a6b-c8ab7761ebc5","giProfileInfo":5,"Memory":19968,"pod":"b218f0de-b0b3-43b2-ba3e-06129faae25e"}
{"level":"info","ts":"2024-11-27T09:24:15.34631395Z","caller":"controller/instaslice_daemonset.go:881","msg":"creating slice for","controller":"InstaSliceDaemonSet","controllerGroup":"inference.redhat.com","controllerKind":"Instaslice","Instaslice":{"name":"wrk-5","namespace":"instaslice-system"},"namespace":"instaslice-system","name":"wrk-5","reconcileID":"ff2dd559-f963-41fd-9a6b-c8ab7761ebc5","pod":"human-eval-deployment-job-xs4tz"}
{"level":"info","ts":"2024-11-27T09:24:15.514907103Z","caller":"controller/instaslice_daemonset.go:717","msg":"ConfigMap not found, creating for ","controller":"InstaSliceDaemonSet","controllerGroup":"inference.redhat.com","controllerKind":"Instaslice","Instaslice":{"name":"wrk-5","namespace":"instaslice-system"},"namespace":"instaslice-system","name":"wrk-5","reconcileID":"ff2dd559-f963-41fd-9a6b-c8ab7761ebc5","name":"7d9c93e0-e9b0-4562-8b12-8678019a0a5b","migGPUUUID":"MIG-ae3fdfff-d866-5cd6-a79d-94902fa9c5a0"}
{"level":"info","ts":"2024-11-27T09:24:15.534308794Z","caller":"controller/instaslice_daemonset.go:251","msg":"done creating mig slice for ","controller":"InstaSliceDaemonSet","controllerGroup":"inference.redhat.com","controllerKind":"Instaslice","Instaslice":{"name":"wrk-5","namespace":"instaslice-system"},"namespace":"instaslice-system","name":"wrk-5","reconcileID":"ff2dd559-f963-41fd-9a6b-c8ab7761ebc5","pod":"human-eval-deployment-job-xs4tz","parentgpu":"GPU-15ea50a3-01fd-b823-2c66-0e247db67a7d","miguuid":"MIG-ae3fdfff-d866-5cd6-a79d-94902fa9c5a0"}

Pod running with 4g.20gb:

[abharath@abharath-thinkpadt14sgen2i nerc]$ oc get pods -n kruize-gpu-rec-apply
NAME                              READY   STATUS    RESTARTS   AGE
human-eval-deployment-job-xs4tz   1/1     Running   0          2m15s

Pod describe:

[abharath@abharath-thinkpadt14sgen2i nerc]$ oc describe pod human-eval-deployment-job-xs4tz -n kruize-gpu-rec-apply
Name:             human-eval-deployment-job-xs4tz
Namespace:        kruize-gpu-rec-apply
Priority:         0
Service Account:  default
Node:             wrk-5/192.168.50.93
Start Time:       Wed, 27 Nov 2024 14:54:16 +0530
Labels:           batch.kubernetes.io/controller-uid=e6faf00e-7d7e-4026-bac1-ea4c42d868d8
                  batch.kubernetes.io/job-name=human-eval-deployment-job
                  controller-uid=e6faf00e-7d7e-4026-bac1-ea4c42d868d8
                  job-name=human-eval-deployment-job
Annotations:      k8s.ovn.org/pod-networks:
                    {"default":{"ip_addresses":["10.129.4.53/23"],"mac_address":"0a:58:0a:81:04:35","gateway_ips":["10.129.4.1"],"routes":[{"dest":"10.128.0.0...
                  k8s.v1.cni.cncf.io/network-status:
                    [{
                        "name": "ovn-kubernetes",
                        "interface": "eth0",
                        "ips": [
                            "10.129.4.53"
                        ],
                        "mac": "0a:58:0a:81:04:35",
                        "default": true,
                        "dns": {}
                    }]
                  openshift.io/scc: restricted-v2
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Running
SeccompProfile:   RuntimeDefault
IP:               10.129.4.53
IPs:
  IP:           10.129.4.53
Controlled By:  Job/human-eval-deployment-job
Containers:
  human-eval-benchmark:
    Container ID:   cri-o://b5747f5fc8ee06c42757f433742e6d50f9d183d32d9e868c9d71e5a528d4231c
    Image:          quay.io/kruizehub/human-eval-deployment:latest
    Image ID:       quay.io/kruizehub/human-eval-deployment@sha256:002649f767f242834c7349fd01d85f9929ef215fe7676bdf3cbc832049a130fd
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Wed, 27 Nov 2024 14:54:21 +0530
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                                             2
      instaslice.redhat.com/accelerator-memory-quota:  20Gi
      instaslice.redhat.com/mig-4g.20gb:               1
      memory:                                          4Gi
    Requests:
      cpu:                                             2
      instaslice.redhat.com/accelerator-memory-quota:  20Gi
      instaslice.redhat.com/mig-4g.20gb:               1
      memory:                                          4Gi
    Environment Variables from:
      7d9c93e0-e9b0-4562-8b12-8678019a0a5b  ConfigMap  Optional: false
    Environment:
      num_prompts:  20000
    Mounts:
      /.cache/huggingface from cache-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dz6v9 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       True 
  ContainersReady             True 
  PodScheduled                True 
Volumes:
  cache-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  cache-pvc
    ReadOnly:   false
  kube-api-access-dz6v9:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Guaranteed
Node-Selectors:              kubernetes.io/hostname=wrk-5
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age    From                     Message
  ----     ------                  ----   ----                     -------
  Warning  FailedScheduling        2m48s  default-scheduler        0/9 nodes are available: persistentvolumeclaim "cache-pvc" not found. preemption: 0/9 nodes are available: 9 Preemption is not helpful for scheduling.
  Normal   Scheduled               2m46s  default-scheduler        Successfully assigned kruize-gpu-rec-apply/human-eval-deployment-job-xs4tz to wrk-5
  Normal   SuccessfulAttachVolume  2m46s  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-ba228d4b-9bca-4b31-9d71-1f670fb54427"
  Normal   AddedInterface          2m43s  multus                   Add eth0 [10.129.4.53/23] from ovn-kubernetes
  Normal   Pulled                  2m43s  kubelet                  Container image "quay.io/kruizehub/human-eval-deployment:latest" already present on machine
  Normal   Created                 2m42s  kubelet                  Created container human-eval-benchmark
  Normal   Started                 2m42s  kubelet                  Started container human-eval-benchmark
@asm582
Contributor

asm582 commented Nov 27, 2024

Thanks for this issue. Can you share the nvidia-smi -L output before and after slice creation?

@asm582
Contributor

asm582 commented Nov 27, 2024

FYI, using the main branch on a KinD cluster, I am able to create a 7g.40gb slice:

nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-31cfe05c-ed13-cd17-d7aa-c63db5108c24)
  MIG 7g.40gb     Device  0: (UUID: MIG-bd1776d4-5118-545c-8e87-30fde4a42225)
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-8d042338-e67f-9c48-92b4-5b55c7e5133c)
(base) openstack@netsres62:~/asmalvan/gpu_pack/instaslice-operator$ kubectl describe pod
Name:             cuda-vectoradd-0
Namespace:        default
Priority:         0
Service Account:  default
Node:             kind-control-plane/172.18.0.2
Start Time:       Wed, 27 Nov 2024 04:40:53 -0500
Labels:           <none>
Annotations:      <none>
Status:           Running
IP:               10.244.0.27
IPs:
  IP:  10.244.0.27
Containers:
  cuda-vectoradd-0:
    Container ID:  containerd://967df508228e456d9f83312dbf254c5e146a4c2281aff48deff886e7b3dffb5d
    Image:         quay.io/tardieu/vectoradd:0.1.0
    Image ID:      quay.io/tardieu/vectoradd@sha256:4d8d95ec884480d489056f3a8b202d4aeea744e4a0a481a20b90009614d40244
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      nvidia-smi -L; ./vectorAdd && sleep 1800
    State:          Running
      Started:      Wed, 27 Nov 2024 04:41:01 -0500
    Ready:          True
    Restart Count:  0
    Limits:
      instaslice.redhat.com/accelerator-memory-quota:  40Gi
      instaslice.redhat.com/mig-7g.40gb:               1
    Requests:
      instaslice.redhat.com/accelerator-memory-quota:  40Gi
      instaslice.redhat.com/mig-7g.40gb:               1
    Environment Variables from:
      698f3e41-8f19-46f0-82f0-bd759fcb478f  ConfigMap  Optional: false
    Environment:                            <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dprt9 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       True 
  ContainersReady             True 
  PodScheduled                True 
Volumes:
  kube-api-access-dprt9:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/hostname=kind-control-plane
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  11s   default-scheduler  Successfully assigned default/cuda-vectoradd-0 to kind-control-plane
  Normal  Pulling    10s   kubelet            Pulling image "quay.io/tardieu/vectoradd:0.1.0"
  Normal  Pulled     4s    kubelet            Successfully pulled image "quay.io/tardieu/vectoradd:0.1.0" in 6.064s (6.064s including waiting). Image size: 30691624 bytes.
  Normal  Created    3s    kubelet            Created container cuda-vectoradd-0
  Normal  Started    3s    kubelet            Started container cuda-vectoradd-0

@bharathappali
Author

Thanks @asm582, I'll try with a main-branch build.

@asm582
Contributor

asm582 commented Jan 6, 2025

@bharathappali are you still seeing this issue?
