[Bug] kube-proxy image version 1.27 causing the kube-proxy to fail #6991

Open
artemisia480 opened this issue Aug 21, 2023 · 12 comments
Labels
kind/bug, priority/important-longterm (important over the long term, but may not be currently staffed and/or may require multiple releases)

Comments

@artemisia480

What were you trying to accomplish?

Trying to deploy a new cluster, version 1.27, using eksctl. I am running the command: eksctl create cluster...

What happened?

I get the following error and the nodes for the cluster never come up. Looking at the logs inside the node, I see this error:
ErrImagePull: rpc error: code = Unknown desc = failed to pull and unpack image "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/kube-proxy:v1.27.1-minimal-eksbuild.1": failed to resolve reference "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/kube-proxy:v1.27.1-minimal-eksbuild.1": pulling from host 602401143452.dkr.ecr.us-east-1.amazonaws.com failed with status code [manifests v1.27.1-minimal-eksbuild.1]

How to reproduce it?

I am using a YAML file to deploy this, so I am not sure how you would reproduce it. But if you look at the AWS documentation here:
https://docs.aws.amazon.com/eks/latest/userguide/managing-kube-proxy.html
the image is meant to be eksbuild.2, not eksbuild.1.
and if you look at the eksctl code here: https://github.com/eksctl-io/eksctl/blob/c27d2e80f50aceb78c35c60b713f8e9267611dde/pkg/addons/default/kube_proxy.go#L150C1-L151
it is only calling eksbuild.1 and not 2.
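
To make the comparison above concrete, here is a minimal Python sketch that splits a kube-proxy image tag into its Kubernetes version and eksbuild number. The function name `parse_kube_proxy_tag` is hypothetical, and the full `eksbuild.2` tag is an assumption, since the report only gives the build suffix. It shows how the two tags differ; it does not establish which one should exist in the registry.

```python
import re

# Matches tags shaped like "v1.27.1-minimal-eksbuild.1"; "-minimal" is optional.
TAG_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(-minimal)?-eksbuild\.(\d+)$")


def parse_kube_proxy_tag(tag):
    """Split a kube-proxy image tag into version, variant, and build number."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized kube-proxy tag: {tag!r}")
    major, minor, patch, variant, build = match.groups()
    return {
        "version": (int(major), int(minor), int(patch)),
        "minimal": variant is not None,
        "eksbuild": int(build),
    }


# Tag taken from the error message in this issue.
pulled = parse_kube_proxy_tag("v1.27.1-minimal-eksbuild.1")
# Assumed full tag for the documented build; the issue only says "eksbuild.2".
documented = parse_kube_proxy_tag("v1.27.1-minimal-eksbuild.2")

same_version = pulled["version"] == documented["version"]
build_gap = documented["eksbuild"] - pulled["eksbuild"]
print(f"same Kubernetes version: {same_version}, eksbuild difference: {build_gap}")
```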

Logs
ErrImagePull: rpc error: code = Unknown desc = failed to pull and unpack image "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/kube-proxy:v1.27.1-minimal-eksbuild.1": failed to resolve reference "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/kube-proxy:v1.27.1-minimal-eksbuild.1": pulling from host 602401143452.dkr.ecr.us-east-1.amazonaws.com failed with status code [manifests v1.27.1-minimal-eksbuild.1]

Anything else we need to know?

Versions
1.27

$ eksctl info
@github-actions
Contributor

Hello artemisia480 👋 Thank you for opening an issue in eksctl project. The team will review the issue and aim to respond within 1-5 business days. Meanwhile, please read about the Contribution and Code of Conduct guidelines here. You can find out more information about eksctl on our website

@yoplait

yoplait commented Aug 21, 2023

Thanks @artemisia480, same problem here.

@cPu1
Collaborator

cPu1 commented Aug 22, 2023

I am running the command: eksctl create cluster...

@artemisia480 did you run any commands after eksctl create cluster, or did you try to update the image?

and if you look at the eksctl code here: https://github.com/eksctl-io/eksctl/blob/c27d2e80f50aceb78c35c60b713f8e9267611dde/pkg/addons/default/kube_proxy.go#L150C1-L151
it is only calling eksbuild.1 and not 2.

That codepath is not used in eksctl create cluster.

@cPu1
Collaborator

cPu1 commented Aug 22, 2023

I'm unable to reproduce this. I got the same image tag (602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/kube-proxy:v1.27.1-minimal-eksbuild.1) on a new cluster and it was pulled successfully.

Can you share your config file?

@artemisia480
Author

artemisia480 commented Aug 22, 2023

@cPu1, the code doesn't use it? Are you sure? The AWS documentation says to use eksbuild.2, and this clearly pulls eksbuild.1.
Here is my YAML file:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: ami-testing-cluster2
  version: "1.27"
  region: us-east-1

vpc:
  clusterEndpoints:
    publicAccess: true
    privateAccess: false

managedNodeGroups:
  - name: ami-testing2
    ami: <custom ami>
    amiFamily: AmazonLinux2
    instanceType: m6i.large
    volumeSize: 20
    disableIMDSv1: false
    ssh:
      allow: true
      publicKeyPath: ~/.ssh/id_rsa.pub
    overrideBootstrapCommand: |
      #!/bin/bash
      eks_register.sh ami-testing-cluster2
    iam:
      withAddonPolicies:
        externalDNS: true
        ebs: true
        autoScaler: true
        cloudWatch: false
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy

@a-hilaly
Member

Could this be an issue in a specific region? @artemisia480 do you have any clusters in other regions to confirm this?

@artemisia480
Author

@a-hilaly I'm not sure why it would be region-specific, but I can test a different region just to see.

@a-hilaly
Member

@artemisia480 I'm not really sure, but if it's a pull issue, maybe the image is not available in every region. Or are we using ECR Public here?
I'll try to reproduce the bug locally and update here.
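
For the region question, it may help to see that the failing reference already encodes a registry account and a region in its host name. The sketch below (the helper name `parse_ecr_reference` is hypothetical) splits the reference from the error message into those parts. It only parses the string; it does not check whether the image exists in any region.

```python
import re

# Private ECR references look like:
#   <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ECR_REF_RE = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:@]+):(?P<tag>[\w.-]+)$"
)


def parse_ecr_reference(ref):
    """Return the account, region, repository, and tag of an ECR image reference."""
    match = ECR_REF_RE.match(ref)
    if match is None:
        raise ValueError(f"not a tagged private ECR reference: {ref!r}")
    return match.groupdict()


# Reference copied from the ErrImagePull message in this issue.
parts = parse_ecr_reference(
    "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/kube-proxy:v1.27.1-minimal-eksbuild.1"
)
for key, value in parts.items():
    print(f"{key}: {value}")
```

Because the region is part of the host, a node in another region pulls from a different registry endpoint, which is why testing a second region is a reasonable way to rule this out.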

@a-hilaly
Member

@artemisia480 I haven't been able to reproduce your issue through 4/5 creations in different regions... Maybe this is an issue with the custom AMI?

@artemisia480
Author

@a-hilaly thanks for testing that! I am starting to think it is the custom AMI after all. I am not sure what, though. I had the following flags in the AMI for 1.26, which I have now removed for 1.27:
KUBELET_EKS_ARGS=--node-ip=192.168.22.222
--pod-infra-container-image=602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause-amd64:3.1
--cloud-provider aws
--config /etc/kubernetes/kubelet.json
--kubeconfig /etc/kubernetes/kubeconfig
--container-runtime remote
--container-runtime-endpoint unix:///var/run/containerd/containerd.sock

I also added the flag:
--seccomp-default=unconfined.

But having no luck.

@a-hilaly
Member

a-hilaly commented Sep 5, 2023

Do you run any extra commands after creating the cluster? Any DaemonSet updates?

@whereisaaron

@artemisia480 I got a similar error when I added a containerd node group to an EKS 1.23 cluster. The containerd nodes could not pull an ECR image and reported the pull-failed error below, but the dockerd nodes in the same cluster could pull the exact same image. My test cluster was in a VPC that did not have an ECR endpoint, in case that is relevant.

There seems to be something extra that containerd nodes need. @a-hilaly any idea what that might be?

Pulling image "XXXXXX.dkr.ecr.ap-southeast-2.amazonaws.com/mycontainer:1.0.1"
Warning Failed 8s (x3 over 47s) kubelet Failed to pull image "XXXXXX.dkr.ecr.ap-southeast-2.amazonaws.com/mycontainer:1.0.1": rpc error: code = NotFound desc = failed to pull and unpack image "XXXXXXX.dkr.ecr.ap-southeast-2.amazonaws.com/mycontainer:1.0.1": failed to copy: httpReadSeeker: failed open: could not fetch content descriptor sha256:d713dedd5b37c3ffea46d23c7933cc173c7755c789eab3bc60ea374cb5af740f (application/vnd.docker.distribution.manifest.v1+json) from remote: not found
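
One detail in the containerd error above is the media type of the descriptor it failed to fetch. The following Python sketch (the helper `extract_descriptor` is hypothetical) pulls the digest and media type out of such a message and notes whether the media type is the legacy Docker schema 1 manifest type. Whether that format is the actual cause here is an assumption that this thread does not confirm.

```python
import re

# Captures the digest and media type from containerd's
# "could not fetch content descriptor <digest> (<media type>)" message.
DESCRIPTOR_RE = re.compile(
    r"could not fetch content descriptor (sha256:[0-9a-f]+) \(([^)]+)\)"
)

# Media type prefix shared by the legacy Docker schema 1 manifest variants.
SCHEMA1_PREFIX = "application/vnd.docker.distribution.manifest.v1"


def extract_descriptor(message):
    """Return the digest and media type from a pull error, or None if absent."""
    match = DESCRIPTOR_RE.search(message)
    if match is None:
        return None
    media_type = match.group(2)
    return {
        "digest": match.group(1),
        "media_type": media_type,
        "is_schema1": media_type.startswith(SCHEMA1_PREFIX),
    }


# Tail of the kubelet event quoted above.
message = (
    "failed to copy: httpReadSeeker: failed open: could not fetch content descriptor "
    "sha256:d713dedd5b37c3ffea46d23c7933cc173c7755c789eab3bc60ea374cb5af740f "
    "(application/vnd.docker.distribution.manifest.v1+json) from remote: not found"
)
result = extract_descriptor(message)
print(result)
```

If `is_schema1` is true for an image that dockerd nodes can still pull, comparing the manifest format the registry serves for that tag would be a reasonable next step.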

@Himangini added the priority/important-longterm label on Sep 12, 2023

6 participants