csi-do-controller-0 CrashLoopBackOff: couldn't get metadata: Get "http://169.254.169.254/metadata/v1.json" #328
Other information: I am using OKD 4.5 on Fedora CoreOS 31. In the pod, 4 out of 5 containers are in state
The last one
On the worker, the csi.sock is not in:
but in
|
Hi @max3903 the error
is odd because it usually means that you are not running on DigitalOcean infrastructure (as the error indicates). However, I do see a DO region label on one of your Nodes. Can you confirm that you are indeed running on droplets? Can you connect to the metadata endpoint from your nodes? What might also be good to know: did you try to apply the manifests on a cluster that had a previous version of the CSI driver installed already, or was this a first-time CSI installation attempt? |
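A quick way to check whether the metadata endpoint is reachable is a small probe like the sketch below. The URL is taken from the error message in the issue title; the function names, the 2-second timeout, and the injectable `fetch` parameter are illustrative assumptions and not part of the CSI driver.

```python
import json
import urllib.request

# URL taken from the error message in the issue title.
METADATA_URL = "http://169.254.169.254/metadata/v1.json"


def http_fetch(url, timeout=2.0):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def probe_metadata(url=METADATA_URL, fetch=http_fetch):
    """Return (ok, detail); ok is True when the endpoint returned a JSON object.

    `fetch` is injectable (an assumption of this sketch) so the probe can be
    exercised without network access.
    """
    try:
        body = fetch(url)
    except OSError as exc:  # URLError and socket timeouts derive from OSError
        return False, f"request failed: {exc}"
    try:
        data = json.loads(body)
    except ValueError as exc:  # JSONDecodeError derives from ValueError
        return False, f"response is not JSON: {exc}"
    if not isinstance(data, dict):
        return False, "response is JSON but not an object"
    return True, f"received {len(data)} top-level keys"
```

Calling `probe_metadata()` once on the node itself and once from inside a pod may help tell apart a host-level problem from one that only affects the pod network.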
Hello @timoreimann
Yes, I am running on droplets built from a custom image: Fedora CoreOS 31 for DigitalOcean from https://getfedora.org/en/coreos/download?tab=cloud_operators&stream=stable
Yes, I can connect to the metadata endpoint from the 3 masters and 2 workers. That is actually how each droplet gets its hostname during the installation:
Yes, I tried to apply the manifests multiple times using different versions/URLs. |
So I ran:
I don't know if it helps but the container created from the DaemonSet is working fine on the same node. Only the one created from the StatefulSet is crashing... |
@max3903 The CSI driver in version 0.3.0 definitely does not support Kubernetes 1.17. (See also our support matrix.) If you installed that first, the subsequent 1.3.0 installation most likely failed because of unsupported (and broken) left-overs from 0.3.0. Can you try to install v1.3.0 from a clean slate, i.e., on a 1.17 cluster that does not have any other (older) CSI driver versions installed beforehand? |
Even after running:
? |
@timoreimann Installing the cluster was a pretty painful process that I would like to avoid repeating. I removed all the
and installed the correct version (1.3.0). I still get the same error. Which left-overs am I missing? |
Check for any snapshot-related CRDs that might be remaining ( |
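Left-over snapshot CRDs can be spotted by filtering a list of CRD names by API group. The sketch below assumes the snapshot CRDs belong to the `snapshot.storage.k8s.io` group; the function name is hypothetical, and the list of names is expected to come from wherever you list CRDs on your cluster.

```python
# Assumption: snapshot-related CRDs are registered under this API group.
SNAPSHOT_GROUP = "snapshot.storage.k8s.io"


def snapshot_crds(crd_names):
    """Return the CRD names that belong to the snapshot API group, sorted."""
    suffix = "." + SNAPSHOT_GROUP
    cleaned = (name.strip() for name in crd_names)
    return sorted(name for name in cleaned if name.endswith(suffix))
```

An empty result suggests that no snapshot CRDs from an earlier installation remain.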
I deleted them. No errors when running:
Still the same behavior on the controller, i.e., the pod csi-do-controller-0 remains in status CrashLoopBackOff. 4 out of 5 containers are in state Running but have this error message in the log:
The last one, csi-do-plugin (digitalocean/do-csi-plugin:v1.3.0), remains in state Waiting and the logs say:
If I replace the args at https://github.com/digitalocean/csi-digitalocean/blob/master/deploy/kubernetes/releases/csi-digitalocean-v1.3.0.yaml#L194 with:
I get this message in the logs of the container:
I tried to run the container on the worker:
I also tried to use curl to create a volume through the API from the same node and it worked:
The container from the same image on the same node from the DaemonSet is still working fine:
FYI, all droplets are Fedora CoreOS 31 in SFO3 with this workaround to set the hostname: |
The |
@timoreimann With help from @lucab and @dustymabe, I got it working by adding:
|
@max3903 glad you figured it out. 🎉 FWIW, the manifest you referenced (and had to amend) is what we use as-is for our end-to-end tests: we deploy it into a DOKS cluster and run the upstream e2e tests against it. I'm confused why it didn't work for you -- wondering if there's perhaps something specific about OKD (or DOKS) that explains the difference in behavior? |
@timoreimann Yes on the controller. @dustymabe mentioned that OpenShift has stricter security settings than base Kubernetes. |
Typically that is the case. Unfortunately I don't have enough expertise to know what those extra security defaults are or if that's the cause of the issues here. I just know enough to bring up that it could be the cause. |
This seems to be working for me with just the |
Right, privileged mode should be needed on the Node service only to allow mount propagation. I don't think we have it set on our Controller service manifest. |
If you'd like to submit a quick PR to document the need to run on host network in OKD (and perhaps leave a commented out |
Thanks @timoreimann. Do you think it would make sense to do it by default instead of having it commented out? |
@dustymabe the only platform I'm aware of at this point that requires host networking to be enabled on the Controller service seems to be OKD. So I'm more inclined to keeping it commented out for now. |
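For readers on OKD, enabling host networking on the Controller service amounts to setting `hostNetwork: true` in the pod template. The sketch below only builds such a merge patch body. It is an illustration under assumptions, not necessarily the exact change applied earlier in this thread, and the helper names are hypothetical.

```python
import json


def host_network_patch(enabled=True):
    """Build a merge patch that toggles hostNetwork on a workload's pod template."""
    return {"spec": {"template": {"spec": {"hostNetwork": enabled}}}}


def render(patch):
    """Serialize the patch compactly so it can be passed on a command line."""
    return json.dumps(patch, separators=(",", ":"))
```

The rendered string could then be supplied to `kubectl patch` with `--type merge` against the Controller StatefulSet. The name `csi-do-controller` is inferred from the pod name `csi-do-controller-0`, and the namespace depends on where the manifest was deployed.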
I changed the
It might be worth noting that OKD uses OVN networking: https://docs.openshift.com/container-platform/4.5/networking/ovn_kubernetes_network_provider/about-ovn-kubernetes.html. Unfortunately I don't know much about the networking side, so I'm a bit limited in understanding this. As a temporary workaround, this patch command should work for users:
Can we change the title of this to |
@dustymabe Done! |
👋 So I've run into this issue as well using K3s on DO. I was able to finally get things running with |
I can confirm that the workaround in #328 (comment) still works for me today. |
What did you do? (required. The issue will be closed when not provided.)
I followed the documentation to add the do-block-storage plugin:
I added the secret successfully and ran:
It fails on some snapshot-specific stuff:
I moved on (I believe it is fixed by #322) and tried to create a PVC.
What did you expect to happen?
I was expecting the PV to be created.
Configuration (MUST fill this out):
https://gist.github.com/max3903/acb18527be1138a33d77f3eaaddb89a8
secret.yaml:
pvc.yaml:
CSI version: 1.3.0
Kubernetes version: 1.17
Platform: OKD 4.5