Upgrade to docker-ce-cli 5:27.0.3 breaks nomad #23523

Closed
ebarriosjr opened this issue Jul 9, 2024 · 10 comments

Comments

@ebarriosjr
Contributor

Nomad version

Nomad v1.8.1
BuildDate 2024-06-19T06:43:57Z
Revision 5022543

Operating system and Environment details

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.4 LTS
Release: 22.04
Codename: jammy

Issue

After upgrading docker-ce-cli from 5:27.0.2 to 5:27.0.3, Nomad breaks: no containers were deployed. Some of them failed with the error:
Constraint "missing network": 1 nodes excluded by filter
Others were trying to use IPv6 instead of IPv4.

Reproduction steps

Update docker-ce-cli to version 5:27.0.3 and reboot.

Expected Result

Nomad should be able to start Docker containers without issue.

Actual Result

No containers could be started.

@tgross
Member

tgross commented Jul 9, 2024

Hi @ebarriosjr! Nomad doesn't use the Docker CLI. From the package version number you've got there, I'm assuming you're using a downstream distribution and not Docker's own package? If I look at docker/cli@v27.0.2...v27.0.3 I see that they vendored the main moby/moby project at v27.0.3. And then if I look at the release notes for v27.0.3 I see some interesting suspects. So my guess is that dockerd itself was also upgraded by your package update? Before we go digging further, can you confirm that by providing the output of docker version?

@tgross changed the title from "Upgrade to docker-ce-cli 5:27.0.3 brakes nomad" to "Upgrade to docker-ce-cli 5:27.0.3 breaks nomad" Jul 9, 2024
@tgross
Member

tgross commented Jul 9, 2024

For what it's worth, I've upgraded my local environment to 27.0.3 and tested out a Nomad job with networking and wasn't able to reproduce any problems. Maybe there's something specific to your client configuration or job that you could share?

output of docker version
$ docker version
Client: Docker Engine - Community
 Version:           27.0.3
 API version:       1.46
 Go version:        go1.21.11
 Git commit:        7d4bcd8
 Built:             Sat Jun 29 00:03:03 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          27.0.3
  API version:      1.46 (minimum version 1.24)
  Go version:       go1.21.11
  Git commit:       662f78c
  Built:            Sat Jun 29 00:03:03 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.18
  GitCommit:        ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
 runc:
  Version:          1.7.18
  GitCommit:        v1.1.13-0-g58aa920
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

The other weird item here is this error Constraint "missing network": 1 nodes excluded by filter that you reported, because that suggests that there's something wrong with host fingerprinting of the network. And that doesn't involve Docker at all.

@MatthewJohn

MatthewJohn commented Jul 18, 2024

Yesterday, after building a new Nomad client, I found that the Connect Envoy sidecar ports are not being published correctly.
Nothing has changed in the setup except that newer packages have been installed.

From what I can see, the other clients were running 26.X of docker-ce and the new one is running 27.X. The other clients had package updates (mostly kernel and Docker to 27.X), and they've also started failing in the same way.

Happy to supply any info - from what I can see iptables has the entries for the allocations/ports, but getting connection refused.

The client was running 1.7.7. I have upgraded it to 1.8.1, but I'm still seeing the same issue.

I'm going to try downgrading Docker to see if it helps and will get back.

Matt

@tgross
Member

tgross commented Jul 18, 2024

Any chance you upgraded the host distro at the same time? There's an open issue around the bridge module having been baked-in rather than a DKM #23583 and that's hitting a known issue in our network fingerprinting. (Which previously only impacted niche OS distros.)

@ebarriosjr
Contributor Author

Hi @tgross, the output of my docker version command is:

Client: Docker Engine - Community
 Version:           27.0.2
 API version:       1.46
 Go version:        go1.21.11
 Git commit:        912c1dd
 Built:             Wed Jun 26 18:48:01 2024
 OS/Arch:           linux/arm64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          27.0.3
  API version:      1.46 (minimum version 1.24)
  Go version:       go1.21.11
  Git commit:       662f78c
  Built:            Sat Jun 29 00:02:44 2024
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          1.7.18
  GitCommit:        ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
 runc:
  Version:          1.7.18
  GitCommit:        v1.1.13-0-g58aa920
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

@tgross
Member

tgross commented Jul 19, 2024

Weird that your client and server don't match. But the server looks identical to what I've posted above. Any thoughts about the networking discussion above?

@ebarriosjr
Contributor Author

That's because I reverted the version of docker-ce-cli to 27.0.2. On 27.0.3, all the jobs that I have running on Nomad stop working with the missing network error.
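
For readers who want to confirm this kind of client/server mismatch without reading the output by eye, here is a minimal Python sketch. It assumes the plain-text layout of `docker version` shown in this thread; the function name `docker_versions` and the trimmed `SAMPLE` text are illustrative and not part of Docker or Nomad.

```python
# Sketch only: parses the human-readable `docker version` layout seen in this
# thread. The helper name and SAMPLE text are hypothetical illustrations.

SAMPLE = """\
Client: Docker Engine - Community
 Version:           27.0.2
 API version:       1.46

Server: Docker Engine - Community
 Engine:
  Version:          27.0.3
  API version:      1.46 (minimum version 1.24)
 containerd:
  Version:          1.7.18
"""


def docker_versions(output):
    """Return {'client': ..., 'engine': ...} from `docker version` text."""
    versions = {}
    section = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("Client:"):
            section = "client"
        elif line == "Engine:":
            section = "engine"
        elif line.endswith(":"):
            # Other component headers (containerd:, runc:, docker-init:).
            section = None
        elif section and line.startswith("Version:"):
            # Keep only the first Version line seen in each section.
            versions.setdefault(section, line.split(":", 1)[1].strip())
    return versions


def versions_match(output):
    """True when both versions were found and they are equal."""
    found = docker_versions(output)
    return "client" in found and found.get("client") == found.get("engine")


if __name__ == "__main__":
    print(docker_versions(SAMPLE), versions_match(SAMPLE))
```

A mismatch on its own is not an error. As noted earlier in the thread, Nomad does not use the Docker CLI, so the engine version is the one that matters here.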

@MatthewJohn

MatthewJohn commented Jul 20, 2024

Any chance you upgraded the host distro at the same time? There's an open issue around the bridge module having been baked-in rather than a DKM #23583 and that's hitting a known issue in our network fingerprinting. (Which previously only impacted niche OS distros.)

Assuming this was aimed at me: I'm running Debian bookworm, which definitely hasn't changed. As I say, it could be something completely unrelated, but a port-forwarding issue would presumably be a Nomad client issue (as opposed to one with the Nomad servers, Consul, etc.). All the clients started failing after they were rebooted, and the only thing that had changed was package updates (plus a re-install, which included the latest Docker version).

I'm just following up on the downgrade to see if it helped :)

Matt

Edit: No, the downgrade didn't help - so probably completely unrelated. Apologies, I'll continue my investigation

Edit edit: Yes, please completely ignore me - mine was actually the Connect PKI root CA expiring (but it happened during a powerdown, so the effect was quite different - Envoy would start "happily" without any errors/warnings, but just didn't listen on any of the service ports!)

@tgross
Member

tgross commented Jul 23, 2024

Ok, thanks @MatthewJohn. So @ebarriosjr that leaves the networking, as I mentioned earlier:

The other weird item here is this error Constraint "missing network": 1 nodes excluded by filter that you reported, because that suggests that there's something wrong with host fingerprinting of the network. And that doesn't involve Docker at all.

#23583 suggests that something may have changed in the environment where the bridge kernel module is unavailable, but I'd expect to see a network still. For us to make further progress on this we'll need information from you on the network fingerprint (and/or client logs from the network fingerprinting), whether the distro has been updated, whether the kernel module is present, etc.
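
One of the data points requested above, whether the bridge kernel module is present, can be gathered with a short script. The sketch below is not Nomad's fingerprinting code. It assumes the standard Linux locations `/proc/modules` (loaded modules) and `/lib/modules/<release>/modules.builtin` (modules compiled into the kernel); the helper names are hypothetical.

```python
# Sketch only: classifies a kernel module as loaded, built in, or missing.
# Helper names are illustrative; this is not how Nomad itself checks.
import platform
from pathlib import Path


def bridge_module_status(proc_modules, modules_builtin, name="bridge"):
    """Return 'loaded', 'builtin', or 'missing' for the named module."""
    # /proc/modules: the module name is the first whitespace-separated field.
    for line in proc_modules.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return "loaded"
    # modules.builtin: one path per line, e.g. kernel/net/bridge/bridge.ko
    for line in modules_builtin.splitlines():
        filename = line.strip().rsplit("/", 1)[-1]
        # File names may use '-' where the module name uses '_'.
        if filename.split(".", 1)[0].replace("-", "_") == name:
            return "builtin"
    return "missing"


def read_text_or_empty(path):
    """Read a file, treating a missing or unreadable file as empty."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


if __name__ == "__main__":
    builtin_path = f"/lib/modules/{platform.release()}/modules.builtin"
    status = bridge_module_status(
        read_text_or_empty("/proc/modules"),
        read_text_or_empty(builtin_path),
    )
    print(f"bridge module: {status}")
```

Including the printed status alongside the client logs would help distinguish a built-in module from one that is absent.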

@ebarriosjr
Contributor Author

Hi @tgross, I just upgraded my system to the latest packages and the issue is gone. Maybe it was related to the bridge issue.
Thanks for taking a look.
