
DNS propagation broken in newly-deployed systems #6951

Closed
davepacheco opened this issue Oct 29, 2024 · 1 comment · Fixed by #6956

@davepacheco (Collaborator)

tl;dr: It looks like #6794 changed RSS such that the initial internal DNS configuration has the wrong port number in the SRV records for the internal DNS servers. Since DNS propagation relies on these records, DNS propagation doesn't work, which unfortunately makes it hard to fix a system that has this problem. I do not believe this can ever affect systems deployed prior to #6794, which includes dogfood, colo, and all existing customer systems.


@andrewjstone reported in chat that while testing #6950 (which has minimal changes from "main" for the purpose of this ticket), he had code that was trying to look up newly-added clickhouse admin DNS records, but they weren't found, even though Reconfigurator had written them to DNS. He was seeing:

22:15:09.033Z WARN 80018545-6637-4afa-aaec-bc1f4e7a00f9 (ServerContext): Failed to lookup ClickhouseAdminKeeper in internal DNS: no record found for Query { name: Name("_clickhouse-admin-keeper._tcp.control-plane.oxide.internal."), query_type: SRV, query_class: IN }. Is it enabled via policy?
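
(For reference, a minimal standalone sketch of this kind of SRV lookup. This is not the omicron resolver code; it's an illustration using the hickory-resolver crate, pointed at the same server address queried with dig below:)

use hickory_resolver::config::{
    NameServerConfig, Protocol, ResolverConfig, ResolverOpts,
};
use hickory_resolver::TokioAsyncResolver;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // One of the rack's internal DNS servers (same one queried with dig below).
    let dns_addr: std::net::SocketAddr = "[fd00:1122:3344:1::1]:53".parse()?;
    let mut config = ResolverConfig::new();
    config.add_name_server(NameServerConfig::new(dns_addr, Protocol::Udp));
    let resolver = TokioAsyncResolver::tokio(config, ResolverOpts::default());

    // On an affected system this fails with "no record found" (NXDOMAIN),
    // even though the records exist in the database at generation 2.
    let srv = resolver
        .srv_lookup("_clickhouse-admin-keeper._tcp.control-plane.oxide.internal.")
        .await?;
    for record in srv.iter() {
        println!("port {} target {}", record.port(), record.target());
    }
    Ok(())
}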

Querying the DNS servers directly, we were not seeing the records:

root@oxz_switch:~# dig -t SRV _clickhouse-admin-keeper._tcp.control-plane.oxide.internal @fd00:1122:3344:1::1

; <<>> DiG 9.18.14 <<>> -t SRV _clickhouse-admin-keeper._tcp.control-plane.oxide.internal @fd00:1122:3344:1::1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 49813
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;_clickhouse-admin-keeper._tcp.control-plane.oxide.internal. IN SRV

;; Query time: 0 msec
;; SERVER: fd00:1122:3344:1::1#53(fd00:1122:3344:1::1) (UDP)
;; WHEN: Tue Oct 29 22:47:24 UTC 2024
;; MSG SIZE  rcvd: 76

but they were present in the database, as shown by omdb db dns names internal 2 (where 2 is the latest internal DNS generation, found via omdb db dns show):

...
 _clickhouse-admin-keeper._tcp                      (records: 5)
      SRV  port  8888 21935905-6d28-4716-9619-a2f5e541e292.host.control-plane.oxide.internal
      SRV  port  8888 2ed19103-3b7e-4992-a99e-0152186e2546.host.control-plane.oxide.internal
      SRV  port  8888 4d87e4c7-d268-43c5-8f28-8fe6963bc509.host.control-plane.oxide.internal
      SRV  port  8888 77d707ee-1b07-400e-ac4a-aeb54d6250f1.host.control-plane.oxide.internal
      SRV  port  8888 943b471f-7790-43ae-9904-d536c5053781.host.control-plane.oxide.internal

so we looked at DNS propagation status:

root@oxz_switch:~# omdb nexus background-tasks show dns_internal
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:104::5]:12221
task: "dns_config_internal"
  configured period: every 1m
  currently executing: no
  last completed activation: iter 71, triggered by a periodic timer firing
    started at 2024-10-29T22:51:55.341Z (12s ago) and ran for 398ms
    last generation found: 2

task: "dns_servers_internal"
  configured period: every 1m
  currently executing: no
  last completed activation: iter 71, triggered by a periodic timer firing
    started at 2024-10-29T22:51:55.340Z (12s ago) and ran for 2ms
    servers found: 3

      DNS_SERVER_ADDR
      [fd00:1122:3344:1::1]:53
      [fd00:1122:3344:2::1]:53
      [fd00:1122:3344:3::1]:53

task: "dns_propagation_internal"
  configured period: every 1m
  currently executing: no
  last completed activation: iter 73, triggered by a periodic timer firing
    started at 2024-10-29T22:51:55.241Z (12s ago) and ran for 364ms
    attempt to propagate generation: 2

      DNS_SERVER_ADDR          LAST_RESULT
      [fd00:1122:3344:1::1]:53 error (see below)
      [fd00:1122:3344:2::1]:53 error (see below)
      [fd00:1122:3344:3::1]:53 error (see below)

    error: server [fd00:1122:3344:1::1]:53: failed to propagate DNS generation 2 to server [fd00:1122:3344:1::1]:53: Communication Error: error sending request for url (http://[fd00:1122:3344:1::1]:53/config): error sending request for url (http://[fd00:1122:3344:1::1]:53/config): client error (Connect): tcp connect error: Connection refused (os error 146): Connection refused (os error 146)
    error: server [fd00:1122:3344:2::1]:53: failed to propagate DNS generation 2 to server [fd00:1122:3344:2::1]:53: Communication Error: error sending request for url (http://[fd00:1122:3344:2::1]:53/config): error sending request for url (http://[fd00:1122:3344:2::1]:53/config): client error (Connect): tcp connect error: Connection refused (os error 146): Connection refused (os error 146)
    error: server [fd00:1122:3344:3::1]:53: failed to propagate DNS generation 2 to server [fd00:1122:3344:3::1]:53: Communication Error: error sending request for url (http://[fd00:1122:3344:3::1]:53/config): error sending request for url (http://[fd00:1122:3344:3::1]:53/config): client error (Connect): tcp connect error: Connection refused (os error 146): Connection refused (os error 146)

That's interesting -- I've never seen that fail. It's weird to get "Connection refused" -- we might think the DNS server process crashed, except that we'd just successfully queried it. But the port number there is surprising: 53 is the DNS protocol port, not the one we run an HTTP server on. Indeed, in dogfood the addresses are:

task: "dns_servers_internal"
  configured period: every 1m
  currently executing: no
  last completed activation: iter 8808, triggered by a periodic timer firing
    started at 2024-10-29T22:50:18.557Z (49s ago) and ran for 1ms
    servers found: 3

      DNS_SERVER_ADDR            
      [fd00:1122:3344:1::1]:5353 
      [fd00:1122:3344:2::1]:5353 
      [fd00:1122:3344:3::1]:5353 

which makes more sense. So where do these come from? The main place I think of is inside Reconfigurator:

) => (ServiceName::InternalDns, http_address),

That code uses http_address, which carries the HTTP port, so it should be correct. Could we somehow be putting the wrong value into http_address? That value comes from RSS. It turns out we weren't putting the wrong value there, but we are specifying the wrong port in the initial DNS config. This changed in #6794:
https://github.com/oxidecomputer/omicron/pull/6794/files#diff-9ea2b79544fdd0a21914ea354fba0b3670258746b1350d900285445d399861e1R468

Prior to that change, we were putting the HTTP port in there. Now it puts the DNS port in there. In retrospect, it makes sense that this is the relevant code path (not the Reconfigurator one) since @andrewjstone's system was trying to get to generation 2, which means it must have been at generation 1, which would have been generated by RSS.

I'm hopeful this was just a typo and we can just change the port used in RSS (and not that something else in that PR is depending on this being the DNS port).
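
(The distinction at play: each internal DNS server has two sockets, a DNS-protocol address on port 53 and an HTTP address, 5353 in dogfood above, serving the /config API that propagation talks to. Here's a self-contained sketch of the invariant the fix needs to restore, with hypothetical types standing in for omicron's actual RSS/Reconfigurator code:)

use std::net::{Ipv6Addr, SocketAddrV6};

// Hypothetical stand-in for an internal DNS zone's config; the real omicron
// types differ, but the two-address shape is the point.
struct InternalDnsZone {
    /// Where the server answers DNS protocol queries (port 53).
    dns_address: SocketAddrV6,
    /// Where the server's HTTP config API (/config) listens (5353 in dogfood).
    http_address: SocketAddrV6,
}

/// The port to advertise in the SRV record for the internal DNS service.
/// DNS propagation connects to this port over HTTP, so it must be the HTTP
/// port; using zone.dns_address.port() here is exactly the bug described
/// above.
fn internal_dns_srv_port(zone: &InternalDnsZone) -> u16 {
    zone.http_address.port()
}

fn main() {
    let ip = Ipv6Addr::new(0xfd00, 0x1122, 0x3344, 1, 0, 0, 0, 1);
    let zone = InternalDnsZone {
        dns_address: SocketAddrV6::new(ip, 53, 0, 0),
        http_address: SocketAddrV6::new(ip, 5353, 0, 0),
    };
    println!(
        "DNS protocol at {}, config API at http://{}/config",
        zone.dns_address, zone.http_address
    );
    // The SRV record must advertise 5353, not 53.
    assert_eq!(internal_dns_srv_port(&zone), 5353);
}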

@jgallagher (Contributor)

Oof, sorry. I was afraid something like this might slip by, hence not merging #6794 until after R11 was out the door. I think this is indeed just a typo and should be a quick fix; I'll get it tested and then get a PR open.
