tl;dr: It looks like #6794 changed RSS such that the initial internal DNS configuration has the wrong port number in the SRV records for the internal DNS servers. Since DNS propagation relies on these records, DNS propagation doesn't work, which unfortunately makes it hard to fix a system that has this problem. I do not believe this can ever affect systems deployed prior to #6794, which includes dogfood, colo, and all existing customer systems.
@andrewjstone reported in chat that while testing #6950 (which has minimal changes from "main" for the purpose of this ticket), he had code that was trying to look up newly-added clickhouse admin DNS records, but they weren't found, even though Reconfigurator had written them to DNS. He was seeing:
22:15:09.033Z WARN 80018545-6637-4afa-aaec-bc1f4e7a00f9 (ServerContext): Failed to lookup ClickhouseAdminKeeper in internal DNS: no record found for Query { name: Name("_clickhouse-admin-keeper._tcp.control-plane.oxide.internal."), query_type: SRV, query_class: IN }. Is it enabled via policy?
Querying the DNS servers directly, we were not seeing the records:
root@oxz_switch:~# dig -t SRV _clickhouse-admin-keeper._tcp.control-plane.oxide.internal @fd00:1122:3344:1::1
; <<>> DiG 9.18.14 <<>> -t SRV _clickhouse-admin-keeper._tcp.control-plane.oxide.internal @fd00:1122:3344:1::1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 49813
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available
;; QUESTION SECTION:
;_clickhouse-admin-keeper._tcp.control-plane.oxide.internal. IN SRV
;; Query time: 0 msec
;; SERVER: fd00:1122:3344:1::1#53(fd00:1122:3344:1::1) (UDP)
;; WHEN: Tue Oct 29 22:47:24 UTC 2024
;; MSG SIZE rcvd: 76
but they were in the database, as shown by omdb db dns names internal 2 (where 2 is the latest internal DNS generation, found via omdb db dns show):
...
_clickhouse-admin-keeper._tcp (records: 5)
SRV port 8888 21935905-6d28-4716-9619-a2f5e541e292.host.control-plane.oxide.internal
SRV port 8888 2ed19103-3b7e-4992-a99e-0152186e2546.host.control-plane.oxide.internal
SRV port 8888 4d87e4c7-d268-43c5-8f28-8fe6963bc509.host.control-plane.oxide.internal
SRV port 8888 77d707ee-1b07-400e-ac4a-aeb54d6250f1.host.control-plane.oxide.internal
SRV port 8888 943b471f-7790-43ae-9904-d536c5053781.host.control-plane.oxide.internal
so we looked at DNS propagation status:
root@oxz_switch:~# omdb nexus background-tasks show dns_internal
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:104::5]:12221
task: "dns_config_internal"
configured period: every 1m
currently executing: no
last completed activation: iter 71, triggered by a periodic timer firing
started at 2024-10-29T22:51:55.341Z (12s ago) and ran for 398ms
last generation found: 2
task: "dns_servers_internal"
configured period: every 1m
currently executing: no
last completed activation: iter 71, triggered by a periodic timer firing
started at 2024-10-29T22:51:55.340Z (12s ago) and ran for 2ms
servers found: 3
DNS_SERVER_ADDR
[fd00:1122:3344:1::1]:53
[fd00:1122:3344:2::1]:53
[fd00:1122:3344:3::1]:53
task: "dns_propagation_internal"
configured period: every 1m
currently executing: no
last completed activation: iter 73, triggered by a periodic timer firing
started at 2024-10-29T22:51:55.241Z (12s ago) and ran for 364ms
attempt to propagate generation: 2
DNS_SERVER_ADDR LAST_RESULT
[fd00:1122:3344:1::1]:53 error (see below)
[fd00:1122:3344:2::1]:53 error (see below)
[fd00:1122:3344:3::1]:53 error (see below)
error: server [fd00:1122:3344:1::1]:53: failed to propagate DNS generation 2 to server [fd00:1122:3344:1::1]:53: Communication Error: error sending request for url (http://[fd00:1122:3344:1::1]:53/config): error sending request for url (http://[fd00:1122:3344:1::1]:53/config): client error (Connect): tcp connect error: Connection refused (os error 146): Connection refused (os error 146)
error: server [fd00:1122:3344:2::1]:53: failed to propagate DNS generation 2 to server [fd00:1122:3344:2::1]:53: Communication Error: error sending request for url (http://[fd00:1122:3344:2::1]:53/config): error sending request for url (http://[fd00:1122:3344:2::1]:53/config): client error (Connect): tcp connect error: Connection refused (os error 146): Connection refused (os error 146)
error: server [fd00:1122:3344:3::1]:53: failed to propagate DNS generation 2 to server [fd00:1122:3344:3::1]:53: Communication Error: error sending request for url (http://[fd00:1122:3344:3::1]:53/config): error sending request for url (http://[fd00:1122:3344:3::1]:53/config): client error (Connect): tcp connect error: Connection refused (os error 146): Connection refused (os error 146)
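For reference when reading these errors: the URL in each message is the DNS server's socket address with /config appended. The following is a minimal sketch of that relationship, not the actual Nexus code. The helper names config_url and dns_server_addr are hypothetical and exist only for illustration; the address is the first server from the output above.

```rust
use std::net::{Ipv6Addr, SocketAddrV6};

/// Hypothetical helper (not the actual Nexus code): build the URL that a
/// propagation attempt would contact for one DNS server address.
fn config_url(server: SocketAddrV6) -> String {
    // SocketAddrV6's Display impl already wraps the IPv6 address in brackets.
    format!("http://{}/config", server)
}

/// Hypothetical helper: socket address for the first internal DNS server
/// listed in the output above, with a caller-supplied port.
fn dns_server_addr(port: u16) -> SocketAddrV6 {
    let ip: Ipv6Addr = "fd00:1122:3344:1::1"
        .parse()
        .expect("valid IPv6 literal");
    // flowinfo and scope_id are left at 0; neither matters for this sketch.
    SocketAddrV6::new(ip, port, 0, 0)
}
```

Calling config_url(dns_server_addr(53)) reproduces the URL in the first error above, so whatever port the server list carries is the port the HTTP request goes to.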
That's interesting -- I've never seen that fail. It's weird to get "Connection refused" -- we might think the DNS server process crashed, except that we'd just successfully queried it. But the port number there is surprising: 53 is the DNS protocol port, not the one we run an HTTP server on. Indeed, in dogfood the addresses are:
task: "dns_servers_internal"
configured period: every 1m
currently executing: no
last completed activation: iter 8808, triggered by a periodic timer firing
started at 2024-10-29T22:50:18.557Z (49s ago) and ran for 1ms
servers found: 3
DNS_SERVER_ADDR
[fd00:1122:3344:1::1]:5353
[fd00:1122:3344:2::1]:5353
[fd00:1122:3344:3::1]:5353
which makes more sense. So where do these come from? The main place I think of is inside Reconfigurator:
omicron/nexus/types/src/deployment/execution/dns.rs, line 108 in d4263cb
That code uses the http_address, which has the HTTP port, so it should be correct. Are we somehow putting the wrong value into http_address? That comes from RSS. I found that we weren't putting the wrong value there, but we are specifying the wrong port in the initial DNS config. This changed in #6794: https://github.com/oxidecomputer/omicron/pull/6794/files#diff-9ea2b79544fdd0a21914ea354fba0b3670258746b1350d900285445d399861e1R468
Prior to that change, we were putting the HTTP port in there. Now it puts the DNS port in there. In retrospect, it makes sense that this is the relevant code path (not the Reconfigurator one) since @andrewjstone's system was trying to get to generation 2, which means it must have been at generation 1, which would have been generated by RSS.
I'm hopeful this was just a typo and we can just change the port used in RSS (and not that something else in that PR is depending on this being the DNS port).
Oof, sorry. I was afraid something like this might slip by, hence not merging #6794 until after R11 was out the door. I think this is indeed just a typo and should be a quick fix; I'll get it tested and then get a PR open.
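A regression check for this kind of mix-up might look like the sketch below. It is only an illustration under stated assumptions: InternalDnsZone, srv_port, and example_zone are hypothetical names, not types or functions from omicron, and the two port constants are simply the values quoted in this issue (53 from the failing system, 5353 from the dogfood output).

```rust
use std::net::{Ipv6Addr, SocketAddrV6};

/// Ports quoted in this issue (assumed here, not read from omicron).
const DNS_PORT: u16 = 53; // DNS protocol port
const DNS_HTTP_PORT: u16 = 5353; // port shown in the dogfood server list

/// Hypothetical, simplified view of one internal DNS zone.
struct InternalDnsZone {
    /// Where the server answers DNS queries.
    dns_address: SocketAddrV6,
    /// Where the server accepts configuration over HTTP.
    http_address: SocketAddrV6,
}

/// Hypothetical helper: the port the server's SRV record should carry so
/// that DNS propagation reaches the HTTP server rather than the DNS port.
fn srv_port(zone: &InternalDnsZone) -> u16 {
    zone.http_address.port()
}

/// Hypothetical fixture using the first server address from this issue.
fn example_zone() -> InternalDnsZone {
    let ip: Ipv6Addr = "fd00:1122:3344:1::1"
        .parse()
        .expect("valid IPv6 literal");
    InternalDnsZone {
        dns_address: SocketAddrV6::new(ip, DNS_PORT, 0, 0),
        http_address: SocketAddrV6::new(ip, DNS_HTTP_PORT, 0, 0),
    }
}
```

A test built on this would assert that srv_port(&example_zone()) equals the HTTP port and differs from the DNS port, which is the property the initial DNS config lost.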