omdb: Show clickhouse keeper membership in inventory #6950

Open
andrewjstone wants to merge 6 commits into main

Conversation

@andrewjstone (Contributor) commented Oct 29, 2024

Also fix a bug related to DNS lookup of clickhouse-admin-keeper servers.

Fixes #6945
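
For orientation, the collection step could be structured roughly like the sketch below. This is an illustration only, not the code in this PR: `KeeperAdminClient`, `ClusterMembership`, and `resolve_keeper_admin_addrs` are hypothetical names, and a failure to reach any individual keeper is simply skipped so that a partial view still lands in inventory.

```rust
use std::collections::BTreeMap;
use std::net::SocketAddrV6;

/// Hypothetical view of the keeper cluster as reported by one keeper.
#[derive(Debug, Clone)]
struct ClusterMembership {
    queried_keeper: u64,
    leader_committed_log_index: u64,
    raft_config: Vec<u64>,
}

/// Hypothetical client for one clickhouse-admin-keeper server.
struct KeeperAdminClient {
    addr: SocketAddrV6,
}

impl KeeperAdminClient {
    fn new(addr: SocketAddrV6) -> Self {
        Self { addr }
    }

    /// Ask this keeper for its view of the cluster membership.
    fn cluster_membership(&self) -> Result<ClusterMembership, String> {
        // A real client would make a request to the
        // clickhouse-admin-keeper server here.
        Err(format!("not implemented for {}", self.addr))
    }
}

/// Hypothetical internal-DNS lookup of all clickhouse-admin-keeper servers.
fn resolve_keeper_admin_addrs() -> Result<Vec<SocketAddrV6>, String> {
    Ok(vec![])
}

/// Collect each reachable keeper's view of the cluster, keyed by keeper id.
fn collect_keeper_membership() -> Result<BTreeMap<u64, ClusterMembership>, String> {
    let mut out = BTreeMap::new();
    for addr in resolve_keeper_admin_addrs()? {
        let client = KeeperAdminClient::new(addr);
        // A single unreachable keeper shouldn't sink the whole collection;
        // record the answers we do get and move on.
        if let Ok(membership) = client.cluster_membership() {
            out.insert(membership.queried_keeper, membership);
        }
    }
    Ok(out)
}

fn main() {
    println!("{:?}", collect_keeper_membership());
}
```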

@andrewjstone changed the title from "wip: collect keepers" to "omdb: Show clickhouse keeper membership in inventory" on Oct 30, 2024
Comment on lines 151 to 159
warn!(
opctx.log,
concat!(
"Failed to lookup ClickhouseAdminKeeper in internal DNS:",
" {}. Is it enabled via policy?"
),
err
);
vec![]
@davepacheco (Collaborator):

Is it easy to distinguish here between "found nothing" vs. some other DNS error? It seems like we'd want to treat some other DNS error the way we treat an error to look up MGS.

I don't love adding a warning-level message for a condition that's expected on many systems today. People might reasonably think something is wrong (especially since it's going to be repeated). Where's the place where we'd be in a position to know whether this is wrong? Would it be in Reconfigurator, when we look at this information? Maybe that's where the warning should go?

@andrewjstone (Contributor, Author):

> Is it easy to distinguish here between "found nothing" vs. some other DNS error? It seems like we'd want to treat some other DNS error the way we treat an error to look up MGS.

Good point. I'll return early in the case where the error is something other than "found nothing".

> I don't love adding a warning-level message for a condition that's expected on many systems today. People might reasonably think something is wrong (especially since it's going to be repeated). Where's the place where we'd be in a position to know whether this is wrong? Would it be in Reconfigurator, when we look at this information? Maybe that's where the warning should go?

I went back and forth on this a bunch. Originally I had this as a debug message with a comment about changing it to a warning once we expect multi-node clickhouse to always be deployed. But even in that case, it won't be deployed until reconfigurator runs, so we'd see a few warning messages. That doesn't seem that bad; we have lots of warnings that work like that now. I changed it to a warning in this PR thinking that even with a GitHub issue I'd be likely to forget this altogether and never change it from debug to warning. All that being said, this did make me uncomfortable for the reasons you mention.

We have two extra checks we could make to see whether we should have DNS records: one is to check the clickhouse-policy in the datastore, which we could do easily enough here. But again, until reconfigurator runs, actually updates DNS, and the update propagates, we will still see this warning. The other is to do as you suggest and check the blueprint, but we wouldn't necessarily want to do that here, and I think it would have the same problem.

> Where's the place where we'd be in a position to know whether this is wrong? Would it be in Reconfigurator, when we look at this information? Maybe that's where the warning should go?

I put this here particularly because I was having trouble getting back records that reconfigurator put there. This is a nice local error that tells me where to look. I think putting it in reconfigurator is somewhat misguided, and in any case we don't know why the records didn't come back, just that they didn't. It could be a network partition, it could be that the keepers didn't start yet, or it could be DNS.
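
For illustration, here's a minimal sketch of the shape discussed above: treat "found nothing" as an empty list, and return early on any other DNS failure, the same way an MGS lookup failure would be treated. These are not the actual Omicron types; `ResolveError`, `InventoryError`, and `lookup_keeper_admin_addrs` are hypothetical stand-ins.

```rust
use std::net::SocketAddrV6;

/// Hypothetical resolver error: only `NotFound` means "no records exist".
#[derive(Debug)]
enum ResolveError {
    NotFound,
    Io(std::io::Error),
    Timeout,
}

#[derive(Debug)]
struct InventoryError(String);

/// Hypothetical stand-in for the internal-DNS lookup of
/// clickhouse-admin-keeper servers.
fn lookup_keeper_admin_addrs() -> Result<Vec<SocketAddrV6>, ResolveError> {
    // A real implementation would query internal DNS here.
    Err(ResolveError::NotFound)
}

/// "Found nothing" becomes an empty list; any other DNS failure is
/// propagated rather than silently turned into "no keepers".
fn keeper_admin_addrs() -> Result<Vec<SocketAddrV6>, InventoryError> {
    match lookup_keeper_admin_addrs() {
        Ok(addrs) => Ok(addrs),
        Err(ResolveError::NotFound) => Ok(vec![]),
        Err(err) => Err(InventoryError(format!(
            "failed to look up clickhouse-admin-keeper in internal DNS: {err:?}"
        ))),
    }
}

fn main() {
    match keeper_admin_addrs() {
        Ok(addrs) => println!("keeper admin addrs: {addrs:?}"),
        Err(err) => eprintln!("inventory error: {err:?}"),
    }
}
```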

@andrewjstone (Contributor, Author):

> Where's the place where we'd be in a position to know whether this is wrong? Would it be in Reconfigurator, when we look at this information? Maybe that's where the warning should go?

Oh, are you saying not just to check the inventory, but to actually do the DNS lookups during blueprint generation and see if they succeed? That's kind of odd since blueprint generation currently only happens on demand.

@davepacheco (Collaborator):

I'm not sure, honestly. What I was thinking is: ideally the warning should be in a place where we know both (1) whether there should be records, and (2) whether there were records. Here we know whether there were records, but not whether there should have been any, so I was thinking maybe the thing to do is to make sure the inventory reflects what happened and defer the warning to a place where we know whether it's wrong. I thought blueprint planning could know both of these things. But as I write this, that doesn't make sense either, since that would reflect the policy at planning time, not necessarily at the time the inventory was collected.

I guess this warning has to be heuristic in that the condition it's warning about is only wrong if the policy says the records should be there and they're not there and the system comes to rest in that state (i.e., it's not in the process of making it happen).

I'm not that worried about a couple of stray warnings while things are changing, just that these will be emitted every time inventory is collected. But I don't have a suggestion that seems obviously better. I probably would just leave it out, figuring it's a pretty uncommon case (though I know you just hit it, so that's probably not a credible argument -- but the cause there seems pretty unlikely). That said, I'm happy if you want to leave it in in case we see something like that again!

@andrewjstone (Contributor, Author):

Thanks for the comprehensive reply @davepacheco. Yeah, I actually think it's worth just removing at this point. I don't anticipate this breaking again and I'll have the mechanism to debug it without going to logs ;)

Once again, I appreciate the thoughtfulness!

@andrewjstone (Contributor, Author):

I fixed this up as discussed in dffe13e.

I'm launching to test in a4x2 now.

@andrewjstone (Contributor, Author):

Testing in a4x2 was successful. Inventory retrieved.

KEEPER MEMBERSHIP

    queried keeper: 1
    leader_committed_log_index: 21
    raft config: 1, 2, 3, 4, 5

    queried keeper: 2
    leader_committed_log_index: 27
    raft config: 1, 2, 3, 4, 5

    queried keeper: 3
    leader_committed_log_index: 33
    raft config: 1, 2, 3, 4, 5

    queried keeper: 4
    leader_committed_log_index: 45
    raft config: 1, 2, 3, 4, 5

    queried keeper: 5
    leader_committed_log_index: 59
    raft config: 1, 2, 3, 4, 5
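
For context, each entry in the output above could be modeled roughly like the sketch below. The field names simply mirror the printed labels; this is not Omicron's actual inventory type.

```rust
use std::collections::BTreeSet;

/// Illustrative model of one "queried keeper" entry from the output above.
#[derive(Debug)]
struct KeeperMembership {
    /// The keeper that was asked for its view of the cluster.
    queried_keeper: u64,
    /// The last committed raft log index as seen by this keeper.
    leader_committed_log_index: u64,
    /// The keeper IDs in the raft configuration known to this keeper.
    raft_config: BTreeSet<u64>,
}

fn main() {
    // First entry from the a4x2 output above.
    let entry = KeeperMembership {
        queried_keeper: 1,
        leader_committed_log_index: 21,
        raft_config: [1, 2, 3, 4, 5].into_iter().collect(),
    };
    println!("queried keeper: {}", entry.queried_keeper);
    println!("leader_committed_log_index: {}", entry.leader_committed_log_index);
    println!("raft config: {:?}", entry.raft_config);
}
```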

Successfully merging this pull request may close these issues:

Keeper expungement failing on a4x2