omdb: Show clickhouse keeper membership in inventory #6950

Open
andrewjstone wants to merge 6 commits into main

Conversation

@andrewjstone (Contributor) commented Oct 29, 2024

Also fix a bug related to DNS lookup of clickhouse-admin-keeper servers.

Fixes #6945
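
For orientation, the collection step could be structured roughly like the sketch below. This is an illustration only, not the code in this PR: `KeeperAdminClient`, `ClusterMembership`, and `resolve_keeper_admin_addrs` are hypothetical names, and a failure to reach any individual keeper is simply skipped so that a partial view still lands in inventory.

```rust
use std::collections::BTreeMap;
use std::net::SocketAddrV6;

/// Hypothetical view of the keeper cluster as reported by one keeper.
#[derive(Debug, Clone)]
struct ClusterMembership {
    queried_keeper: u64,
    leader_committed_log_index: u64,
    raft_config: Vec<u64>,
}

/// Hypothetical client for one clickhouse-admin-keeper server.
struct KeeperAdminClient {
    addr: SocketAddrV6,
}

impl KeeperAdminClient {
    fn new(addr: SocketAddrV6) -> Self {
        Self { addr }
    }

    /// Ask this keeper for its view of the cluster membership.
    fn cluster_membership(&self) -> Result<ClusterMembership, String> {
        // A real client would make a request to the
        // clickhouse-admin-keeper server here.
        Err(format!("not implemented for {}", self.addr))
    }
}

/// Hypothetical internal-DNS lookup of all clickhouse-admin-keeper servers.
fn resolve_keeper_admin_addrs() -> Result<Vec<SocketAddrV6>, String> {
    Ok(vec![])
}

/// Collect each reachable keeper's view of the cluster, keyed by keeper id.
fn collect_keeper_membership() -> Result<BTreeMap<u64, ClusterMembership>, String> {
    let mut out = BTreeMap::new();
    for addr in resolve_keeper_admin_addrs()? {
        let client = KeeperAdminClient::new(addr);
        // A single unreachable keeper shouldn't sink the whole collection;
        // record the answers we do get and move on.
        if let Ok(membership) = client.cluster_membership() {
            out.insert(membership.queried_keeper, membership);
        }
    }
    Ok(out)
}

fn main() {
    println!("{:?}", collect_keeper_membership());
}
```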

@andrewjstone changed the title from "wip: collect keepers" to "omdb: Show clickhouse keeper membership in inventory" on Oct 30, 2024
Comment on lines 151 to 159
warn!(
opctx.log,
concat!(
"Failed to lookup ClickhouseAdminKeeper in internal DNS:",
" {}. Is it enabled via policy?"
),
err
);
vec![]
@davepacheco (Collaborator):

Is it easy to distinguish here between "found nothing" vs. some other DNS error? It seems like we'd want to treat some other DNS error the way we treat an error to look up MGS.

I don't love adding a warning-level message for a condition that's expected on many systems today. People might reasonably think something is wrong (especially since it's going to be repeated). Where's the place where we'd be in a position to know whether this is wrong? Would it be in Reconfigurator, when we look at this information? Maybe that's where the warning should go?

@andrewjstone (Contributor, Author):

> Is it easy to distinguish here between "found nothing" vs. some other DNS error? It seems like we'd want to treat some other DNS error the way we treat an error to look up MGS.

Good point. I'll return early in the case where the error is something other than "found nothing".

> I don't love adding a warning-level message for a condition that's expected on many systems today. People might reasonably think something is wrong (especially since it's going to be repeated). Where's the place where we'd be in a position to know whether this is wrong? Would it be in Reconfigurator, when we look at this information? Maybe that's where the warning should go?

I went back and forth on this a bunch. Originally I had this as a debug message with a comment about changing it to a warning once we expect multi-node clickhouse to always be deployed. But even in that case, it won't be deployed until reconfigurator runs, so we'd see a few warning messages. That doesn't seem that bad; we have lots of warnings that work like that now. I changed it to a warning in this PR thinking that even with a GitHub issue I'd be likely to forget this altogether and never change it from debug to warning. All that being said, this did make me uncomfortable for the reasons you mention.

We have two extra checks we could make to see whether we should have DNS records: one is to check the clickhouse-policy in the datastore, which we could do easily enough here. But again, until reconfigurator runs, actually updates DNS, and the update propagates, we will still see this warning. The other is to do as you suggest and check the blueprint, but we wouldn't necessarily want to do that here, and I think it would have the same problem.

> Where's the place where we'd be in a position to know whether this is wrong? Would it be in Reconfigurator, when we look at this information? Maybe that's where the warning should go?

I put this here particularly because I was having trouble getting back records that reconfigurator put there. This is a nice local error that tells me where to look. I think putting it in reconfigurator is somewhat misguided, and in any case we don't know why the records didn't come back, just that they didn't. It could be a network partition, it could be that the keepers didn't start yet, or it could be DNS.
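
For illustration, here's a minimal sketch of the shape discussed above: treat "found nothing" as an empty list, and return early on any other DNS failure, the same way an MGS lookup failure would be treated. These are not the actual Omicron types; `ResolveError`, `InventoryError`, and `lookup_keeper_admin_addrs` are hypothetical stand-ins.

```rust
use std::net::SocketAddrV6;

/// Hypothetical resolver error: only `NotFound` means "no records exist".
#[derive(Debug)]
enum ResolveError {
    NotFound,
    Io(std::io::Error),
    Timeout,
}

#[derive(Debug)]
struct InventoryError(String);

/// Hypothetical stand-in for the internal-DNS lookup of
/// clickhouse-admin-keeper servers.
fn lookup_keeper_admin_addrs() -> Result<Vec<SocketAddrV6>, ResolveError> {
    // A real implementation would query internal DNS here.
    Err(ResolveError::NotFound)
}

/// "Found nothing" becomes an empty list; any other DNS failure is
/// propagated rather than silently turned into "no keepers".
fn keeper_admin_addrs() -> Result<Vec<SocketAddrV6>, InventoryError> {
    match lookup_keeper_admin_addrs() {
        Ok(addrs) => Ok(addrs),
        Err(ResolveError::NotFound) => Ok(vec![]),
        Err(err) => Err(InventoryError(format!(
            "failed to look up clickhouse-admin-keeper in internal DNS: {err:?}"
        ))),
    }
}

fn main() {
    match keeper_admin_addrs() {
        Ok(addrs) => println!("keeper admin addrs: {addrs:?}"),
        Err(err) => eprintln!("inventory error: {err:?}"),
    }
}
```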

@andrewjstone (Contributor, Author):

> Where's the place where we'd be in a position to know whether this is wrong? Would it be in Reconfigurator, when we look at this information? Maybe that's where the warning should go?

Oh, are you saying not just to check the inventory, but to actually do the DNS lookups during blueprint generation and see if they succeed? That's kind of odd since blueprint generation currently only happens on demand.

@davepacheco (Collaborator):

I'm not sure, honestly. What I was thinking is: ideally the warning should be in a place where we know both (1) whether there should be records, and (2) whether there were records. Here we know whether there were records, but not whether there should have been any, so I was thinking maybe the thing to do is to make sure the inventory reflects what happened and defer the warning to a place where we know whether it's wrong. I thought blueprint planning could know both of these things. But as I write this, that doesn't make sense either, since that would reflect the policy at planning time, not necessarily at the time the inventory was collected.

I guess this warning has to be heuristic in that the condition it's warning about is only wrong if the policy says the records should be there and they're not there and the system comes to rest in that state (i.e., it's not in the process of making it happen).

I'm not that worried about a couple of stray warnings while things are changing, just that these will be emitted every time inventory is collected. But I don't have a suggestion that seems obviously better. I probably would just leave it out, figuring it's a pretty uncommon case (though I know you just hit it, so that's probably not a credible argument -- but the cause there seems pretty unlikely). That said, I'm happy if you want to leave it in in case we see something like that again!

@andrewjstone (Contributor, Author):

Thanks for the comprehensive reply @davepacheco. Yeah, I actually think it's worth just removing at this point. I don't anticipate this breaking again and I'll have the mechanism to debug it without going to logs ;)

Once again, I appreciate the thoughtfulness!

@andrewjstone (Contributor, Author):

I fixed this up as discussed in dffe13e.

I'm launching to test in a4x2 now.

@andrewjstone (Contributor, Author):

Testing in a4x2 was successful. Inventory retrieved.

KEEPER MEMBERSHIP

    queried keeper: 1
    leader_committed_log_index: 21
    raft config: 1, 2, 3, 4, 5

    queried keeper: 2
    leader_committed_log_index: 27
    raft config: 1, 2, 3, 4, 5

    queried keeper: 3
    leader_committed_log_index: 33
    raft config: 1, 2, 3, 4, 5

    queried keeper: 4
    leader_committed_log_index: 45
    raft config: 1, 2, 3, 4, 5

    queried keeper: 5
    leader_committed_log_index: 59
    raft config: 1, 2, 3, 4, 5
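
For context, each entry in the output above could be modeled roughly like the sketch below. The field names simply mirror the printed labels; this is not Omicron's actual inventory type.

```rust
use std::collections::BTreeSet;

/// Illustrative model of one "queried keeper" entry from the output above.
#[derive(Debug)]
struct KeeperMembership {
    /// The keeper that was asked for its view of the cluster.
    queried_keeper: u64,
    /// The last committed raft log index as seen by this keeper.
    leader_committed_log_index: u64,
    /// The keeper IDs in the raft configuration known to this keeper.
    raft_config: BTreeSet<u64>,
}

fn main() {
    // First entry from the a4x2 output above.
    let entry = KeeperMembership {
        queried_keeper: 1,
        leader_committed_log_index: 21,
        raft_config: [1, 2, 3, 4, 5].into_iter().collect(),
    };
    println!("queried keeper: {}", entry.queried_keeper);
    println!("leader_committed_log_index: {}", entry.leader_committed_log_index);
    println!("raft config: {:?}", entry.raft_config);
}
```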

Successfully merging this pull request may close these issues:

Keeper expungement failing on a4x2