-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhance HA "Taking over", "Handing over" logging #692
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LFTM! ↓
2024-03-19T14:31:26.547+0100 WARN main Handing over {"reason": "context deadline exceeded"}
1c1
< 2024-03-19T12:36:45.209Z INFO high-availability Another instance is active {"instance_id": "7322e2b8658e446094c79c26b241af94", "environment": "e9698ca8a2cd336a55a948ea51903878d13869bd", "heartbeat": 1710851800743, "heartbeat_age": "4.466613808s"}
---
> 2024-03-19T13:52:55.135+0100 DEBUG high-availability Another instance is active {"instance_id": "3525504452954c2eb7a2d4188c6cf01a", "environment": "e9698ca8a2cd336a55a948ea51903878d13869bd", "heartbeat": "2024-03-19T13:52:54.019+0100", "heartbeat_age": "1.116948947s"}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All in all, it's an improvement that the reasons are logged, but I don't like that the debug log now receives a lot of messages and that at least the first info message "another instance is active" is lost. Either we don't touch the levels and the timer altogether or we make proper logging for the four scenarios:
- takeover
- handover
- this is the active instance
- another instance is active
There must be at least one "another instance is active" message in the info logs. If this appears every minute, the user can quickly see that everything is fine. Also, I doubt that something like "this is the active instance" in the debug logs is helpful for debugging. Sure, if nothing is going on, then this at least shows that Icinga DB is not deadlocked or something, but then a timer is sufficient for that. This would be my suggestion:
- takeover: info, log reason
- handover: warn, log reason
- this is the active instance: debug, all x
- another instance is active: info, once and then all x
I don't mind to increase the timer to 5 minutes or something.
I guess a consensus is needed here. In an earlier review @yhabteab commented that minutely logs are too much noise. However, I think the "I am active" and the "Someone else is active" messages are being antagonistic and should be logged together, especially when a sysadmin wants to inspect what's going on. But I see your point, @lippserd, that it should be logged if someone else is being responsible. Thus, I would alter the PR again to reintroduce the continuous logs with a bigger delay. |
My comments were referring to the "I am active" part though, just like Eric's suggestion (this is the active instance: debug, all x) and since you seem to firmly think that both parts should be kept in sync, both ended up being demoted to debug log. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By doing so, both the "Taking over" and "Handing over" log messages are enriched with a reason.
Well, not really in case of the taking over message given that there's only one call of signalTakeover()
with the fixed reason "no other instance is active".
Most likely, that's due to the the fact that the current database query simply returns no rows in all takeover cases:
Lines 261 to 262 in 2c46830
query := h.db.Rebind("SELECT id, heartbeat FROM icingadb_instance "+ | |
"WHERE environment_id = ? AND responsible = ? AND id <> ? AND heartbeat > ?") + selectLock |
IIRC, one aspect we wanted to improve with the logging was to have distinct log messages based on whether there was a "graceful" takeover (i.e. the previously active instance removed its active state by itself) or a "forceful" takeover (i.e. the current instance took over because the previously active one seems to have died and its heartbeat expired).
To be able to log this, the checks on the responsible
and heartbeat
columns would have to be moved from the where
clause to Go code. At first glance, this sounds feasible, so I think it's worth a try.
Valid point, @julianbrost. I have altered the commit and pushed a new version. However, I don't see why the Furthermore, I have introduced a grace period for a forceful takeover in case of an expired heartbeat, as otherwise one node might perform its hand over slightly after the other node's take over, as the following log snippet shows:
|
As the old code indicated a takeover on the fact that no other instance is active, it will now additionally check if it is already being the active/responsible node. In this case, the takeover logic - which will be interrupted at a later point as the node is already responsible - can be skipped. diff --git a/pkg/icingadb/ha.go b/pkg/icingadb/ha.go
index f3bcad1..5c62268 100644
--- a/pkg/icingadb/ha.go
+++ b/pkg/icingadb/ha.go
@@ -302,11 +302,11 @@ func (h *HA) realize(
}
case errors.Is(errQuery, sql.ErrNoRows):
- takeover = "no other instance is active"
fields := []any{
zap.String("instance_id", h.instanceId.String()),
zap.String("environment", envId.String())}
if !h.responsible {
+ takeover = "no other instance is active"
h.logger.Debugw("Preparing to take over HA as no instance is active", fields...)
} else if h.responsible && shouldLogRoutineEvents {
h.logger.Debugw("Continuing being the active instance", fields...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, I don't see why the
responsible
column should be moved to the Go code. In the current logic only one row might beresponsible
, as during the takeover process an Icinga DB instances marks itself asresponsible
and all other instances as notresponsible
anymore within the same database transaction. If I'm missing something, please show it to me.
Not really. I was just thinking that all other instances would have to be selected and considered. Indeed, to get the desired information, it's enough to only select if there's an instance that hasn't retracted responsible = 'y'
.
The reason for a switch in the HA roles was not always directly clear. This change now introduces additional debug logging, indicating the reasoning for either taking over or handing over the HA responsibility. First, some logic was moved from the SQL query selecting active Icinga DB instances to Go code. This allowed distinguishing between no available responsible instances and responsible instances with an expired heartbeat. As the HA's peer timeout is logically bound to the Redis timeout, it will now reference this timeout with an additional grace timeout. Doing so eliminates a race between a handing over and a "forceful" take over. As the old code indicated a takeover on the fact that no other instance is active, it will now additionally check if it is already being the active/responsible node. In this case, the takeover logic - which will be interrupted at a later point as the node is already responsible - can be skipped. Next to the additional logging messages, both the takeover and handover channel are now transporting a string to communicate the reason instead of an empty struct{}. By doing so, both the "Taking over" and "Handing over" log messages are enriched with reason. This also required a change in the suppressed logging handling of the HA.realize method, which got its logging enabled through the shouldLog parameter. Now, there are both recurring events, which might be suppressed, as well as state changing events, which should be logged. Therefore, and because the logTicker's functionality was not clear to me on first glance, I renamed it to routineLogTicker. While dealing with the code, some function signature documentation were added, to ease both mine as well as the understanding of future readers. Additionally, the error handling of the SQL query selecting active Icinga DB instances was changed slightly to also handle wrapped sql.ErrNoRows errors. Closes #688.
The reason for a switch in the HA roles was not always directly clear. This change now introduces additional logging, indicating the reasoning for either taking over or handing over the HA responsibility.
Next to additional logging messages, both the takeover and handover channel are now transporting a string to communicate the reason instead of an empty struct{}. By doing so, both the "Taking over" and "Handing over" log messages are enriched with a reason.
This also required a change in the suppressed logging handling of the HA.realize method, which got its logging enabled through the shouldLog parameter. Now, there are both recurring events, which might be suppressed, as well as state changing events, which should be logged. Therefore, and because the logTicker's functionality was not clear to me on first glance, I renamed it to routineLogTicker.
While dealing with the code, some function signature documentation were added, to ease both mine as well as the understanding of future readers.
Closes #688.