Enhance HA "Taking over", "Handing over" logging #692

oxzi · 2024-03-12T12:10:04Z

The reason for a switch in the HA roles was not always directly clear. This change now introduces additional logging, indicating the reasoning for either taking over or handing over the HA responsibility.

Next to additional logging messages, both the takeover and handover channel are now transporting a string to communicate the reason instead of an empty struct{}. By doing so, both the "Taking over" and "Handing over" log messages are enriched with a reason.

This also required a change in the suppressed logging handling of the HA.realize method, which got its logging enabled through the shouldLog parameter. Now, there are both recurring events, which might be suppressed, as well as state changing events, which should be logged. Therefore, and because the logTicker's functionality was not clear to me on first glance, I renamed it to routineLogTicker.

While dealing with the code, some function signature documentation were added, to ease both mine as well as the understanding of future readers.

Closes #688.

pkg/icingadb/ha.go

oxzi · 2024-03-19T11:22:45Z

Additionally to addressing @yhabteab's comments, the error handling of the SQL query selecting Icinga DB instances was changed slightly to also handle wrapped sql.ErrNoRows errors.

icingadb/pkg/icingadb/ha.go

Line 294 in 4e4250e

case errors.Is(errQuery, sql.ErrNoRows):

yhabteab

LFTM! ↓

2024-03-19T14:31:26.547+0100    WARN    main    Handing over    {"reason": "context deadline exceeded"}

1c1
< 2024-03-19T12:36:45.209Z        INFO    high-availability       Another instance is active      {"instance_id": "7322e2b8658e446094c79c26b241af94", "environment": "e9698ca8a2cd336a55a948ea51903878d13869bd", "heartbeat": 1710851800743, "heartbeat_age": "4.466613808s"}
---
> 2024-03-19T13:52:55.135+0100    DEBUG   high-availability       Another instance is active      {"instance_id": "3525504452954c2eb7a2d4188c6cf01a", "environment": "e9698ca8a2cd336a55a948ea51903878d13869bd", "heartbeat": "2024-03-19T13:52:54.019+0100", "heartbeat_age": "1.116948947s"}

lippserd

All in all, it's an improvement that the reasons are logged, but I don't like that the debug log now receives a lot of messages and that at least the first info message "another instance is active" is lost. Either we don't touch the levels and the timer altogether or we make proper logging for the four scenarios:

takeover
handover
this is the active instance
another instance is active

There must be at least one "another instance is active" message in the info logs. If this appears every minute, the user can quickly see that everything is fine. Also, I doubt that something like "this is the active instance" in the debug logs is helpful for debugging. Sure, if nothing is going on, then this at least shows that Icinga DB is not deadlocked or something, but then a timer is sufficient for that. This would be my suggestion:

takeover: info, log reason
handover: warn, log reason
this is the active instance: debug, all x
another instance is active: info, once and then all x

I don't mind to increase the timer to 5 minutes or something.

oxzi · 2024-03-25T08:36:09Z

I guess a consensus is needed here. In an earlier review @yhabteab commented that minutely logs are too much noise. However, I think the "I am active" and the "Someone else is active" messages are being antagonistic and should be logged together, especially when a sysadmin wants to inspect what's going on. But I see your point, @lippserd, that it should be logged if someone else is being responsible.

Thus, I would alter the PR again to reintroduce the continuous logs with a bigger delay.

yhabteab · 2024-03-25T12:43:03Z

I guess a consensus is needed here. In an earlier review @yhabteab commented that minutely logs are too much noise.

My comments were referring to the "I am active" part though, just like Eric's suggestion (this is the active instance: debug, all x) and since you seem to firmly think that both parts should be kept in sync, both ended up being demoted to debug log.

julianbrost

By doing so, both the "Taking over" and "Handing over" log messages are enriched with a reason.

Well, not really in case of the taking over message given that there's only one call of signalTakeover() with the fixed reason "no other instance is active".

Most likely, that's due to the the fact that the current database query simply returns no rows in all takeover cases:

icingadb/pkg/icingadb/ha.go

Lines 261 to 262 in 2c46830

    
           query := h.db.Rebind("SELECT id, heartbeat FROM icingadb_instance "+ 
        
           	"WHERE environment_id = ? AND responsible = ? AND id <> ? AND heartbeat > ?") + selectLock

IIRC, one aspect we wanted to improve with the logging was to have distinct log messages based on whether there was a "graceful" takeover (i.e. the previously active instance removed its active state by itself) or a "forceful" takeover (i.e. the current instance took over because the previously active one seems to have died and its heartbeat expired).

To be able to log this, the checks on the responsible and heartbeat columns would have to be moved from the where clause to Go code. At first glance, this sounds feasible, so I think it's worth a try.

oxzi · 2024-03-27T12:33:46Z

Valid point, @julianbrost. I have altered the commit and pushed a new version.

However, I don't see why the responsible column should be moved to the Go code. In the current logic only one row might be responsible, as during the takeover process an Icinga DB instances marks itself as responsible and all other instances as not responsible anymore within the same database transaction.
If I'm missing something, please show it to me.

Furthermore, I have introduced a grace period for a forceful takeover in case of an expired heartbeat, as otherwise one node might perform its hand over slightly after the other node's take over, as the following log snippet shows:

icingadb-master2  | 2024-03-27T11:58:59.280Z    INFO    icingadb        Taking over     {"reason": "other instance's heartbeat has expired"}
icingadb-master1  | 2024-03-27T11:58:59.304Z    WARN    icingadb        Handing over    {"reason": "lost heartbeat"}

oxzi · 2024-03-27T13:01:57Z

As the old code indicated a takeover on the fact that no other instance is active, it will now additionally check if it is already being the active/responsible node. In this case, the takeover logic - which will be interrupted at a later point as the node is already responsible - can be skipped.

diff --git a/pkg/icingadb/ha.go b/pkg/icingadb/ha.go
index f3bcad1..5c62268 100644
--- a/pkg/icingadb/ha.go
+++ b/pkg/icingadb/ha.go
@@ -302,11 +302,11 @@ func (h *HA) realize(
                                }

                        case errors.Is(errQuery, sql.ErrNoRows):
-                               takeover = "no other instance is active"
                                fields := []any{
                                        zap.String("instance_id", h.instanceId.String()),
                                        zap.String("environment", envId.String())}
                                if !h.responsible {
+                                       takeover = "no other instance is active"
                                        h.logger.Debugw("Preparing to take over HA as no instance is active", fields...)
                                } else if h.responsible && shouldLogRoutineEvents {
                                        h.logger.Debugw("Continuing being the active instance", fields...)

julianbrost

However, I don't see why the responsible column should be moved to the Go code. In the current logic only one row might be responsible, as during the takeover process an Icinga DB instances marks itself as responsible and all other instances as not responsible anymore within the same database transaction. If I'm missing something, please show it to me.

Not really. I was just thinking that all other instances would have to be selected and considered. Indeed, to get the desired information, it's enough to only select if there's an instance that hasn't retracted responsible = 'y'.

pkg/icingadb/ha.go

The reason for a switch in the HA roles was not always directly clear. This change now introduces additional debug logging, indicating the reasoning for either taking over or handing over the HA responsibility. First, some logic was moved from the SQL query selecting active Icinga DB instances to Go code. This allowed distinguishing between no available responsible instances and responsible instances with an expired heartbeat. As the HA's peer timeout is logically bound to the Redis timeout, it will now reference this timeout with an additional grace timeout. Doing so eliminates a race between a handing over and a "forceful" take over. As the old code indicated a takeover on the fact that no other instance is active, it will now additionally check if it is already being the active/responsible node. In this case, the takeover logic - which will be interrupted at a later point as the node is already responsible - can be skipped. Next to the additional logging messages, both the takeover and handover channel are now transporting a string to communicate the reason instead of an empty struct{}. By doing so, both the "Taking over" and "Handing over" log messages are enriched with reason. This also required a change in the suppressed logging handling of the HA.realize method, which got its logging enabled through the shouldLog parameter. Now, there are both recurring events, which might be suppressed, as well as state changing events, which should be logged. Therefore, and because the logTicker's functionality was not clear to me on first glance, I renamed it to routineLogTicker. While dealing with the code, some function signature documentation were added, to ease both mine as well as the understanding of future readers. Additionally, the error handling of the SQL query selecting active Icinga DB instances was changed slightly to also handle wrapped sql.ErrNoRows errors. Closes #688.

pkg/icingadb/ha.go

oxzi added this to the 1.1.2 milestone Mar 12, 2024

oxzi requested a review from yhabteab March 12, 2024 12:10

cla-bot bot added the cla/signed label Mar 12, 2024

yhabteab reviewed Mar 14, 2024

View reviewed changes

pkg/icingadb/ha.go Show resolved Hide resolved

yhabteab reviewed Mar 14, 2024

View reviewed changes

pkg/icingadb/ha.go Outdated Show resolved Hide resolved

pkg/icingadb/ha.go Outdated Show resolved Hide resolved

oxzi force-pushed the ha-logging-i688 branch from 95d2a80 to 7d25da7 Compare March 14, 2024 11:24

oxzi requested a review from yhabteab March 14, 2024 11:39

oxzi force-pushed the ha-logging-i688 branch from 7d25da7 to 9993621 Compare March 14, 2024 12:12

yhabteab reviewed Mar 18, 2024

View reviewed changes

pkg/icingadb/ha.go Outdated Show resolved Hide resolved

pkg/icingadb/ha.go Outdated Show resolved Hide resolved

pkg/icingadb/ha.go Outdated Show resolved Hide resolved

pkg/icingadb/ha.go Show resolved Hide resolved

oxzi force-pushed the ha-logging-i688 branch from 9993621 to 4e4250e Compare March 19, 2024 11:20

oxzi requested a review from yhabteab March 19, 2024 11:21

yhabteab previously approved these changes Mar 19, 2024

View reviewed changes

oxzi dismissed yhabteab’s stale review via cd82b1b March 20, 2024 08:21

oxzi force-pushed the ha-logging-i688 branch from 4e4250e to cd82b1b Compare March 20, 2024 08:21

oxzi requested review from julianbrost and yhabteab March 20, 2024 08:22

yhabteab previously approved these changes Mar 20, 2024

View reviewed changes

lippserd self-requested a review March 24, 2024 20:18

lippserd requested changes Mar 24, 2024

View reviewed changes

oxzi dismissed yhabteab’s stale review via bae1ce2 March 25, 2024 15:48

oxzi force-pushed the ha-logging-i688 branch from cd82b1b to bae1ce2 Compare March 25, 2024 15:48

oxzi requested review from yhabteab and lippserd March 25, 2024 15:48

julianbrost reviewed Mar 26, 2024

View reviewed changes

oxzi force-pushed the ha-logging-i688 branch from bae1ce2 to 447a74b Compare March 27, 2024 12:27

oxzi requested a review from julianbrost March 27, 2024 12:33

oxzi marked this pull request as draft March 27, 2024 12:51

oxzi force-pushed the ha-logging-i688 branch from 447a74b to 57c3ac9 Compare March 27, 2024 13:00

oxzi marked this pull request as ready for review March 27, 2024 13:02

julianbrost reviewed Mar 27, 2024

View reviewed changes

pkg/icingadb/ha.go Outdated Show resolved Hide resolved

oxzi force-pushed the ha-logging-i688 branch from 57c3ac9 to 779afd1 Compare April 2, 2024 11:25

oxzi requested a review from julianbrost April 2, 2024 11:27

julianbrost approved these changes Apr 3, 2024

View reviewed changes

lippserd reviewed Apr 3, 2024

View reviewed changes

pkg/icingadb/ha.go Show resolved Hide resolved

lippserd self-requested a review April 3, 2024 09:54

lippserd approved these changes Apr 3, 2024

View reviewed changes

julianbrost merged commit 80abf2b into main Apr 4, 2024
31 checks passed

julianbrost deleted the ha-logging-i688 branch April 4, 2024 08:33

julianbrost mentioned this pull request Apr 11, 2024

Release Version 1.2.0 #734

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhance HA "Taking over", "Handing over" logging #692

Enhance HA "Taking over", "Handing over" logging #692

oxzi commented Mar 12, 2024

oxzi commented Mar 19, 2024 •

edited

Loading

yhabteab left a comment •

edited

Loading

lippserd left a comment

oxzi commented Mar 25, 2024

yhabteab commented Mar 25, 2024

julianbrost left a comment

oxzi commented Mar 27, 2024

oxzi commented Mar 27, 2024

julianbrost left a comment

	query := h.db.Rebind("SELECT id, heartbeat FROM icingadb_instance "+
	"WHERE environment_id = ? AND responsible = ? AND id <> ? AND heartbeat > ?") + selectLock

Enhance HA "Taking over", "Handing over" logging #692

Enhance HA "Taking over", "Handing over" logging #692

Conversation

oxzi commented Mar 12, 2024

oxzi commented Mar 19, 2024 • edited Loading

yhabteab left a comment • edited Loading

Choose a reason for hiding this comment

lippserd left a comment

Choose a reason for hiding this comment

oxzi commented Mar 25, 2024

yhabteab commented Mar 25, 2024

julianbrost left a comment

Choose a reason for hiding this comment

oxzi commented Mar 27, 2024

oxzi commented Mar 27, 2024

julianbrost left a comment

Choose a reason for hiding this comment

oxzi commented Mar 19, 2024 •

edited

Loading

yhabteab left a comment •

edited

Loading