Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance HA "Taking over", "Handing over" logging #692

Merged
merged 1 commit into from
Apr 4, 2024

Conversation

oxzi
Copy link
Member

@oxzi oxzi commented Mar 12, 2024

The reason for a switch in the HA roles was not always directly clear. This change now introduces additional logging, indicating the reasoning for either taking over or handing over the HA responsibility.

Next to additional logging messages, both the takeover and handover channel are now transporting a string to communicate the reason instead of an empty struct{}. By doing so, both the "Taking over" and "Handing over" log messages are enriched with a reason.

This also required a change in the suppressed logging handling of the HA.realize method, which got its logging enabled through the shouldLog parameter. Now, there are both recurring events, which might be suppressed, as well as state changing events, which should be logged. Therefore, and because the logTicker's functionality was not clear to me on first glance, I renamed it to routineLogTicker.

While dealing with the code, some function signature documentation were added, to ease both mine as well as the understanding of future readers.

Closes #688.

@oxzi oxzi added this to the 1.1.2 milestone Mar 12, 2024
@oxzi oxzi requested a review from yhabteab March 12, 2024 12:10
@cla-bot cla-bot bot added the cla/signed label Mar 12, 2024
pkg/icingadb/ha.go Outdated Show resolved Hide resolved
pkg/icingadb/ha.go Outdated Show resolved Hide resolved
@oxzi oxzi force-pushed the ha-logging-i688 branch from 95d2a80 to 7d25da7 Compare March 14, 2024 11:24
@oxzi oxzi requested a review from yhabteab March 14, 2024 11:39
@oxzi oxzi force-pushed the ha-logging-i688 branch from 7d25da7 to 9993621 Compare March 14, 2024 12:12
pkg/icingadb/ha.go Outdated Show resolved Hide resolved
pkg/icingadb/ha.go Outdated Show resolved Hide resolved
pkg/icingadb/ha.go Outdated Show resolved Hide resolved
pkg/icingadb/ha.go Show resolved Hide resolved
@oxzi oxzi force-pushed the ha-logging-i688 branch from 9993621 to 4e4250e Compare March 19, 2024 11:20
@oxzi oxzi requested a review from yhabteab March 19, 2024 11:21
@oxzi
Copy link
Member Author

oxzi commented Mar 19, 2024

Additionally to addressing @yhabteab's comments, the error handling of the SQL query selecting Icinga DB instances was changed slightly to also handle wrapped sql.ErrNoRows errors.

case errors.Is(errQuery, sql.ErrNoRows):

yhabteab
yhabteab previously approved these changes Mar 19, 2024
Copy link
Member

@yhabteab yhabteab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LFTM! ↓

2024-03-19T14:31:26.547+0100    WARN    main    Handing over    {"reason": "context deadline exceeded"}
1c1
< 2024-03-19T12:36:45.209Z        INFO    high-availability       Another instance is active      {"instance_id": "7322e2b8658e446094c79c26b241af94", "environment": "e9698ca8a2cd336a55a948ea51903878d13869bd", "heartbeat": 1710851800743, "heartbeat_age": "4.466613808s"}
---
> 2024-03-19T13:52:55.135+0100    DEBUG   high-availability       Another instance is active      {"instance_id": "3525504452954c2eb7a2d4188c6cf01a", "environment": "e9698ca8a2cd336a55a948ea51903878d13869bd", "heartbeat": "2024-03-19T13:52:54.019+0100", "heartbeat_age": "1.116948947s"}

yhabteab
yhabteab previously approved these changes Mar 20, 2024
@lippserd lippserd self-requested a review March 24, 2024 20:18
Copy link
Member

@lippserd lippserd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All in all, it's an improvement that the reasons are logged, but I don't like that the debug log now receives a lot of messages and that at least the first info message "another instance is active" is lost. Either we don't touch the levels and the timer altogether or we make proper logging for the four scenarios:

  • takeover
  • handover
  • this is the active instance
  • another instance is active

There must be at least one "another instance is active" message in the info logs. If this appears every minute, the user can quickly see that everything is fine. Also, I doubt that something like "this is the active instance" in the debug logs is helpful for debugging. Sure, if nothing is going on, then this at least shows that Icinga DB is not deadlocked or something, but then a timer is sufficient for that. This would be my suggestion:

  • takeover: info, log reason
  • handover: warn, log reason
  • this is the active instance: debug, all x
  • another instance is active: info, once and then all x

I don't mind to increase the timer to 5 minutes or something.

@oxzi
Copy link
Member Author

oxzi commented Mar 25, 2024

I guess a consensus is needed here. In an earlier review @yhabteab commented that minutely logs are too much noise. However, I think the "I am active" and the "Someone else is active" messages are being antagonistic and should be logged together, especially when a sysadmin wants to inspect what's going on. But I see your point, @lippserd, that it should be logged if someone else is being responsible.

Thus, I would alter the PR again to reintroduce the continuous logs with a bigger delay.

@yhabteab
Copy link
Member

I guess a consensus is needed here. In an earlier review @yhabteab commented that minutely logs are too much noise.

My comments were referring to the "I am active" part though, just like Eric's suggestion (this is the active instance: debug, all x) and since you seem to firmly think that both parts should be kept in sync, both ended up being demoted to debug log.

Copy link
Contributor

@julianbrost julianbrost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By doing so, both the "Taking over" and "Handing over" log messages are enriched with a reason.

Well, not really in case of the taking over message given that there's only one call of signalTakeover() with the fixed reason "no other instance is active".

Most likely, that's due to the the fact that the current database query simply returns no rows in all takeover cases:

icingadb/pkg/icingadb/ha.go

Lines 261 to 262 in 2c46830

query := h.db.Rebind("SELECT id, heartbeat FROM icingadb_instance "+
"WHERE environment_id = ? AND responsible = ? AND id <> ? AND heartbeat > ?") + selectLock

IIRC, one aspect we wanted to improve with the logging was to have distinct log messages based on whether there was a "graceful" takeover (i.e. the previously active instance removed its active state by itself) or a "forceful" takeover (i.e. the current instance took over because the previously active one seems to have died and its heartbeat expired).

To be able to log this, the checks on the responsible and heartbeat columns would have to be moved from the where clause to Go code. At first glance, this sounds feasible, so I think it's worth a try.

@oxzi oxzi force-pushed the ha-logging-i688 branch from bae1ce2 to 447a74b Compare March 27, 2024 12:27
@oxzi
Copy link
Member Author

oxzi commented Mar 27, 2024

Valid point, @julianbrost. I have altered the commit and pushed a new version.

However, I don't see why the responsible column should be moved to the Go code. In the current logic only one row might be responsible, as during the takeover process an Icinga DB instances marks itself as responsible and all other instances as not responsible anymore within the same database transaction.
If I'm missing something, please show it to me.

Furthermore, I have introduced a grace period for a forceful takeover in case of an expired heartbeat, as otherwise one node might perform its hand over slightly after the other node's take over, as the following log snippet shows:

icingadb-master2  | 2024-03-27T11:58:59.280Z    INFO    icingadb        Taking over     {"reason": "other instance's heartbeat has expired"}
icingadb-master1  | 2024-03-27T11:58:59.304Z    WARN    icingadb        Handing over    {"reason": "lost heartbeat"}

@oxzi oxzi requested a review from julianbrost March 27, 2024 12:33
@oxzi oxzi marked this pull request as draft March 27, 2024 12:51
@oxzi oxzi force-pushed the ha-logging-i688 branch from 447a74b to 57c3ac9 Compare March 27, 2024 13:00
@oxzi
Copy link
Member Author

oxzi commented Mar 27, 2024

As the old code indicated a takeover on the fact that no other instance is active, it will now additionally check if it is already being the active/responsible node. In this case, the takeover logic - which will be interrupted at a later point as the node is already responsible - can be skipped.

diff --git a/pkg/icingadb/ha.go b/pkg/icingadb/ha.go
index f3bcad1..5c62268 100644
--- a/pkg/icingadb/ha.go
+++ b/pkg/icingadb/ha.go
@@ -302,11 +302,11 @@ func (h *HA) realize(
                                }

                        case errors.Is(errQuery, sql.ErrNoRows):
-                               takeover = "no other instance is active"
                                fields := []any{
                                        zap.String("instance_id", h.instanceId.String()),
                                        zap.String("environment", envId.String())}
                                if !h.responsible {
+                                       takeover = "no other instance is active"
                                        h.logger.Debugw("Preparing to take over HA as no instance is active", fields...)
                                } else if h.responsible && shouldLogRoutineEvents {
                                        h.logger.Debugw("Continuing being the active instance", fields...)

@oxzi oxzi marked this pull request as ready for review March 27, 2024 13:02
Copy link
Contributor

@julianbrost julianbrost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I don't see why the responsible column should be moved to the Go code. In the current logic only one row might be responsible, as during the takeover process an Icinga DB instances marks itself as responsible and all other instances as not responsible anymore within the same database transaction. If I'm missing something, please show it to me.

Not really. I was just thinking that all other instances would have to be selected and considered. Indeed, to get the desired information, it's enough to only select if there's an instance that hasn't retracted responsible = 'y'.

pkg/icingadb/ha.go Outdated Show resolved Hide resolved
The reason for a switch in the HA roles was not always directly clear.
This change now introduces additional debug logging, indicating the
reasoning for either taking over or handing over the HA responsibility.

First, some logic was moved from the SQL query selecting active Icinga
DB instances to Go code. This allowed distinguishing between no
available responsible instances and responsible instances with an
expired heartbeat.

As the HA's peer timeout is logically bound to the Redis timeout, it
will now reference this timeout with an additional grace timeout. Doing
so eliminates a race between a handing over and a "forceful" take over.

As the old code indicated a takeover on the fact that no other instance
is active, it will now additionally check if it is already being the
active/responsible node. In this case, the takeover logic - which will
be interrupted at a later point as the node is already responsible - can
be skipped.

Next to the additional logging messages, both the takeover and handover
channel are now transporting a string to communicate the reason instead
of an empty struct{}. By doing so, both the "Taking over" and "Handing
over" log messages are enriched with reason.

This also required a change in the suppressed logging handling of the
HA.realize method, which got its logging enabled through the shouldLog
parameter. Now, there are both recurring events, which might be
suppressed, as well as state changing events, which should be logged.
Therefore, and because the logTicker's functionality was not clear to me
on first glance, I renamed it to routineLogTicker.

While dealing with the code, some function signature documentation were
added, to ease both mine as well as the understanding of future readers.

Additionally, the error handling of the SQL query selecting active
Icinga DB instances was changed slightly to also handle wrapped
sql.ErrNoRows errors.

Closes #688.
@oxzi oxzi force-pushed the ha-logging-i688 branch from 57c3ac9 to 779afd1 Compare April 2, 2024 11:25
@oxzi oxzi requested a review from julianbrost April 2, 2024 11:27
@lippserd lippserd self-requested a review April 3, 2024 09:54
@julianbrost julianbrost merged commit 80abf2b into main Apr 4, 2024
31 checks passed
@julianbrost julianbrost deleted the ha-logging-i688 branch April 4, 2024 08:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enhance HA take/hand over log messages
4 participants