Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HA: Limit max open database connections to 1 #828

Merged
merged 1 commit into from
Dec 17, 2024
Merged

Conversation

yhabteab
Copy link
Member

@yhabteab yhabteab commented Oct 23, 2024

Previously, the HA feature was allowed to open max_connections database connection in parallel to other Icinga DB components. Meaning, Icinga DB wasn't limited to the configured max_connections, but effectively to 2 * max_connections.

@yhabteab yhabteab added the bug Something isn't working label Oct 23, 2024
@yhabteab yhabteab added this to the 1.3.0 milestone Oct 23, 2024
@yhabteab yhabteab requested a review from lippserd October 23, 2024 12:24
@cla-bot cla-bot bot added the cla/signed label Oct 23, 2024
@yhabteab
Copy link
Member Author

The HA transaction is rolled back correctly with:

@yhabteab yhabteab changed the title fix(HA): Limit max open database connections to 1 HA: Limit max open database connections to 1 Oct 23, 2024
@yhabteab
Copy link
Member Author

yhabteab commented Oct 23, 2024

This is the proof that we actually leak database connections but only for as long as the context is not cancelled.

package main

import (
	"context"
	"github.com/icinga/icinga-go-library/logging"
	"github.com/icinga/icingadb/internal/command"
	"go.uber.org/zap"
	"log"
	"time"
)

func main() {
	cmd := command.New()

	db, err := cmd.Database(logging.NewLogger(zap.NewNop().Sugar(), time.Second))
	if err != nil {
		log.Fatal(err)
	}
	db.SetMaxOpenConns(3)

	go func() {
		time.Sleep(1 * time.Second)

		ctx, _ := context.WithDeadline(context.Background(), time.Now().Add(10*time.Second))
		_, _ = db.BeginTxx(ctx, nil)
		return
	}()
	go func() {
		time.Sleep(1 * time.Second)

		ctx, _ := context.WithDeadline(context.Background(), time.Now().Add(10*time.Second))
		_, _ = db.BeginTxx(ctx, nil)
		return
	}()

	time.Sleep(3 * time.Second)

	ctx, cancelFunc := context.WithDeadline(context.Background(), time.Now().Add(2*time.Second))
	defer cancelFunc()
	_ = db.PingContext(ctx)
	log.Printf("OpenConnections: %v", db.Stats().OpenConnections)
	log.Printf("InUse: %v", db.Stats().InUse)
}

~/Workspace/go/icingadb (set-ha-conn-limit ✗) go run ./cmd/decoder/decoder.go --config config.example.yml
2024/10/23 16:19:59 OpenConnections: 3
2024/10/23 16:19:59 InUse: 2

@yhabteab yhabteab requested review from lippserd and removed request for lippserd October 23, 2024 14:25
@@ -63,6 +63,12 @@ do not need any manual adjustments.
| max_rows_per_transaction | **Optional.** Maximum number of rows Icinga DB is allowed to `SELECT`,`DELETE`,`UPDATE` or `INSERT` in a single transaction. Defaults to `8192`. |
| wsrep_sync_wait | **Optional.** Enforce [Galera cluster](#galera-cluster) nodes to perform strict cluster-wide causality checks. Defaults to `7`. |

!!! info
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove this information as it is an implementation detail. Sure, users might be surprised if there are 17 database connections, but I doubt anyone will ever count them.

Instead, we could reduce the size of the main pool by one. But I wouldn't attach too much importance to that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the size of the main pool is initialized in the library, we don't want to adjust it just because of Icinga DB's HA feature. Thus, I removed the highlighted info box.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example configuration would also have to be adjusted. But please just remove it completely. It is irrelevant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed offline, I have completely removed the extra info.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to summarize our discussion: With this PR, we currently have only one connection that isn't tied to max_connections. However, there's a chance more could be added or changed, leaving it up to the user to determine and maintain the correct max_connections value if the goal is to truly limit Icinga DB database connections to that number. Therefore, if max_connections is intended to represent the actual maximum connections, it should be implemented accordingly, rather than documented that some connections are unaffected by that setting.

@oxzi
Copy link
Member

oxzi commented Dec 5, 2024

Additionally, this masked a serious bug in the HA#realize() method, where we start a new transaction after each retry without rolling back in case of an error, leading to connections not being released before exceeding the ctx deadline.

This bug is being addressed in #800. In spite of other changes over there, I am a bit critical about reducing the amount of database connections to a single one. Currently, there is still the possibility of a long-running COMMIT to the database which cannot be canceled, as described in this comment. Could such a scenario then result into a blocking state if there is just one connection available? I guess no, but my gut instinct is uncertain.

@lippserd
Copy link
Member

Additionally, this masked a serious bug in the HA#realize() method, where we start a new transaction after each retry without rolling back in case of an error, leading to connections not being released before exceeding the ctx deadline.

This bug is being addressed in #800. In spite of other changes over there, I am a bit critical about reducing the amount of database connections to a single one. Currently, there is still the possibility of a long-running COMMIT to the database which cannot be canceled, as described in this comment. Could such a scenario then result into a blocking state if there is just one connection available? I guess no, but my gut instinct is uncertain.

With a long-running commit and more than one database connection, successive transactions would be blocked during the SELECT. With only one database connection, the code waits at the begin transaction point. To me this is a valid change as we are not opening new database connections unnecessarily. But since PR #800 was merged, the PR and commit description should reflect that this change is only about that. @yhabteab Would you please adjust that accordingly?

@lippserd
Copy link
Member

@yhabteab With both MySQL and PostgreSQL integration tests failing, I think something is wrong 😆

@oxzi
Copy link
Member

oxzi commented Dec 17, 2024

I followed @yhabteab's statement that this PR breaks when used together with #800 and tried a bisect with this PR's change on each commit of my PR. Indeed, dd0ca8f broke it. Let me debug it further :)

$ git bisect start HEAD HEAD~6 --
Bisecting: 2 revisions left to test after this (roughly 2 steps)
[f8819208ce4856a83d627d703f749e040c4d2888] HA: Deferred SQL Transaction Rollback


$ cat 828.sh
#!/bin/sh

set -eux

git cherry-pick --no-commit set-ha-conn-limit

CGO_ENABLED=0 go build ./cmd/icingadb

pushd tests
go test -o ../icingadb-test -c .
popd

set +e
export ICINGADB_TESTS_DATABASE_TYPE="mysql"
export ICINGA_TESTING_ICINGADB_BINARY="$(pwd)/icingadb"
export ICINGA_TESTING_ICINGADB_SCHEMA_MYSQL="$(pwd)/schema/mysql/schema.sql"
./icingadb-test -icingatesting.debuglog debug.log -test.v -test.run TestCleanupAndRetention
RET="$?"
set -e

git restore --staged cmd/icingadb/main.go
git restore cmd/icingadb/main.go

exit "$RET"


$ git bisect run ./828.sh
[ . . . ]
dd0ca8fb072ae8713b7e3f32df5b6417aa0fb7b6 is the first bad commit
commit dd0ca8fb072ae8713b7e3f32df5b6417aa0fb7b6
Author: Alvar Penning <[email protected]>
Date:   Fri Oct 25 10:39:05 2024 +0200

    HA: Insert environment within retryable function

    The HA.insertEnvironment() method was inlined into the retryable
    function to use the deadlined context. Otherwise, this might block
    afterwards, as it was used within HA.realize(), but without the passed
    context.

 pkg/icingadb/ha.go | 27 +++++++--------------------
 1 file changed, 7 insertions(+), 20 deletions(-)
bisect found first bad commit

oxzi added a commit that referenced this pull request Dec 17, 2024
During #800, the commit dd0ca8f inlined
the HA.insertEnvironment method into the retryable realization function.
However, while doing so, I forgot to change the query execution context
from h.db to tx. This resulted in an error when being used together with
a single database connection, as introduced in #828.

Co-Authored-By: Yonas Habteab <[email protected]>
@oxzi
Copy link
Member

oxzi commented Dec 17, 2024

After a bit of debugging together with @yhabteab, we have identified a bug and drafted #878.

@oxzi
Copy link
Member

oxzi commented Dec 17, 2024

@yhabteab: Could you please rebase again. Thanks :)

Previously, the HA feature was allowed to open `max_connections`
database connection in parallel to other Icinga DB components. Meaning,
Icinga DB wasn't limited to the configured `max_connections`, but
effectively to `2 * max_connections`.
@lippserd
Copy link
Member

@yhabteab: Could you please rebase again. Thanks :)

Already did.

@lippserd lippserd merged commit cff2ba3 into main Dec 17, 2024
32 checks passed
@lippserd lippserd deleted the set-ha-conn-limit branch December 17, 2024 21:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cla/signed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants