Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Persistence enhancements - including adding PostgreSQL support #85

Merged
merged 49 commits into from
Jun 30, 2023

Conversation

peterbroadhurst
Copy link
Contributor

@peterbroadhurst peterbroadhurst commented Jun 6, 2023

Work in progress - release should be marked v1.3 when complete/merged.

Depends on hyperledger/firefly-common#71

Potential migration impacts

  • When PSQL is enabled, a (rather hard to work out why) data structure called transactionHeaders on the managed transaction that repeats some of the base fields, but only populates others, has been removed.
    • With LevelDB this data structure is populated for backwards compatibility with any apps in those environments
  • transactions.nonceStateTimeout previously deprecated, re-instated
    • transactions.handler.simple.nonceStateTimeout removed
  • The receipt section is only populated on /transactions/{transactionId} not on the collection query
  • The confirmations section is only populated on /transactions/{transactionId} not on the collection query
  • The history section is only populated on /transactions/{transactionId} not on the collection query
  • The summary section of a transaction is marked deprecated, and only populated for LevelDB - not implemented at all for PSQL
  • Confirmations are dispatched as they are detected, rather than right at the end
    • This means for transactions with a lot of confirmations (on public chains) you'll be able to track progress easier

Persistence changes

Making as pragmatic as possible change to the persistence mechanism, to support PostgreSQL as an alternative to LevelDB when persisting transactions, checkpoints, subscriptions, and listeners.

Aiming to make the impact on policy engine extensions as minimal as possible, but there are some changes shaping up:

  • TXHistory moving into persistence, as implementation will be fundamentally different for PSQL to LevelDB
    • Updates to substatus history & actions are discrete operations
    • Can now return error and only take the ID of the ManagedTX object (not the full object)
    • History limits only likely to apply to LevelDB (as expensive to implement in SQL)
  • Reconciling the ManagedTX object into core object fields, and related objects that can be merged in during a GET operation for backwards compatibility
    • History / HistorySummary were the big things
    • TransactionHeaders ended up a bit confused, so I've attempted to reconcile it (more info owed by me here)
    • Receipt - TBD if this stays separate
      - LevelDB continues to be a single fat JSON payload
  • Rather than a single WriteTransaction function with a new boolean, moved to InsertTransaction and UpdateTransaction - with clearer update semantics
    - This avoid re-writing huge JSON blobs to SQL DBs when simply tweaking a status field
  • Updating Confirmations notification system:
    - Notifying for all confirmations as we detect them, rather than only once the full set is completed (UX improvement)
    - Providing a more rich confirmations notification that can distinguish incremental adding of confirmations, from a new fork

Other changes

  • Parallel receipt query workers.
    • In testing around this persistence change (which uses parallel writers), we found that there was a bottleneck on the confirmation processor in checking for receipts. This was done sequentially on the main routine. So this PR adds a parallel worker engine for that, which uses a list (deliberately not a channel) of stale receipts to work through.

Signed-off-by: Peter Broadhurst <[email protected]>
Signed-off-by: Peter Broadhurst <[email protected]>
Signed-off-by: Peter Broadhurst <[email protected]>
@codecov-commenter
Copy link

codecov-commenter commented Jun 6, 2023

Codecov Report

Merging #85 (de4c9b7) into main (b14c2cf) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##             main      #85     +/-   ##
=========================================
  Coverage   99.97%   99.97%             
=========================================
  Files          61       75     +14     
  Lines        3504     4914   +1410     
=========================================
+ Hits         3503     4913   +1410     
  Misses          1        1             
Impacted Files Coverage Δ
pkg/txhandler/simple/config.go 100.00% <ø> (ø)
internal/confirmations/confirmations.go 100.00% <100.00%> (ø)
internal/confirmations/receipt_checker.go 100.00% <100.00%> (ø)
internal/events/eventstream.go 100.00% <100.00%> (ø)
...nternal/persistence/leveldb/leveldb_persistence.go 100.00% <100.00%> (ø)
internal/persistence/leveldb/nonces.go 100.00% <100.00%> (ø)
internal/persistence/persistence.go 100.00% <100.00%> (ø)
internal/persistence/postgres/checkpoints.go 100.00% <100.00%> (ø)
internal/persistence/postgres/confirmations.go 100.00% <100.00%> (ø)
internal/persistence/postgres/eventstreams.go 100.00% <100.00%> (ø)
... and 26 more

Signed-off-by: Peter Broadhurst <[email protected]>
Signed-off-by: Peter Broadhurst <[email protected]>
Signed-off-by: Peter Broadhurst <[email protected]>
Signed-off-by: Peter Broadhurst <[email protected]>
Signed-off-by: Peter Broadhurst <[email protected]>
@peterbroadhurst
Copy link
Contributor Author

peterbroadhurst commented Jun 22, 2023

There are two parts to the 409 processing:

  1. The issue encountered in Postgres concurrency:true firefly#1342 where the return code ends up being 500 for an idempotency check in some edge cases. This is because we don't do the 409 ID check early enough in processing.

    • This is not unique to PSQL - can occur with LevelDB
    • There is a cost to this check, but the behavior is important. It is done on the API routine, rather than the writing worker - so it's a dirty read check against the DB. Intent is this is inexpensive, but that does need verification.
    • Addressed in 346cd5c
  2. The 500 rather than 409 in the case we get all the way to the DB and need the protection there. This is more of an edge case, but it is the final important place for the check.

    • This is unique to PSQL, as the check already exists with LevelDB (using a mutex)
    • For PSQL added in 8ba70ee
    • The code accepts a very small edge case where we might still return 500, that is not expected in normal operation

@peterbroadhurst peterbroadhurst marked this pull request as ready for review June 25, 2023 21:34
.golangci.yml Outdated
@@ -11,6 +11,11 @@ linters-settings:
gosec:
excludes:
- G402
revive:
rules:
- name: unused-parameter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a huge deal, but why disable this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me go back to this - in a recent update (I think coupled to 1.19 maybe) unused parameters on all functions have to be renamed down to _. I have been updating all the repos where I find it. I can't remember why I didn't just do that here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I put it back, and it only found one remaining issue. So I probably did it temporarily and forgot to remove it - thanks for the catch 👍

@@ -44,6 +45,7 @@ var (
ConfigConfirmationsNotificationsQueueLength = ffc("config.confirmations.notificationQueueLength", "Internal queue length for notifying the confirmations manager of new transactions/events", i18n.IntType)
ConfigConfirmationsRequired = ffc("config.confirmations.required", "Number of confirmations required to consider a transaction/event final", i18n.IntType)
ConfigConfirmationsStaleReceiptTimeout = ffc("config.confirmations.staleReceiptTimeout", "Duration after which to force a receipt check for a pending transaction", i18n.TimeDurationType)
ConfigConfirmationsReceiptWorkers = ffc("config.confirmations.receiptWorkers", "Number of workers to use to query in parallel for receipts", i18n.IntType)

ConfigTransactionsMaxHistoryCount = ffc("config.transactions.maxHistoryCount", "The number of historical status updates to retain in the operation", i18n.IntType)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just below here is a DeprecatedConfigTransactionsNonceStateTimeout, which I believe has been un-deprecated, and further down is a ConfigTXHandlerNonceStateTimeout among the "simple" handler configs, which I think has been deprecated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for leaving you to work this out 🤦 - yes you're 100% right

@@ -113,7 +127,7 @@ func setDefaults() {

// Deprecated default values for transaction handling configurations
viper.SetDefault(string(DeprecatedTransactionsMaxInFlight), 100)
viper.SetDefault(string(DeprecatedTransactionsNonceStateTimeout), "1h")
viper.SetDefault(string(TransactionsNonceStateTimeout), "1h")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to move this away from the comment saying these configs are deprecated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This filename has a typo

MsgSQLScanFailed = ffe("FF21082", "Failed to read value from DB (%T)")
MsgShuttingDown = ffe("FF21083", "Connector shutdown initiated", 500)
MsgTransactionPersistenceError = ffe("FF21084", "Failed to persist transaction data", 500)
MsgOpNotSupportedWithoutRichQuery = ffe("FF21085", "Not supported: The connector must be configured with a rich query database to support this operation", 405)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this is strictly a case for HTTP 405 (that's usually when you POST something that only supports GET, or similar). HTTP 501 might be more correct?

features := dbsql.DefaultSQLProviderFeatures()
features.PlaceholderFormat = sq.Dollar
features.UseILIKE = false // slower than lower()
features.AcquireLock = func(lockName string) string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't look like this functionality is ever used, so it's questionable if we should require implementing it...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wider pain point here, is that the PSQL handler is implemented in multiple repos - copy/paste.
This was because we didn't want to add the whole postgres library import to the base common functionality. Might be worth revisiting.

For this PR, the code is a carbon copy of what's in other repos.

Signed-off-by: Peter Broadhurst <[email protected]>
Signed-off-by: Peter Broadhurst <[email protected]>
Signed-off-by: Peter Broadhurst <[email protected]>
Signed-off-by: Peter Broadhurst <[email protected]>
Copy link
Contributor

@awrichar awrichar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ended up being multiple large items coupled together (adding a new database choice along with a pretty comprehensive rewrite of the transaction and confirmation handling). It seems like a major step forward on many fronts, but it's tough to wrap my head around all the changes at once.

That said, everything seems architecturally sound, and I couldn't see any obvious problems. I think this will just need a lot of exercise and testing to ensure it functions as intended.

Signed-off-by: Peter Broadhurst <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants