
fix(discovery): ensure discovery node parent relationship, correct deletion flow #407

Closed
wants to merge 27 commits

Conversation

andrewazores
Member

@andrewazores commented Apr 24, 2024

Welcome to Cryostat3! 👋

Before contributing, make sure you have:

  • Read the contributing guidelines
  • Linked a relevant issue which this PR resolves
  • Linked any other relevant issues, PRs, or documentation, if any
  • Resolved all conflicts, if any
  • Rebased your branch PR on top of the latest upstream main branch
  • Attached at least one of the following labels to the PR: [chore, ci, docs, feat, fix, test]
  • Signed all commits using a GPG signature

To recreate commits with a GPG signature: git fetch upstream && git rebase --force --gpg-sign upstream/main


Fixes: #406

Description of the change:

  1. Ensures correct parent<->child relationships between discovery nodes in the database.
  2. Cleans up discovery plugin credential handling. Credentials were always required in practice, but the model allowed them to be optional/nullable everywhere. Credentials are now checked more rigorously at plugin registration time and are associated with the plugin directly in the model via a foreign key ID, instead of being referenced only as embedded userinfo in the callback URL.
  3. When a discovery plugin requests to register, check whether any existing plugin registration has the same callback URL (ignoring any userinfo credential reference). If one exists, it is probably the same instance trying to re-register itself after an unclean shutdown. Verify this by attempting to ping the existing registration using the existing credentials: if the attempt fails, it must be because the old credentials are no longer valid (which makes sense for a new instance); if it somehow succeeds, the old instance is still reachable at the old/same URL with the old credentials, so reject the attempt. (A sketch of this flow follows the list.)
  4. Removes some explicit model relationship deletions and allows the database to cascade them instead. The cascading operations were already defined anyway.
  5. Ensures plugin ping tasks run in a transaction context.
  6. Ensures k8s Informers are only created/started if that discovery mechanism is actually enabled and available.
  7. Upgrades the smoketest db-viewer utility and increases the server smoketest log level.
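
To make point 3 concrete, here is a rough sketch of the re-registration check. All names below (PluginRecord, findByCallback, ping, and so on) are hypothetical illustrations, not the actual Cryostat 3 types or methods:

// Sketch only: illustrates the point-3 flow with hypothetical names.
import java.net.URI;
import java.util.Optional;

class RegistrationCheckSketch {

    void register(URI requestedCallback) {
        // Compare callbacks with any embedded user:pass userinfo stripped,
        // since credentials are now stored separately and linked by foreign key.
        URI normalized = stripUserInfo(requestedCallback);
        Optional<PluginRecord> existing = findByCallback(normalized);

        if (existing.isPresent()) {
            // Probably the same instance re-registering after an unclean shutdown.
            // Verify by pinging the old registration with its stored credentials.
            if (ping(existing.get())) {
                // The old instance is still reachable at the same URL with valid
                // credentials, so this is a genuine duplicate: reject the attempt.
                throw new IllegalStateException("callback already registered");
            }
            // The old credentials no longer work, so the previous instance is gone
            // and its registration is stale; it can be superseded.
            delete(existing.get());
        }
        // ... proceed with normal registration ...
    }

    // Hypothetical helpers and record, elided.
    record PluginRecord(URI callback, long credentialId) {}
    URI stripUserInfo(URI u) { return u; /* ... */ }
    Optional<PluginRecord> findByCallback(URI u) { return Optional.empty(); /* ... */ }
    boolean ping(PluginRecord p) { return false; /* ... */ }
    void delete(PluginRecord p) { /* ... */ }
}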

Motivation for the change:

Points 1-3 above are the main change and address the "rapid registration loop" bug that the server and Agent fall into when the Agent exits uncleanly, comes back up almost immediately, and tries to register itself again. The registration would be allowed, since the Agent's callback URL appeared "new" (differing only by the userinfo credentials reference), but the Agent would then not be allowed to publish its list of target nodes, because they duplicated what was already in the database. The Agent would detect this failure, drop back to registering again, then publish, fail, and loop.

Point 4 is just some code cleanup that should not have any effect in practice.
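
For point 4, the underlying idea is that once the relationship itself declares cascading removal, explicitly deleting children before a parent is redundant. A minimal JPA-style sketch with hypothetical entity names (not the actual Cryostat model):

// Sketch only: cascade-on-delete declared on the relationship.
import jakarta.persistence.CascadeType;
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.Id;
import jakarta.persistence.OneToMany;
import java.util.List;

@Entity
class ParentNode {
    @Id
    @GeneratedValue
    Long id;

    // Deleting the parent (or removing a child from this collection) also
    // removes the child rows -- no explicit child deletions are required.
    @OneToMany(cascade = CascadeType.ALL, orphanRemoval = true)
    List<ChildNode> children;
}

@Entity
class ChildNode {
    @Id
    @GeneratedValue
    Long id;
}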

Point 5 fixes an occasional bug where, when the discovery ping period for a plugin fired and the ping failed, the ping task was not able to delete the plugin.
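
Since point 5 is about transaction context: a task fired from a scheduler or executor thread has no transaction by default, so database writes such as deleting the plugin cannot take effect there. A minimal sketch of the kind of fix involved, assuming Quarkus's programmatic QuarkusTransaction API (the surrounding names are hypothetical):

// Sketch only: run the ping task's work inside a new transaction.
import io.quarkus.narayana.jta.QuarkusTransaction;

class PluginPingTask implements Runnable {
    private final long pluginId;

    PluginPingTask(long pluginId) {
        this.pluginId = pluginId;
    }

    @Override
    public void run() {
        // Without an active transaction, the delete below could not take effect.
        QuarkusTransaction.requiringNew().run(() -> {
            if (!ping(pluginId)) {
                removePlugin(pluginId); // database cascades clean up the subtree
            }
        });
    }

    // Hypothetical helpers, elided.
    boolean ping(long id) { return false; /* ... */ }
    void removePlugin(long id) { /* ... */ }
}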

Point 6 avoids exceptions about k8s Informer initialization when running in a non-k8s environment.
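
For point 6, the guard amounts to only touching the Kubernetes API when that discovery mechanism is both enabled and usable. A rough sketch, assuming the fabric8 client (other names are hypothetical):

// Sketch only: skip Informer setup when disabled or not running in a cluster.
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

class KubeDiscoverySketch {
    private KubernetesClient client;

    void start(boolean kubeDiscoveryEnabled) {
        if (!kubeDiscoveryEnabled || !inCluster()) {
            // Not in k8s (or the mechanism is disabled): do not create Informers,
            // avoiding initialization exceptions in non-k8s environments.
            return;
        }
        client = new KubernetesClientBuilder().build();
        // ... create and start the endpoint Informers with this client ...
    }

    private boolean inCluster() {
        // Common heuristic: the serviceaccount token only exists inside a pod.
        return new java.io.File(
                "/var/run/secrets/kubernetes.io/serviceaccount/token").exists();
    }
}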

How to manually test:

  1. Run CRYOSTAT_IMAGE=quay.io... bash smoketest.bash...
  2. ...

@andrewazores
Member Author

/build_test

Workflow started at 4/25/2024, 11:16:05 AM. View Actions Run.

@andrewazores
Member Author

@ebaron this should fix the OOMKilled loop we were seeing in k8s/OpenShift. There is still some occasional weirdness with the Topology view, or sometimes with Agent deregistration on clean exit. I'm trying to narrow down what's happening there and will file a separate bug and fix once I have it clearer.

No GraphQL schema changes detected.

OpenAPI schema change detected:

diff --git a/schema/openapi.yaml b/schema/openapi.yaml
index 87a077f..e426232 100644
--- a/schema/openapi.yaml
+++ b/schema/openapi.yaml
@@ -47,20 +47,38 @@ components:
       properties:
         connectUrl:
           type: string
         jvmId:
           type: string
         recordings:
           items:
             $ref: '#/components/schemas/ArchivedRecording'
           type: array
       type: object
+    Credential:
+      properties:
+        id:
+          format: int64
+          type: integer
+        matchExpression:
+          $ref: '#/components/schemas/MatchExpression'
+        password:
+          pattern: \S
+          type: string
+        username:
+          pattern: \S
+          type: string
+      required:
+        - matchExpression
+        - username
+        - password
+      type: object
     Data:
       type: object
     DiscoveryNode:
       properties:
         children:
           items:
             $ref: '#/components/schemas/DiscoveryNode'
           type: array
         id:
           format: int64
@@ -82,20 +100,22 @@ components:
         - nodeType
         - labels
       type: object
     DiscoveryPlugin:
       properties:
         builtin:
           type: boolean
         callback:
           format: uri
           type: string
+        credential:
+          $ref: '#/components/schemas/Credential'
         id:
           $ref: '#/components/schemas/UUID'
         realm:
           $ref: '#/components/schemas/DiscoveryNode'
       required:
         - id
         - realm
       type: object
     Evaluation:
       properties:

@andrewazores
Member Author

/build_test

Workflow started at 4/25/2024, 11:22:04 AM. View Actions Run.

CI build and push: At least one test failed ❌ (JDK17)
https://github.com/cryostatio/cryostat3/actions/runs/8835144153

No GraphQL schema changes detected.

No OpenAPI schema changes detected.

CI build and push: At least one test failed ❌ (JDK17)
https://github.com/cryostatio/cryostat3/actions/runs/8835236257

@andrewazores
Member Author

/build_test

Workflow started at 4/25/2024, 11:39:21 AM. View Actions Run.

No OpenAPI schema changes detected.

No GraphQL schema changes detected.

CI build and push: All tests pass ✅ (JDK17)
https://github.com/cryostatio/cryostat3/actions/runs/8835482572

@ebaron
Member

ebaron commented Apr 25, 2024

> @ebaron this should fix the OOMKilled loop we were seeing in k8s/OpenShift. There is still some occasional weirdness with the Topology view, or sometimes with Agent deregistration on clean exit. I'm trying to narrow down what's happening there and will file a separate bug and fix once I have it clearer.

It seems better in OpenShift. I'm still seeing some repeated failed attempts to connect but that might be due to the test application getting killed or failing to respond due to memory pressure:

2024-04-25 18:55:56,876 INFO  [io.cry.tar.TargetConnectionManager] (executor-thread-45) Opening connection to service:jmx:rmi:///jndi/rmi://quarkus-test-agent-96b97cdb4-ssb6s:9097/jmxrmi
2024-04-25 18:55:56,887 WARN  [io.cry.cor.net.JFRConnectionToolkit] (executor-thread-45) connection attempt failed.

2024-04-25 18:55:56,887 WARN  [io.cry.tar.TargetConnectionManager] (executor-thread-45) java.net.UnknownHostException: quarkus-test-agent-96b97cdb4-ssb6s
2024-04-25 18:55:56,887 INFO  [io.cry.tar.TargetConnectionManager] (executor-thread-40) Removing cached connection for service:jmx:rmi:///jndi/rmi://quarkus-test-agent-96b97cdb4-ssb6s:9097/jmxrmi: EXPLICIT
2024-04-25 18:55:56,892 WARN  [io.cry.cor.net.JFRConnectionToolkit] (executor-thread-40) connection attempt failed.

2024-04-25 18:55:56,892 WARN  [io.cry.tar.TargetConnectionManager] (executor-thread-40) java.net.UnknownHostException: quarkus-test-agent-96b97cdb4-ssb6s
2024-04-25 18:55:56,992 INFO  [io.qua.htt.access-log] (executor-thread-40) 10.129.0.2 - - [25/Apr/2024:18:55:56 +0000] "GET /health/liveness HTTP/1.1" 204 -
2024-04-25 18:55:57,822 WARN  [io.cry.cor.net.JFRConnectionToolkit] (executor-thread-40) connection attempt failed.

2024-04-25 18:55:57,822 WARN  [io.cry.tar.TargetConnectionManager] (executor-thread-40) java.net.UnknownHostException: quarkus-test-agent-96b97cdb4-ssb6s
2024-04-25 18:55:57,953 WARN  [io.cry.cor.net.JFRConnectionToolkit] (executor-thread-40) connection attempt failed.

2024-04-25 18:55:57,953 WARN  [io.cry.tar.TargetConnectionManager] (executor-thread-40) java.net.UnknownHostException: quarkus-test-agent-96b97cdb4-ssb6s
2024-04-25 18:56:01,217 WARN  [io.cry.cor.net.JFRConnectionToolkit] (executor-thread-40) connection attempt failed.

2024-04-25 18:56:01,218 WARN  [io.cry.tar.TargetConnectionManager] (executor-thread-40) java.net.UnknownHostException: quarkus-test-agent-96b97cdb4-ssb6s
2024-04-25 18:56:01,561 WARN  [io.cry.cor.net.JFRConnectionToolkit] (executor-thread-40) connection attempt failed.

2024-04-25 18:56:01,561 WARN  [io.cry.tar.TargetConnectionManager] (executor-thread-40) java.net.UnknownHostException: quarkus-test-agent-96b97cdb4-ssb6s
2024-04-25 18:56:03,207 WARN  [io.cry.cor.net.JFRConnectionToolkit] (executor-thread-40) connection attempt failed.

I'm not seeing the multiple notifications per second I was before.

@andrewazores
Member Author

The multiple attempts look like they come from the server's internal logic to retry connections a few times, in case there are transient network issues, before responding to the original client's request with a failure. It behaves the same way when trying to connect to a target to determine its JVM ID at discovery time. The UnknownHostException suggests the application died while Cryostat still wanted to connect to it, so the logs are just showing some residual attempts; Cryostat should give up on those within a few seconds.
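
For reference, the retry behavior described above amounts to something like the following generic sketch (not the actual TargetConnectionManager implementation):

// Sketch only: retry a connection attempt a few times before reporting failure
// back to the original client request. Assumes maxAttempts >= 1.
import java.util.concurrent.Callable;

final class RetrySketch {
    static <T> T withRetries(Callable<T> attempt, int maxAttempts, long delayMs)
            throws Exception {
        Exception last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return attempt.call();
            } catch (Exception e) {
                last = e; // e.g. UnknownHostException if the target pod is already gone
                Thread.sleep(delayMs);
            }
        }
        throw last; // give up after a few attempts and report the failure
    }
}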

@andrewazores
Member Author

Some of the changes to the discovery registration endpoint are causing agents to fail to properly refresh when the server notifies them that their JWT is going to expire. Working on that part now - hopefully the new ping-back logic can still be kept in.

@andrewazores
Member Author

Replaced by #415

Development

Successfully merging this pull request may close these issues.

[Bug] Discovery Plugin registrations and deregistrations behave badly