
fix(discovery): ensure discovery node parent relationship, correct deletion flow #407

Closed
wants to merge 27 commits

Conversation

andrewazores
Member

@andrewazores commented Apr 24, 2024

Welcome to Cryostat3! 👋

Before contributing, make sure you have:

  • Read the contributing guidelines
  • Linked a relevant issue which this PR resolves
  • Linked any other relevant issues, PRs, or documentation, if any
  • Resolved all conflicts, if any
  • Rebased your branch PR on top of the latest upstream main branch
  • Attached at least one of the following labels to the PR: [chore, ci, docs, feat, fix, test]
  • Signed all commits using a GPG signature

To recreate commits with a GPG signature: git fetch upstream && git rebase --force --gpg-sign upstream/main


Fixes: #406

Description of the change:

  1. Ensures correct parent<->child relationships between discovery nodes in the database.
  2. Cleans up discovery plugin credential handling. Credentials were always required in practice, but the model allowed them to be optional/nullable everywhere. Credentials are now checked more rigorously at plugin registration time and are associated with the plugin directly in the model via a foreign key ID, instead of being referenced only as embedded userinfo in the callback URL.
  3. When a discovery plugin requests to register, check whether any existing plugin registration has the same callback URL (ignoring any userinfo credential reference). If one exists, it is probably the same instance trying to re-register itself after an unclean shutdown. Verify this by attempting to ping the existing registration using the existing credentials: if the attempt fails, it must be because the old credentials are no longer valid (which makes sense for a new instance); if it somehow succeeds, the old instance is still reachable at the old/same URL with the old credentials, so reject the attempt. (A sketch of this flow follows the list.)
  4. Removes some explicit model relationship deletions and allows the database to cascade them instead. The cascading operations were already defined anyway.
  5. Ensures plugin ping tasks run in a transaction context.
  6. Ensures k8s Informers are only created/started if that discovery mechanism is actually enabled and available.
  7. Upgrades the smoketest db-viewer utility and increases the server smoketest log level.
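
To make point 3 concrete, here is a rough sketch of the re-registration check. All names below (PluginRecord, findByCallback, ping, and so on) are hypothetical illustrations, not the actual Cryostat 3 types or methods:

// Sketch only: illustrates the point-3 flow with hypothetical names.
import java.net.URI;
import java.util.Optional;

class RegistrationCheckSketch {

    void register(URI requestedCallback) {
        // Compare callbacks with any embedded user:pass userinfo stripped,
        // since credentials are now stored separately and linked by foreign key.
        URI normalized = stripUserInfo(requestedCallback);
        Optional<PluginRecord> existing = findByCallback(normalized);

        if (existing.isPresent()) {
            // Probably the same instance re-registering after an unclean shutdown.
            // Verify by pinging the old registration with its stored credentials.
            if (ping(existing.get())) {
                // The old instance is still reachable at the same URL with valid
                // credentials, so this is a genuine duplicate: reject the attempt.
                throw new IllegalStateException("callback already registered");
            }
            // The old credentials no longer work, so the previous instance is gone
            // and its registration is stale; it can be superseded.
            delete(existing.get());
        }
        // ... proceed with normal registration ...
    }

    // Hypothetical helpers and record, elided.
    record PluginRecord(URI callback, long credentialId) {}
    URI stripUserInfo(URI u) { return u; /* ... */ }
    Optional<PluginRecord> findByCallback(URI u) { return Optional.empty(); /* ... */ }
    boolean ping(PluginRecord p) { return false; /* ... */ }
    void delete(PluginRecord p) { /* ... */ }
}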

Motivation for the change:

Points 1-3 above are the main change and address the "rapid registration loop" bug that the server and Agent fall into when the Agent exits uncleanly, comes back up almost immediately, and tries to register itself again. The registration would be allowed, since the Agent's callback URL appeared "new" (differing only by the userinfo credentials reference), but the Agent would then not be allowed to publish its list of target nodes, because they duplicated what was already in the database. The Agent would detect this failure, drop back to registering again, then publish, fail, and loop.

Point 4 is just some code cleanup that should not have any effect in practice.
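
For point 4, the underlying idea is that once the relationship itself declares cascading removal, explicitly deleting children before a parent is redundant. A minimal JPA-style sketch with hypothetical entity names (not the actual Cryostat model):

// Sketch only: cascade-on-delete declared on the relationship.
import jakarta.persistence.CascadeType;
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.Id;
import jakarta.persistence.OneToMany;
import java.util.List;

@Entity
class ParentNode {
    @Id
    @GeneratedValue
    Long id;

    // Deleting the parent (or removing a child from this collection) also
    // removes the child rows -- no explicit child deletions are required.
    @OneToMany(cascade = CascadeType.ALL, orphanRemoval = true)
    List<ChildNode> children;
}

@Entity
class ChildNode {
    @Id
    @GeneratedValue
    Long id;
}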

Point 5 fixes an occasional bug where, when the discovery ping period for a plugin fired and the ping failed, the ping task was not able to delete the plugin.
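
Since point 5 is about transaction context: a task fired from a scheduler or executor thread has no transaction by default, so database writes such as deleting the plugin cannot take effect there. A minimal sketch of the kind of fix involved, assuming Quarkus's programmatic QuarkusTransaction API (the surrounding names are hypothetical):

// Sketch only: run the ping task's work inside a new transaction.
import io.quarkus.narayana.jta.QuarkusTransaction;

class PluginPingTask implements Runnable {
    private final long pluginId;

    PluginPingTask(long pluginId) {
        this.pluginId = pluginId;
    }

    @Override
    public void run() {
        // Without an active transaction, the delete below could not take effect.
        QuarkusTransaction.requiringNew().run(() -> {
            if (!ping(pluginId)) {
                removePlugin(pluginId); // database cascades clean up the subtree
            }
        });
    }

    // Hypothetical helpers, elided.
    boolean ping(long id) { return false; /* ... */ }
    void removePlugin(long id) { /* ... */ }
}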

Point 6 avoids exceptions about k8s Informer initialization when running in a non-k8s environment.
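
For point 6, the guard amounts to only touching the Kubernetes API when that discovery mechanism is both enabled and usable. A rough sketch, assuming the fabric8 client (other names are hypothetical):

// Sketch only: skip Informer setup when disabled or not running in a cluster.
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

class KubeDiscoverySketch {
    private KubernetesClient client;

    void start(boolean kubeDiscoveryEnabled) {
        if (!kubeDiscoveryEnabled || !inCluster()) {
            // Not in k8s (or the mechanism is disabled): do not create Informers,
            // avoiding initialization exceptions in non-k8s environments.
            return;
        }
        client = new KubernetesClientBuilder().build();
        // ... create and start the endpoint Informers with this client ...
    }

    private boolean inCluster() {
        // Common heuristic: the serviceaccount token only exists inside a pod.
        return new java.io.File(
                "/var/run/secrets/kubernetes.io/serviceaccount/token").exists();
    }
}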

How to manually test:

  1. Run CRYOSTAT_IMAGE=quay.io... bash smoketest.bash...
  2. ...

@andrewazores
Member Author

/build_test

Workflow started at 4/25/2024, 11:16:05 AM. View Actions Run.

@andrewazores
Member Author

@ebaron this should fix the OOMKilled loop we were seeing in k8s/OpenShift. There is still some occasional weirdness with the Topology view, or sometimes with Agent deregistration on clean exit. I'm trying to narrow down what's happening there and will file a separate bug and fix once I have it clearer.

No GraphQL schema changes detected.

OpenAPI schema change detected:

diff --git a/schema/openapi.yaml b/schema/openapi.yaml
index 87a077f..e426232 100644
--- a/schema/openapi.yaml
+++ b/schema/openapi.yaml
@@ -47,20 +47,38 @@ components:
       properties:
         connectUrl:
           type: string
         jvmId:
           type: string
         recordings:
           items:
             $ref: '#/components/schemas/ArchivedRecording'
           type: array
       type: object
+    Credential:
+      properties:
+        id:
+          format: int64
+          type: integer
+        matchExpression:
+          $ref: '#/components/schemas/MatchExpression'
+        password:
+          pattern: \S
+          type: string
+        username:
+          pattern: \S
+          type: string
+      required:
+        - matchExpression
+        - username
+        - password
+      type: object
     Data:
       type: object
     DiscoveryNode:
       properties:
         children:
           items:
             $ref: '#/components/schemas/DiscoveryNode'
           type: array
         id:
           format: int64
@@ -82,20 +100,22 @@ components:
         - nodeType
         - labels
       type: object
     DiscoveryPlugin:
       properties:
         builtin:
           type: boolean
         callback:
           format: uri
           type: string
+        credential:
+          $ref: '#/components/schemas/Credential'
         id:
           $ref: '#/components/schemas/UUID'
         realm:
           $ref: '#/components/schemas/DiscoveryNode'
       required:
         - id
         - realm
       type: object
     Evaluation:
       properties:

@andrewazores
Member Author

/build_test

Workflow started at 4/25/2024, 11:22:04 AM. View Actions Run.

CI build and push: At least one test failed ❌ (JDK17)
https://github.com/cryostatio/cryostat3/actions/runs/8835144153

No GraphQL schema changes detected.

No OpenAPI schema changes detected.

CI build and push: At least one test failed ❌ (JDK17)
https://github.com/cryostatio/cryostat3/actions/runs/8835236257

@andrewazores
Member Author

/build_test

Workflow started at 4/25/2024, 11:39:21 AM. View Actions Run.

No OpenAPI schema changes detected.

No GraphQL schema changes detected.

CI build and push: All tests pass ✅ (JDK17)
https://github.com/cryostatio/cryostat3/actions/runs/8835482572

@ebaron
Member

ebaron commented Apr 25, 2024

> @ebaron this should fix the OOMKilled loop we were seeing in k8s/OpenShift. There is still some occasional weirdness with the Topology view, or sometimes with Agent deregistration on clean exit. I'm trying to narrow down what's happening there and will file a separate bug and fix once I have it clearer.

It seems better in OpenShift. I'm still seeing some repeated failed attempts to connect but that might be due to the test application getting killed or failing to respond due to memory pressure:

2024-04-25 18:55:56,876 INFO  [io.cry.tar.TargetConnectionManager] (executor-thread-45) Opening connection to service:jmx:rmi:///jndi/rmi://quarkus-test-agent-96b97cdb4-ssb6s:9097/jmxrmi
2024-04-25 18:55:56,887 WARN  [io.cry.cor.net.JFRConnectionToolkit] (executor-thread-45) connection attempt failed.

2024-04-25 18:55:56,887 WARN  [io.cry.tar.TargetConnectionManager] (executor-thread-45) java.net.UnknownHostException: quarkus-test-agent-96b97cdb4-ssb6s
2024-04-25 18:55:56,887 INFO  [io.cry.tar.TargetConnectionManager] (executor-thread-40) Removing cached connection for service:jmx:rmi:///jndi/rmi://quarkus-test-agent-96b97cdb4-ssb6s:9097/jmxrmi: EXPLICIT
2024-04-25 18:55:56,892 WARN  [io.cry.cor.net.JFRConnectionToolkit] (executor-thread-40) connection attempt failed.

2024-04-25 18:55:56,892 WARN  [io.cry.tar.TargetConnectionManager] (executor-thread-40) java.net.UnknownHostException: quarkus-test-agent-96b97cdb4-ssb6s
2024-04-25 18:55:56,992 INFO  [io.qua.htt.access-log] (executor-thread-40) 10.129.0.2 - - [25/Apr/2024:18:55:56 +0000] "GET /health/liveness HTTP/1.1" 204 -
2024-04-25 18:55:57,822 WARN  [io.cry.cor.net.JFRConnectionToolkit] (executor-thread-40) connection attempt failed.

2024-04-25 18:55:57,822 WARN  [io.cry.tar.TargetConnectionManager] (executor-thread-40) java.net.UnknownHostException: quarkus-test-agent-96b97cdb4-ssb6s
2024-04-25 18:55:57,953 WARN  [io.cry.cor.net.JFRConnectionToolkit] (executor-thread-40) connection attempt failed.

2024-04-25 18:55:57,953 WARN  [io.cry.tar.TargetConnectionManager] (executor-thread-40) java.net.UnknownHostException: quarkus-test-agent-96b97cdb4-ssb6s
2024-04-25 18:56:01,217 WARN  [io.cry.cor.net.JFRConnectionToolkit] (executor-thread-40) connection attempt failed.

2024-04-25 18:56:01,218 WARN  [io.cry.tar.TargetConnectionManager] (executor-thread-40) java.net.UnknownHostException: quarkus-test-agent-96b97cdb4-ssb6s
2024-04-25 18:56:01,561 WARN  [io.cry.cor.net.JFRConnectionToolkit] (executor-thread-40) connection attempt failed.

2024-04-25 18:56:01,561 WARN  [io.cry.tar.TargetConnectionManager] (executor-thread-40) java.net.UnknownHostException: quarkus-test-agent-96b97cdb4-ssb6s
2024-04-25 18:56:03,207 WARN  [io.cry.cor.net.JFRConnectionToolkit] (executor-thread-40) connection attempt failed.

I'm not seeing the multiple notifications per second I was before.

@andrewazores
Member Author

The multiple attempts look like they come from the server's internal logic to retry connections a few times, in case there are transient network issues, before responding to the original client's request with a failure. It behaves the same way when trying to connect to a target to determine its JVM ID at discovery time. The UnknownHostException suggests the application died while Cryostat still wanted to connect to it, so the logs are just showing some residual attempts; Cryostat should give up on those within a few seconds.
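
For reference, the retry behavior described above amounts to something like the following generic sketch (not the actual TargetConnectionManager implementation):

// Sketch only: retry a connection attempt a few times before reporting failure
// back to the original client request. Assumes maxAttempts >= 1.
import java.util.concurrent.Callable;

final class RetrySketch {
    static <T> T withRetries(Callable<T> attempt, int maxAttempts, long delayMs)
            throws Exception {
        Exception last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return attempt.call();
            } catch (Exception e) {
                last = e; // e.g. UnknownHostException if the target pod is already gone
                Thread.sleep(delayMs);
            }
        }
        throw last; // give up after a few attempts and report the failure
    }
}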

@andrewazores
Member Author

Some of the changes to the discovery registration endpoint are causing agents to fail to properly refresh when the server notifies them that their JWT is going to expire. Working on that part now - hopefully the new ping-back logic can still be kept in.

@andrewazores
Member Author

Replaced by #415

Development

Successfully merging this pull request may close these issues.

[Bug] Discovery Plugin registrations and deregistrations behave badly