fix(discovery): ensure discovery node parent relationship, correct deletion flow #407
Conversation
Force-pushed from 0dbe8d6 to ad8a370 (Compare)
/build_test
Workflow started at 4/25/2024, 11:16:05 AM. View Actions Run.
@ebaron this should fix the OOMKilled loop we were seeing in k8s/OpenShift. There is still some occasional weirdness with the Topology view, or sometimes with Agent deregistration on clean exit. I'm trying to narrow down what's happening there and will file a separate bug and fix once I have a clearer picture.
No GraphQL schema changes detected.
OpenAPI schema change detected:
diff --git a/schema/openapi.yaml b/schema/openapi.yaml
index 87a077f..e426232 100644
--- a/schema/openapi.yaml
+++ b/schema/openapi.yaml
@@ -47,20 +47,38 @@ components:
       properties:
         connectUrl:
           type: string
         jvmId:
           type: string
         recordings:
           items:
             $ref: '#/components/schemas/ArchivedRecording'
           type: array
       type: object
+    Credential:
+      properties:
+        id:
+          format: int64
+          type: integer
+        matchExpression:
+          $ref: '#/components/schemas/MatchExpression'
+        password:
+          pattern: \S
+          type: string
+        username:
+          pattern: \S
+          type: string
+      required:
+      - matchExpression
+      - username
+      - password
+      type: object
     Data:
       type: object
     DiscoveryNode:
       properties:
         children:
           items:
             $ref: '#/components/schemas/DiscoveryNode'
           type: array
         id:
           format: int64
@@ -82,20 +100,22 @@ components:
       - nodeType
       - labels
       type: object
     DiscoveryPlugin:
       properties:
         builtin:
           type: boolean
         callback:
           format: uri
           type: string
+        credential:
+          $ref: '#/components/schemas/Credential'
         id:
           $ref: '#/components/schemas/UUID'
         realm:
           $ref: '#/components/schemas/DiscoveryNode'
       required:
       - id
       - realm
       type: object
     Evaluation:
       properties:
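For orientation, here is a rough Java sketch of the shapes this schema change implies: a Credential with an id, a required matchExpression, and non-blank username/password, now referenced from DiscoveryPlugin via the new credential field. The field names come from the diff above; the class layout and stub types are assumptions for illustration, not the actual Cryostat 3 source.

// Hypothetical POJOs mirroring the OpenAPI shapes in the diff above.
// Field names are taken from the schema; everything else is an assumption.
import java.net.URI;
import java.util.UUID;

public class SchemaSketch {
    static class MatchExpression {} // stub for #/components/schemas/MatchExpression

    static class DiscoveryNode {}   // stub for #/components/schemas/DiscoveryNode

    static class Credential {
        long id;                         // int64
        MatchExpression matchExpression; // required
        String username;                 // required; pattern \S means at least one non-whitespace character
        String password;                 // required; pattern \S means at least one non-whitespace character
    }

    static class DiscoveryPlugin {
        UUID id;               // required
        boolean builtin;
        URI callback;          // format: uri
        Credential credential; // new: stored credential associated with this plugin
        DiscoveryNode realm;   // required
    }
}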
/build_test
Workflow started at 4/25/2024, 11:22:04 AM. View Actions Run.
CI build and push: At least one test failed ❌ (JDK17)
No GraphQL schema changes detected.
No OpenAPI schema changes detected.
/build_test
Workflow started at 4/25/2024, 11:39:21 AM. View Actions Run.
No OpenAPI schema changes detected.
No GraphQL schema changes detected.
CI build and push: All tests pass ✅ (JDK17)
It seems better in OpenShift. I'm still seeing some repeated failed attempts to connect, but that might be due to the test application getting killed or failing to respond under memory pressure. I'm not seeing the multiple notifications per second that I was seeing before.
The multiple attempts look like they're because of the server's internal logic to retry connections a few times in case there are underlying network issues, so it will try a few times before responding to the original client's request with a failure. It also behaves the same way when trying to connect to a target to determine its JVM ID at discovery time. The …
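For illustration, the retry behaviour described above amounts to a bounded retry loop that only reports failure to the original client once every attempt has failed. This is a sketch under assumed names, attempt counts, and delays, not the actual server code.

import java.time.Duration;
import java.util.concurrent.Callable;

// Sketch: try a connection a few times to absorb transient network issues,
// then surface the last failure to the caller (e.g. the original client request).
public final class RetryingConnector {
    static <T> T withRetries(Callable<T> attempt, int maxAttempts, Duration delay) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.call();
            } catch (Exception e) {
                last = e; // remember the most recent failure
                if (i < maxAttempts) {
                    Thread.sleep(delay.toMillis()); // brief pause before the next attempt
                }
            }
        }
        throw last; // all attempts failed: report failure back to the requester
    }
}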
Some of the changes to the discovery registration endpoint are causing agents to fail to properly refresh when the server notifies them that their JWT is going to expire. Working on that part now - hopefully the new ping-back logic can still be kept in. |
…, instead look up credential associated with matching discovery plugin
…dentials, instead look up credential associated with matching discovery plugin" This reverts commit 813a44e.
Replaced by #415
Welcome to Cryostat3! 👋
Before contributing, make sure you have:
- based your work on top of the latest upstream main branch
- labelled the PR with one of: [chore, ci, docs, feat, fix, test]
To recreate commits with GPG signature:
git fetch upstream && git rebase --force --gpg-sign upstream/main
Fixes: #406
Description of the change:
Motivation for the change:
Points 1-3 above are the main change and address the "rapid registration loop" bug that the server and Agent fall into when the Agent exits uncleanly, comes back up almost immediately, and tries to register itself again. The Agent would be allowed to register, since its callback URL looked "new" (differing only by the userinfo credentials reference), but it would then not be allowed to publish its list of target nodes, because those nodes duplicate what is already in the database. The Agent would detect this failure, drop back out to registering again, then publish, fail, and loop.
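A minimal sketch of the kind of callback comparison that avoids treating such a re-registration as a brand-new plugin, assuming the credentials reference lives in the URI userinfo; the names and the storedcredentials userinfo format below are illustrative assumptions, not the actual server-side fix.

import java.net.URI;
import java.net.URISyntaxException;
import java.util.Objects;

// Sketch: two agent callbacks that differ only in their userinfo (credentials reference)
// should resolve to the same plugin registration.
public final class CallbackIdentity {
    static URI withoutUserInfo(URI callback) throws URISyntaxException {
        return new URI(
                callback.getScheme(),
                null, // drop userinfo such as a stored-credentials reference
                callback.getHost(),
                callback.getPort(),
                callback.getPath(),
                callback.getQuery(),
                callback.getFragment());
    }

    static boolean sameRegistration(URI a, URI b) throws URISyntaxException {
        return Objects.equals(withoutUserInfo(a), withoutUserInfo(b));
    }

    public static void main(String[] args) throws Exception {
        URI first = new URI("http://storedcredentials:1@agent.example.com:9977/");
        URI second = new URI("http://storedcredentials:2@agent.example.com:9977/");
        System.out.println(sameRegistration(first, second)); // true
    }
}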
Point 4 is just some code cleanup that should not have any effect in practice.
Point 5 fixes an occasional bug where, when the discovery ping period for a plugin fired and the ping failed, the failed ping task was unable to delete the plugin.
Point 6 avoids exceptions about k8s Informer initialization when running in a non-k8s environment.
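As an illustration of the guard point 6 describes, a common approach is to initialize the Kubernetes Informers only when the process appears to be running inside a cluster, e.g. by checking the KUBERNETES_SERVICE_HOST environment variable or the mounted service account token. The class and method names below are assumptions for illustration, not the actual implementation.

import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: skip k8s Informer setup entirely when not running in Kubernetes/OpenShift,
// instead of letting initialization throw in a plain container or bare-metal environment.
public final class KubeEnvironment {
    private static final Path SERVICE_ACCOUNT_TOKEN =
            Path.of("/var/run/secrets/kubernetes.io/serviceaccount/token");

    static boolean insideKubernetes() {
        return System.getenv("KUBERNETES_SERVICE_HOST") != null
                || Files.exists(SERVICE_ACCOUNT_TOKEN);
    }

    static void startDiscovery(Runnable informerStartup) {
        if (!insideKubernetes()) {
            return; // not in k8s: nothing to watch, and no Informer exceptions
        }
        informerStartup.run();
    }
}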
How to manually test: