-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(discovery): discovery synchronization for stale lost targets #689
base: main
Are you sure you want to change the base?
Conversation
Deployed on OpenShift with:
Manually testing with various ways to mess around with a sample application deployment:
With or without this PR I can eventually get into a bad state as described in the original issue - either a stale discovered Pod that is not really there anymore, or else some Pods that exist but are not discovered. cryostat-6659dd8598-d2n75-cryostat.log After getting Cryostat into this state, with the various discovery exceptions logged above, I can no longer get it to discover other sample applications - even fully undeploying the sample application and deploying it fresh, or restarting or re-scaling the deployment. |
and repeatedly restarting the deployment to cause rollouts of multiple replicas seems to be a good way to trigger the bugged behaviour, eventually. It isn't deterministic and there are definitely multiple worker threads involved, so this seems like a thread synchronization issue and/or race condition. |
2a01b3a
to
963e26e
Compare
/build_test |
Workflow started at 10/17/2024, 12:03:01 PM. View Actions Run. |
No OpenAPI schema changes detected. |
No GraphQL schema changes detected. |
CI build and push: All tests pass ✅ |
/build_test |
Workflow started at 10/18/2024, 10:35:06 AM. View Actions Run. |
No OpenAPI schema changes detected. |
No GraphQL schema changes detected. |
CI build and push: All tests pass ✅ |
Seems like this is not a proper fix yet, only a mitigation - the problem occurs less frequently, but I can still get it to happen on occasion. I'll keep working on it. |
6805f84
to
6b4a902
Compare
/build_test |
Workflow started at 10/23/2024, 10:45:31 AM. View Actions Run. |
No GraphQL schema changes detected. |
No OpenAPI schema changes detected. |
CI build and push: All tests pass ✅ |
/build_test |
Workflow started at 10/23/2024, 11:19:56 AM. View Actions Run. |
No OpenAPI schema changes detected. |
… worker threads, attempt reconnections periodically
…ogic for periodic updates vs on-discovery updates
…ve recordings can be updated as well
…fter JVM ID is determined
65df06e
to
3301825
Compare
/build_test |
Workflow started at 10/24/2024, 4:03:10 PM. View Actions Run. |
No OpenAPI schema changes detected. |
No GraphQL schema changes detected. |
CI build and push: All tests pass ✅ |
Welcome to Cryostat! 👋
Before contributing, make sure you have:
main
branch[chore, ci, docs, feat, fix, test]
To recreate commits with GPG signature
git fetch upstream && git rebase --force --gpg-sign upstream/main
Fixes: #634
Description of the change:
Motivation for the change:
See #634
How to manually test:
quay.io/andrewazores/cryostat:k8s-discovery-30
helm install cryostat --set authentication.openshift.enabled=true --set core.route.enabled=true --set core.discovery.kubernetes.enabled=true --set core.image.repository=quay.io/andrewazores/cryostat --set core.image.tag=k8s-discovery-30 ./charts/cryostat/
oc rollout restart
andoc scale deployment
to try to trigger the bad behaviour.