Fleet-Server not starting up on 8.14-SNAPSHOT on cloud (intermittent) #3328
Comments
I've been unable to recreate this with a self-managed cluster using the latest snapshots locally.
Were you able to reproduce on cloud? I'm wondering if the issue is specific to cloud preconfiguration or some other cloud-specific config.
I found this warning in the logs, which is interesting because the fleet-server host points to an https host URL. Not sure if it has anything to do with the coordinator.
It seems that the issue is only reproducible with Terraform; for some reason it depends on the hardware template I used when trying to reproduce the issue.
@juliaElastic it started happening with https://admin.found.no/deployments/3483523f271adc9936722a933979ad11
I'm missing this from the ES logs (other …)
That error comes from the index monitor used to watch for fleet actions and policy updates. Getting regular i/o timeouts from those index monitors seems like it could cause this.
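For context, here is a minimal sketch of how an index-monitor loop behaves under persistent timeouts. The names and structure are illustrative only, not fleet-server's actual implementation:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"os"
	"time"
)

// checkIndex stands in for the Elasticsearch search the monitor runs to
// find documents above the last seen sequence number.
func checkIndex(ctx context.Context, lastSeqNo int64) (int64, error) {
	// ... _search on the monitored index filtered by _seq_no > lastSeqNo ...
	return lastSeqNo, nil
}

// monitor polls the index and pushes new sequence numbers to notify.
// If every poll ends in an i/o timeout, lastSeqNo never advances and
// subscribers (action dispatch, policy self-monitor) never see updates.
func monitor(ctx context.Context, notify chan<- int64) {
	var lastSeqNo int64
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			seqNo, err := checkIndex(ctx, lastSeqNo)
			var nerr net.Error
			if errors.As(err, &nerr) && nerr.Timeout() {
				// Each timeout skips a poll; persistent timeouts mean the
				// monitor stays alive but never observes a change.
				fmt.Fprintln(os.Stderr, "index monitor: i/o timeout, retrying")
				continue
			}
			if err != nil {
				fmt.Fprintln(os.Stderr, "index monitor:", err)
				continue
			}
			if seqNo > lastSeqNo {
				lastSeqNo = seqNo
				notify <- seqNo
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	notify := make(chan int64, 1)
	go monitor(ctx, notify)
	<-ctx.Done()
}
```

If the timeouts never stop, the process looks healthy while the last seen sequence number never advances, which matches the "coordinator doesn't pick up the change" symptom.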
It still happens in 8.14.0-BC1, with a different template.
@kuisathaverat - Hey Ivan, this should be fixed by elastic/kibana#181624. Are you still seeing this behavior on the latest snapshot build?
We have tested the Docker images that contain the fix; the issue persists: https://artifacts-api.elastic.co/v1/versions/8.15.0-SNAPSHOT/builds/8.15.0-00251ce4
Looking at the latest instance logs, I'm seeing something strange: it says Admin link: https://admin.found.no/deployments/15e5bafefa50a01b902fb82c919495fa/integrations_server while also seeing this:
The top-level …
One more thing I noticed: on the deployment where we have the issue it says …, though the issue can be reproduced even when …
I created a custom image from the 8.14 branch with a lot of info logs, and can't reproduce the issue. I'm wondering if this is some kind of concurrency issue, something like fleet-server not picking up config changes in some cases (APM or TLS config).
It seems it's a concurrency issue, as it does not happen every time and a restart seems to fix it. Maybe we could hardcode some delay to be able to reproduce the issue.
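As a rough illustration of why a hardcoded delay can make such a race reproducible (hypothetical names, not fleet-server code): if the component reading the policy ID can run before the config writer has set it, a sleep on the writer side makes the bad interleaving happen every time.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type config struct {
	mu       sync.Mutex
	policyID string
}

func (c *config) set(id string) { c.mu.Lock(); c.policyID = id; c.mu.Unlock() }
func (c *config) get() string   { c.mu.Lock(); defer c.mu.Unlock(); return c.policyID }

func main() {
	c := &config{}
	done := make(chan struct{})
	go func() {
		// Simulate the config arriving late (e.g. cloud preconfiguration):
		// the hardcoded delay widens the race window so the bad ordering
		// is observed on every run instead of intermittently.
		time.Sleep(100 * time.Millisecond)
		c.set("fleet-server-policy")
		close(done)
	}()
	// The reader starts immediately and sees an empty policy ID.
	fmt.Printf("policy ID at startup: %q\n", c.get())
	<-done
	fmt.Printf("policy ID after config load: %q\n", c.get())
}
```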
Looking at the fleet-server code, we should never log … @michel-laterman, how can we get an empty policy ID here? fleet-server/internal/pkg/server/fleet.go, line 492 in 382222f
Also, it seems it's something we introduced in 8.14; if you search for …
I think this log message comes because there is no fleet server policy with …

It seems the issue started on Feb 22 on 8.14-SNAPSHOT: https://platform-logging.kb.us-west2.gcp.elastic-cloud.com/app/r/s/zvjg9

Though this message is misleading: it doesn't always indicate a problem with missing policies, as it is logged on healthy clusters as well, e.g. this one today: https://admin.found.no/deployments/a91a27b6929a1da13493ceaed4411c56/integrations_server

Interesting that the "APM instrumentation enabled" messages somewhat correlate; they started from Feb 23. Another correlation with the …

I found one deployment today where the Policy.ID is empty, though it didn't occur together with the missing coordinator. Here is another cluster where this happens, still running: https://admin.found.no/deployments/fddc23512824237462861336b0cfe2ed

I found that the issue is reproducible in BC builds (version 8.14.0) and hard to reproduce on 8.14.0-SNAPSHOT or custom images.
I think the log message is there because Fleet Server is not correctly configured (no policy ID); if the policy ID were correct, it should log this instead: fleet-server/internal/pkg/policy/self.go, line 200 in 382222f
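A hypothetical sketch of why the message is ambiguous (this is not the actual self-monitor code): with an empty configured policy ID the monitor can never match any policy, which produces the same degraded symptom as a policy that genuinely doesn't exist yet.

```go
package main

import "fmt"

type policy struct{ ID string }

// checkSelfPolicy mimics the shape of a self-monitor health check: an empty
// configured ID and a missing policy are different failures that surface as
// similar "not correctly configured" states.
func checkSelfPolicy(configuredID string, policies map[string]policy) string {
	if configuredID == "" {
		// The case under discussion: the ID was lost upstream, so the
		// monitor can never match any policy, regardless of index state.
		return "degraded: no policy ID configured"
	}
	if _, ok := policies[configuredID]; !ok {
		return fmt.Sprintf("degraded: policy %q not found (yet)", configuredID)
	}
	return "healthy"
}

func main() {
	policies := map[string]policy{"fleet-server-policy": {ID: "fleet-server-policy"}}
	fmt.Println(checkSelfPolicy("", policies))
	fmt.Println(checkSelfPolicy("missing", policies))
	fmt.Println(checkSelfPolicy("fleet-server-policy", policies))
}
```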
Looking at the timeline you provided, it could be related to that PR: #3277. I added a test here and it seems we are losing the policy ID: #3508
Good catch. I think the culprit is here: fleet-server/internal/pkg/server/agent.go, lines 429 to 436 in 528e4ae
@nchaulet, I tested locally with your agent_test, and adding policy to the instrumentation config works. I'm not sure if there is a way to add all keys from the input (like the spread operator in TS).
If we merge with a different option than …
You are right, the default options have this: … Using …
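As a minimal illustration of the merge behavior being discussed (the config shapes here are assumed for the example, not fleet-server's actual structures): rebuilding a config from only the keys you care about silently drops sibling fields like the policy ID, while merging the override on top of the full input preserves them. The sketch uses elastic/go-ucfg, the config library used across Elastic's Go services:

```go
package main

import (
	"fmt"

	"github.com/elastic/go-ucfg"
)

type output struct {
	Policy struct {
		ID string `config:"id"`
	} `config:"policy"`
	Instrumentation struct {
		Enabled bool `config:"enabled"`
	} `config:"instrumentation"`
}

func main() {
	input := map[string]interface{}{
		"policy":          map[string]interface{}{"id": "fleet-server-policy"},
		"instrumentation": map[string]interface{}{"enabled": false},
	}

	// Buggy variant: build a fresh config from only the keys we want to
	// change; everything else, including policy.id, is lost.
	rebuilt, _ := ucfg.NewFrom(map[string]interface{}{
		"instrumentation": map[string]interface{}{"enabled": true},
	})

	// Correct variant: start from the full input and merge the override on
	// top, so unrelated keys like policy.id survive (go-ucfg merges nested
	// objects recursively by default).
	merged, _ := ucfg.NewFrom(input)
	_ = merged.Merge(map[string]interface{}{
		"instrumentation": map[string]interface{}{"enabled": true},
	})

	var a, b output
	_ = rebuilt.Unpack(&a)
	_ = merged.Unpack(&b)
	fmt.Printf("rebuilt: policy.id=%q\n", a.Policy.ID) // ""
	fmt.Printf("merged:  policy.id=%q\n", b.Policy.ID) // "fleet-server-policy"
}
```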
@juliaElastic - Can this issue be closed as the fix has been verified? |
We can close it; I will verify again in cloud when BC3 is built.
Added test issue here: #3516. Let me know if I've captured this issue accurately, and feel free to comment on that issue to clarify further. |
While trying to verify the fix in the BC3 build, I'm noticing something strange. Tracing is still not enabled in the fleet-server logs. I tried enabling traces manually in this deployment by adding this in Advanced Edit / Deployment config under …
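The exact snippet used in the deployment wasn't captured above; as a rough sketch, Elastic Agent's documented self-instrumentation settings in user settings look something like the following, where the host and token values are placeholders:

```yaml
# Sketch only: keys follow elastic-agent's self-monitoring settings;
# the APM host and secret_token are placeholders, not real values.
agent.monitoring:
  traces: true
  apm:
    hosts:
      - https://my-apm-server.example.com:443
    environment: dev
    secret_token: <redacted>
```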
I'm seeing logs that instrumentation is enabled with the right APM settings, but then it's immediately disabled again. At least the original issue seems resolved, but I'm not sure the cloud APM instrumentation works correctly. Continued testing on a snapshot build by adding the full APM config to user settings; it is confirmed that it arrived to …
I added more logs around the config merge logic, but I'm not seeing any of them in the logs. It looks as if the agent doesn't pick up the APM config added by cloud; I tried restarting the integrations server, but it didn't help.
@juliaElastic, how are you enabling instrumentation? I've tried to use the Terraform in …, where the host in the list is the cluster's own address, and then updated the deployment. The logs you posted had one entry where tracing (in fleet-server) was enabled; however, I don't see that in my deployment.
There is a default APM config added to non-snapshot version deployments (documented here); Alex Piggott confirmed that the APM config is there in the …

It's possible that this is not a new issue. I created an 8.13.2 cluster with oblt-cli, and the APM instrumentation is disabled there too: https://admin.found.no/deployments/801c633a8dd40de2d71470a5c5e0b01d/integrations_server

Also looked at … Though it's possible the cloud config didn't work in fleet-server before, because this change was needed: #3277
Okay, I was able to enable fleet-server traces with a config added to user settings here (8.14 cluster): https://admin.found.no/deployments/6bd9a5eb5220c666921a8f44501d938e/integrations_server
It is strange though that the log says …

So, to summarize, what seems to work in 8.14:
What does not seem to work:
Following the thread mentioned here, it seems that the default APM config sends the traces to the overview cloud cluster; I'm seeing traces from 8.14 here: https://overview.elastic-cloud.com/app/r/s/JIEzg. These are the deployments that have fleet-server traces in the overview cloud prod cluster:
Created a follow-up issue for the missing traces. Closing this, as the original issue is resolved.
Reported on cloud deployments on the 8.14-SNAPSHOT version: Fleet-Server is not starting up, with the following error:

When looking at the .fleet-policies index, it seems that the coordinator is not picking up the policy changes. There are no errors in the fleet-server logs, but the coordinator doesn't pick up the change.
Deployment where the issue is reproduced: https://admin.found.no/deployments/19ed6657cf3dccbba39c8b6faacb67f9
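To check whether the coordinator has processed a policy revision, one can compare the revision and coordinator indices on the latest .fleet-policies document, along these lines (a sketch; the policy ID is an example value and the field names follow the .fleet-policies schema as I recall it):

```
GET .fleet-policies/_search
{
  "size": 1,
  "sort": [{ "revision_idx": "desc" }],
  "query": { "term": { "policy_id": "policy-elastic-agent-on-cloud" } }
}
```

A document whose coordinator_idx stays at 0 while revision_idx advances would match the "coordinator not picking up the change" symptom described above.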