Metricbeat windows service metrics stops sending documents when a single service fails #40765

Open
TheRiffRafi opened this issue Sep 11, 2024 · 4 comments
Labels
Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

TheRiffRafi commented Sep 11, 2024

  • Version: 8.10.4
  • Operating System:
runtime:
    arch: amd64
    os: windows
    osinfo:
        family: windows
        major: 6
        minor: 3
        patch: 0
        type: windows
        version: "6.3"
  • Steps to Reproduce:
    No clear steps to reproduce, more info on this later.

Multiple instances of elastic-agent installations are failing to send the windows.service metric set for the windows integration. The system integration continues to send data without issues. The problem happens at random and it is resolved by restarting the elastic agent.
The issue happens in different versions of 8.x for elastic-agent, and it hasn't been confirmed as occurring on the latest version (the user who experienced it has not upgraded to the latest version yet). So far the issue has only been seen on 8.10.4.

The error reported by metricbeat is the following:

{"log.level":"error","@timestamp":"2024-07-29T20:49:33.157Z","message":"Error fetching data for metricset windows.service: OpenProcess failed for pid=1724: The parameter is incorrect.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"windows/metrics-default","type":"windows/metrics"},"log":{"source":"windows/metrics-default"},"log.origin":{"file.line":256,"file.name":"module/wrapper.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}

So far the error points to a problem with only one particular Windows service. However, once that service gets into an unexpected state, the entire Metricbeat windows.service metricset stops reporting, so none of the other monitored services are reported either.

Because this happens at random, we are unable to set up debug logging to catch the failure, and the logger for this function does not provide any more information.

We need to address 2 items with this issue:

  1. The windows service monitoring stops sending stats for ANY service once a single service gets into a weird state (this fits a bug description).
  2. There is no logger that specifies what that weird state was, nor an indication as to why sending service metrics for other services stops working (this fits a feature request that may or may not be necessary to address point 1).
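
To illustrate item 1, here is a minimal Go sketch of per-service error isolation: a failure for one service is recorded and the loop continues, so the remaining services are still reported. This is not Metricbeat's actual code. The names `serviceInfo`, `queryFunc`, `collect`, `demoServices`, and `demoQuery` are hypothetical, and every PID except 1724 (taken from the log above) is made up.

```go
package main

import (
	"errors"
	"fmt"
)

// serviceInfo is a hypothetical stand-in for the data collected per service.
type serviceInfo struct {
	Name string
	PID  uint32
}

// queryFunc is a hypothetical per-service lookup. In the issue, the failing
// step is an OpenProcess call for one service's PID.
type queryFunc func(svc serviceInfo) (string, error)

// collect queries every service and keeps going when one lookup fails.
// It returns the successful results plus a joined error naming each failure.
func collect(services []serviceInfo, query queryFunc) (map[string]string, error) {
	results := make(map[string]string, len(services))
	var errs []error
	for _, svc := range services {
		state, err := query(svc)
		if err != nil {
			// Record which service failed and why, then move on.
			errs = append(errs, fmt.Errorf("service %q (pid=%d): %w", svc.Name, svc.PID, err))
			continue
		}
		results[svc.Name] = state
	}
	// errors.Join returns nil when there were no failures.
	return results, errors.Join(errs...)
}

// demoServices is made-up sample data; "Broken" simulates the failing service.
var demoServices = []serviceInfo{
	{Name: "Spooler", PID: 1200},
	{Name: "Broken", PID: 1724},
	{Name: "W32Time", PID: 980},
}

// demoQuery simulates the OpenProcess failure for pid=1724 only.
func demoQuery(svc serviceInfo) (string, error) {
	if svc.PID == 1724 {
		return "", errors.New("OpenProcess failed: The parameter is incorrect.")
	}
	return "running", nil
}

func main() {
	results, err := collect(demoServices, demoQuery)
	fmt.Println(len(results), "services reported")
	if err != nil {
		fmt.Println("partial failure:", err)
	}
}
```

Returning the partial results together with a joined error also speaks to item 2: the error text names each failing service and its PID instead of hiding them behind a single metricset-level failure.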
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Sep 11, 2024
@cmacknz cmacknz added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Sep 11, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Sep 11, 2024
@cmacknz
Member

cmacknz commented Sep 12, 2024

@VihasMakwana I think I saw that you had root-caused the "OpenProcess failed for pid=1724: The parameter is incorrect" error elsewhere? Or am I misremembering?

@VihasMakwana
Contributor

VihasMakwana commented Sep 12, 2024

@cmacknz yes, that's correct.

On my personal desktop, Metricbeat, running as root, wasn't able to access the following processes:

  • PIDs 0 and 4 (protected processes; you can never access them).
  • Processes owned by the SYSTEM user (some antivirus processes, e.g.).
    • They were accessible, but with limited info.

This was for system.process integration though. The above issue is about windows.service integration but I believe the root cause is similar.


@TheRiffRafi do you see any warning related to SeDebugPrivilege at the beginning of the logs?
Something like:
"Metricbeat is running without SeDebugPrivilege, a Windows privilege that allows it to collect metrics...",
"Failure while attempting to enable SeDebugPrivilege", or "Metricbeat failed to enable the SeDebugPrivilege"?
Can you attach logs from the beginning, if possible?
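
If going through the logs by hand is impractical, a small Go sketch like the one below could scan JSON log lines for any message mentioning SeDebugPrivilege. It assumes one JSON object per line with `@timestamp` and `message` fields, as in the error line quoted in the issue description. The names `logEntry`, `findPrivilegeWarnings`, `scanString`, and `sampleLog` are hypothetical, and the sample lines are made up for illustration.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// logEntry holds the two fields of interest from a JSON log line.
type logEntry struct {
	Timestamp string `json:"@timestamp"`
	Message   string `json:"message"`
}

// findPrivilegeWarnings returns every log entry whose message mentions
// SeDebugPrivilege. Lines that are not valid JSON are skipped.
func findPrivilegeWarnings(r io.Reader) ([]logEntry, error) {
	var hits []logEntry
	sc := bufio.NewScanner(r)
	// Raise the line limit, since JSON log lines can be long.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for sc.Scan() {
		var e logEntry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			continue
		}
		if strings.Contains(e.Message, "SeDebugPrivilege") {
			hits = append(hits, e)
		}
	}
	return hits, sc.Err()
}

// scanString is a convenience wrapper for scanning an in-memory log.
func scanString(s string) ([]logEntry, error) {
	return findPrivilegeWarnings(strings.NewReader(s))
}

// sampleLog is made-up sample data, not real output from the affected hosts.
const sampleLog = `{"log.level":"warn","@timestamp":"2024-01-01T00:00:00.000Z","message":"Metricbeat is running without SeDebugPrivilege"}
{"log.level":"error","@timestamp":"2024-01-01T00:00:10.000Z","message":"Error fetching data for metricset windows.service"}
`

func main() {
	hits, err := scanString(sampleLog)
	if err != nil {
		fmt.Println("scan error:", err)
		return
	}
	for _, h := range hits {
		fmt.Println(h.Timestamp, h.Message)
	}
}
```

To scan a real log file, the same function could be given the result of `os.Open` on that file instead of an in-memory string.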

@TheRiffRafi
Author

Hello @VihasMakwana!

Unfortunately, I can't help with logs. In every instance of the failure that I have, the logs begin after the problem had already started. We have not caught a case where the issue is absent and then suddenly starts happening (the systems go weeks without reporting the service).

Also, I have to correct the original description: we have only seen this on 8.10.4. We haven't tested a more recent version, because the user's entire stack is still on 8.10.4. The statement that we had seen this problem on a later version was a misunderstanding.
