Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[metricbeat/system][windows] - Metricbeat reports DEGRADED while running in privileged mode #40484

Open
VihasMakwana opened this issue Aug 10, 2024 · 11 comments
Assignees
Labels
bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@VihasMakwana
Copy link
Contributor

  • Metricbeat reports DEGRADED while running in agent privileged mode.
  • It reports Access is denied errors, which results in DEGRADED mode.
    • This error is expected for unprivileged mode but not while running agent as an administrator
@VihasMakwana VihasMakwana added bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels Aug 10, 2024
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@VihasMakwana VihasMakwana self-assigned this Aug 10, 2024
@cmacknz
Copy link
Member

cmacknz commented Aug 12, 2024

What data stream was this observed for? Are their logs you can attach to the issue?

@VihasMakwana
Copy link
Contributor Author

@cmacknz the errors reported are similar to #40542 (comment)

    - id: system/metrics-default
      state:
        message: 'Healthy: communicating with pid ''1556'''
        pid: 0
        state: 2
        units:
            input-system/metrics-default-system/metrics-system-5f5e65eb-2fd6-41e1-8c29-f24d57e66509:
                state: DEGRADED
                message: |-
                    Error fetching data for metricset system.process_summary: Not enough privileges to fetch information: Not enough privileges to fetch information: GetInfoForPid: could not get all information for PID 0: error fetching name: OpenProcess failed for pid=0: The parameter is incorrect.
                    error fetching status: OpenProcess failed for pid=0: The parameter is incorrect.
                    GetInfoForPid: could not get all information for PID 4: error fetching name: GetProcessImageFileName failed for pid=4: GetProcessImageFileName failed: invalid argument
                payload:
                    streams:
                        system/metrics-system.process.summary-5f5e65eb-2fd6-41e1-8c29-f24d57e66509:
                            error: |-
                                Error fetching data for metricset system.process_summary: Not enough privileges to fetch information: Not enough privileges to fetch information: GetInfoForPid: could not get all information for PID 0: error fetching name: OpenProcess failed for pid=0: The parameter is incorrect.
                                error fetching status: OpenProcess failed for pid=0: The parameter is incorrect.
                                GetInfoForPid: could not get all information for PID 4: error fetching name: GetProcessImageFileName failed for pid=4: GetProcessImageFileName failed: invalid argument
                            status: DEGRADED

There's a coincidence. All the these PIDs refer to the SYSTEM processes.
On windows, we try to open the process with PROCESS_VM_READ and PROCESS_QUERY_LIMITED_INFORMATION access rights. More info here.
An administrator can open a system process with PROCESS_QUERY_LIMITED_INFORMATION, but not with PROCESS_VM_READ.
This results in an error.

This seems to be related with #17314

@VihasMakwana
Copy link
Contributor Author

VihasMakwana commented Oct 7, 2024

@pierrehilbert @cmacknz I'll raise a POC PR for this and see how can we fix this, permanently.

Last week, I considered a solution and here’s what I came up with:

Accessing PID 0 and PID 4 on Windows is unnecessary, as these are protected processes.

  • For users with root privileges, there's no need to mark the module as degraded if they can’t access a process; this typically indicates it’s a protected process. Instead, we can log the error and maintain a healthy status.
  • The proposed flow is as follows:
    • Attempt to read basic metrics that require minimal permissions.
      - If this attempt fails, mark the module as degraded. This will most likely raise "access denied".
    • Next, try to read additional metrics that require higher permissions.
      - If this attempt fails, it’s okay to log the error without altering the module’s status.

cc: @elastic/elastic-agent-data-plane
I'd appreciate your thoughts over this.

@rdner
Copy link
Member

rdner commented Oct 7, 2024

@VihasMakwana I think this makes sense. I only have a comment about this:

  • If this attempt fails, it’s okay to log the error without altering the module’s status.

In this case we need to make sure we have a debug-level log that we failed to fetch additional metrics from a protected process. Error logs would inevitably flood the logs and it would not be a good user experience. If this behavior is expected, it should not be considered an error.

@VihasMakwana
Copy link
Contributor Author

In this case we need to make sure we have a debug-level log that we failed to fetch additional metrics from a protected process. Error logs would inevitably flood the logs and it would not be a good user experience. If this behavior is expected, it should not be considered an error.

Yes. We should log it at a debug level.

@cmacknz
Copy link
Member

cmacknz commented Oct 7, 2024

Accessing PID 0 and PID 4 on Windows is unnecessary, as these are protected processes.

Unnecessary isn't exactly how I'd phrase it, our endpoint-security service runs as a protected process on Windows, and we currently support collecting a limited set of important metrics for it. It particular we can get the CPU and memory usage for inclusion into the calculation of the agent's CPU and memory usage on the host. This support was added in elastic/elastic-agent-system-metrics#104. This introduced the concept of a non-fatal error originally, which might help you implement the logic you propose.

Can we get the CPU and memory usage for these PIDs as well, like we can for endpoint?

The proposed flow is as follows:
Attempt to read basic metrics that require minimal permissions.

  • If this attempt fails, mark the module as degraded. This will most likely raise "access denied".
    Next, try to read additional metrics that require higher permissions.
  • If this attempt fails, it’s okay to log the error without altering the module’s status.

👍 this overall makes sense to me. As mentioned above, make sure you test this logic against endpoint-security on Windows as it is a protected service that will hit this case. This potentially gives you a way to write automated tests as well since we have integration tests that install endpoint on Windows.

@VihasMakwana
Copy link
Contributor Author

Can we get the CPU and memory usage for these PIDs as well, like we can for endpoint?

I'll confirm and get back to you.

@VihasMakwana
Copy link
Contributor Author

@cmacknz Here's a summary after running countless tests on my windows machine:

  • For pid 0,
    • We can never open a handle for pid 0 (system idle process). From Windows OpenProcess doc,

      If the specified process is the System Idle Process (0x00000000), the function fails and the last error code is
      ERROR_INVALID_PARAMETER

    • As a result, we can't gather any metrics for it.

    • On task manager, I saw the cpu utilisation for pid 0, which I believe is calculated as 100 - ${cpu usage of all other processes}. Same for memory usage.

  • For pid 4:
    • We can access following metrics for pid 4:
      • CPU metrics
      • Memory metrics
      • start time
    • We cannot access following data for pid 4:
      • executable path, process arguments, process owner's username
  • For system-owned protected processes:
    • We can access following metrics for protected processes
      • CPU metrics
      • Memory metrics
      • start time
    • We cannot access following data for protected processes:
      • executable path, process arguments, process owner's username
  • For system-owned normal processes (unprotected):
    • All metrics and data are accessible.
  • For other processes:
    • All metrics and data are accessible.

@cmacknz
Copy link
Member

cmacknz commented Oct 8, 2024

That's an awesome summary, thank you!

I'd suggest adding this summary to the system module permissions section (in a new Windows sub-section) https://github.com/elastic/beats/blob/main/metricbeat/docs/modules/system.asciidoc#required-permissions.

Adding that same section to the system integration is also a good idea https://github.com/elastic/integrations/tree/main/packages/system

@leehinman
Copy link
Contributor

@VihasMakwana great investigation. This looks consistent with what users would see with Powershell, ProcessExplorer and TaskManager (see below), which is where I think we need to be.

PowerShell Get-Process:

PS C:\Users\vagrant> Get-Process -ID 0

Handles  NPM(K)    PM(K)      WS(K)     CPU(s)     Id  SI ProcessName
-------  ------    -----      -----     ------     --  -- -----------
      0       0       60          8                 0   0 Idle


PS C:\Users\vagrant> Get-Process -ID 4

Handles  NPM(K)    PM(K)      WS(K)     CPU(s)     Id  SI ProcessName
-------  ------    -----      -----     ------     --  -- -----------
   2072       0       36        120       8.75      4   0 System

ProcessExplorer

Image

TaskManager

Image

VihasMakwana added a commit to elastic/elastic-agent-system-metrics that referenced this issue Oct 25, 2024
- Enhancement

We can ignore the error in two cases:
- While reading the process executable name. 
- For pid 4, this call fails as we can't access the executable name via
the system call. Same for other kernel-level processes.
- While finding the owner for a particular process.
- We try to open the process token via `syscall.OpenProcessToken`and we
can't access the token for protected processes , even as an
administrator.
    
it's okay to ignore these errors and move forward as we can access few
other metrics (memory, cpu).
    
More context
[here](elastic/beats#40484 (comment))

Relates elastic/beats#40484
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Projects
None yet
Development

No branches or pull requests

5 participants