Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hostmetrics: use WMI to fetch ppid #35337

Closed
wants to merge 8 commits into from

Conversation

braydonk
Copy link
Contributor

Description:
Scraping process metrics on Windows only really has one truly expensive operation: getting the Parent Process ID. gopsutil does this with the CreateToolhelp32Snapshot Win32 API call. This takes a snapshot of a large amount of process details and ends up being very expensive. This is causing CPU performance issues for Windows users.

This PR does the following in an attempt to address this performance issue:

  • Repurpose the handlecount package to be for querying any Window Management Interface process data and add ParentProcessID to that querying process
  • Change the default ParentProcessID fetching process to be through the new WMI fetching API, with a new configuration option disable_wmi to turn WMI off in the rare case where you can't use WMI
  • Whether using WMI or not, this PR also adds a guard around fetching the parent process ID that will only check it if the process.ppid resource attribute is enabled. This means users in very resource constrained environments can opt out of fetching the metric regardless.

Link to tracking Issue: #32947

Testing:
I tested this by doing two builds of the collector and comparing the performance against each other. With the WMI feature flag enabled, there is a reduction in the CPU usage. (red = before, green = after)
perfmon screenshot showing the collector without the feature flag using on average 40% more CPU

I also compared the original way of getting parent process ID against disabling the resource attribute. The difference is even more drastic. (red = before, green = after)
perfmon screenshot showing the collector that has the parent process ID resource attribute labelled using drastically less CPU

Documentation:
Added the new configuration value to the README. Added documentation note to process.handles to note that it has a hard requirement on WMI.

Scraping metrics on Windows largely has one truly expensive operation:
getting the Parent Process ID. gopsutil does this with the
[CreateToolhelp32Snapshot](https://learn.microsoft.com/en-us/windows/win32/api/tlhelp32/nf-tlhelp32-createtoolhelp32snapshot)
Win32 API call. This takes a snapshot of a large amount of process
details and ends up being very expensive. This is causing CPU
performance issues for Windows users.

This PR does the following in an attempt to address this perforamnce
issue:
* Repurpose the handlecount package to be for querying any WMI process
  data and add ParentProcessID to that querying process
* Add the featureflag `hostmetrics.process.wmiParentProcessID` which
  will enable using the WMI query for parent process ID fetching
* Whether the featureflag is on or off, this PR also adds a guard around
  fetching the parent process ID that will only check it if the resource
  attribute for ppid is enabled. This means users in very resource
  constrained environments can opt out of that fetching either way
@braydonk braydonk requested review from dmitryax and a team as code owners September 21, 2024 02:09
@braydonk braydonk changed the title Ppid wmi hostmetrics: use WMI to fetch ppid Sep 21, 2024
@braydonk
Copy link
Contributor Author

I finally had some time to rework this PR with new changes since the old one was closed when I let it go stale. Hopefully this time we can get it pushed through.

Copy link
Contributor

github-actions bot commented Oct 5, 2024

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Oct 5, 2024
@braydonk
Copy link
Contributor Author

braydonk commented Oct 6, 2024

Not stale

@dashpole dashpole removed the Stale label Oct 6, 2024
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Oct 21, 2024
@braydonk
Copy link
Contributor Author

Not stale

@dashpole dashpole removed the Stale label Oct 21, 2024
@atoulme atoulme added the Run Windows Enable running windows test on a PR label Oct 23, 2024
@@ -129,6 +130,7 @@ The following settings are optional:
- `mute_process_cgroup_error` (default: false): mute the error encountered when trying to read the cgroup of a process the collector does not have permission to read
- `mute_process_exe_error` (default: false): mute the error encountered when trying to read the executable path of a process the collector does not have permission to read (Linux only)
- `mute_process_user_error` (default: false): mute the error encountered when trying to read a uid which doesn't exist on the system, eg. is owned by a user that only exists in a container.
- `disable_wmi` (default: false) disable the use of Windows Management Interface for fetching process information. Please note that certain metrics cannot be collected without WMI disabled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose enable_wmi (default true) or maybe even better wmi_enabled to avoid double negation with disable_wmi: false.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that it is a good idea to tie the data retrieved with the underlying technology to get it. Can't we instead rely on logging proper warning at startup if WMI is not available and use the enable/disable metrics to allow the user to suppress the warnings?

func configureWindowsSettings(s *scraper) error {
if s.config.DisableWMI {
if s.config.Metrics.ProcessHandles.Enabled {
return errProcessHandlesRequiresWMI
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be checked in Config::Validate to fail early and have the collector's validate command use it.

func (wmiProcessQueryer) wmiProcessQuery() (map[int64]*wmiProcInfo, error) {
processes := []Win32_Process{}
// Based on reflection of Win32_Process type, this creates the following query:
// `get-wmiobject -query "select ProcessId, ParentProcessId, HandleCount from Win32_Process"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method retrieves both parent process ID and process handle count on every run, regardless of whether both are needed.

Would it make sense to call different commands depending on which information is needed? Would it help in performance to for example only retrieve parent process ID and not retrieve handle count, assuming the user didn't enable the process.handles metric?

The default configuration for the process scraper is to require parent process ID (as the process.parent_pid resource attribute is enabled by default) and not require handle count (as the process.handles metric is disabled by default).

@@ -116,7 +116,7 @@ Number of disk operations performed by the process.

Number of handles held by the process.

This metric is only available on Windows.
This metric is only available on Windows. It requires Windows Management Interface to be enabled.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This metric is only available on Windows. It requires Windows Management Interface to be enabled.
This metric is only available on Windows. It requires [Windows Management Instrumentation (WMI)](https://learn.microsoft.com/en-us/windows/win32/wmisdk/wmi-start-page) to be enabled.

processes := []Win32_Process{}
// Based on reflection of Win32_Process type, this creates the following query:
// `get-wmiobject -query "select ProcessId, ParentProcessId, HandleCount from Win32_Process"`
q := wmi.CreateQuery(&processes, "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding to @andrzej-stencel point above: wmi.CreateQuery uses reflection to create a textual representation for the query and that won't change after startup, even, without optimizing to collect only the configured information there is no need to create the query on every "scrape".

Looking at github.com/yusufpapurcu/wmi it seems that you can keep the same type Win32_Process with all the fields that you may want depending on the config, even if those fields are not included in the textual query. However, for that to not report an error we need to proper set the WMI client configuration, see https://github.com/yusufpapurcu/wmi/blob/master/wmi.go#L108-L113 - It will be interesting evaluating the cost of that, i.e.: data structure with fields not present in the query.

@@ -129,6 +130,7 @@ The following settings are optional:
- `mute_process_cgroup_error` (default: false): mute the error encountered when trying to read the cgroup of a process the collector does not have permission to read
- `mute_process_exe_error` (default: false): mute the error encountered when trying to read the executable path of a process the collector does not have permission to read (Linux only)
- `mute_process_user_error` (default: false): mute the error encountered when trying to read a uid which doesn't exist on the system, eg. is owned by a user that only exists in a container.
- `disable_wmi` (default: false) disable the use of Windows Management Interface for fetching process information. Please note that certain metrics cannot be collected without WMI disabled.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that it is a good idea to tie the data retrieved with the underlying technology to get it. Can't we instead rely on logging proper warning at startup if WMI is not available and use the enable/disable metrics to allow the user to suppress the warnings?

@braydonk
Copy link
Contributor Author

braydonk commented Nov 1, 2024

Thanks for the reviews @pjanotti and @andrzej-stencel, I'm acknowledging that I've seen them but can't address yet. Right after I asked for that review at the Collector SIG I had to be out for a bit so I wasn't able to get to it. Thanks for getting to it promptly even though I ended up being unable to answer. Will respond next week!

Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Nov 16, 2024
Copy link
Contributor

github-actions bot commented Dec 1, 2024

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Dec 1, 2024
@pjanotti
Copy link
Contributor

pjanotti commented Dec 2, 2024

@braydonk let's know if you would like some help to move this forward, it is a good issue to improve.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
receiver/hostmetrics Run Windows Enable running windows test on a PR Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants