-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hostmetrics: use WMI to fetch ppid #35337
Conversation
Scraping metrics on Windows largely has one truly expensive operation: getting the Parent Process ID. gopsutil does this with the [CreateToolhelp32Snapshot](https://learn.microsoft.com/en-us/windows/win32/api/tlhelp32/nf-tlhelp32-createtoolhelp32snapshot) Win32 API call. This takes a snapshot of a large amount of process details and ends up being very expensive. This is causing CPU performance issues for Windows users. This PR does the following in an attempt to address this perforamnce issue: * Repurpose the handlecount package to be for querying any WMI process data and add ParentProcessID to that querying process * Add the featureflag `hostmetrics.process.wmiParentProcessID` which will enable using the WMI query for parent process ID fetching * Whether the featureflag is on or off, this PR also adds a guard around fetching the parent process ID that will only check it if the resource attribute for ppid is enabled. This means users in very resource constrained environments can opt out of that fetching either way
I finally had some time to rework this PR with new changes since the old one was closed when I let it go stale. Hopefully this time we can get it pushed through. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
Not stale |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
Not stale |
@@ -129,6 +130,7 @@ The following settings are optional: | |||
- `mute_process_cgroup_error` (default: false): mute the error encountered when trying to read the cgroup of a process the collector does not have permission to read | |||
- `mute_process_exe_error` (default: false): mute the error encountered when trying to read the executable path of a process the collector does not have permission to read (Linux only) | |||
- `mute_process_user_error` (default: false): mute the error encountered when trying to read a uid which doesn't exist on the system, eg. is owned by a user that only exists in a container. | |||
- `disable_wmi` (default: false) disable the use of Windows Management Interface for fetching process information. Please note that certain metrics cannot be collected without WMI disabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose enable_wmi
(default true
) or maybe even better wmi_enabled
to avoid double negation with disable_wmi: false
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that it is a good idea to tie the data retrieved with the underlying technology to get it. Can't we instead rely on logging proper warning at startup if WMI is not available and use the enable/disable metrics to allow the user to suppress the warnings?
func configureWindowsSettings(s *scraper) error { | ||
if s.config.DisableWMI { | ||
if s.config.Metrics.ProcessHandles.Enabled { | ||
return errProcessHandlesRequiresWMI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be checked in Config::Validate
to fail early and have the collector's validate
command use it.
func (wmiProcessQueryer) wmiProcessQuery() (map[int64]*wmiProcInfo, error) { | ||
processes := []Win32_Process{} | ||
// Based on reflection of Win32_Process type, this creates the following query: | ||
// `get-wmiobject -query "select ProcessId, ParentProcessId, HandleCount from Win32_Process"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method retrieves both parent process ID and process handle count on every run, regardless of whether both are needed.
Would it make sense to call different commands depending on which information is needed? Would it help in performance to for example only retrieve parent process ID and not retrieve handle count, assuming the user didn't enable the process.handles
metric?
The default configuration for the process scraper is to require parent process ID (as the process.parent_pid
resource attribute is enabled by default) and not require handle count (as the process.handles
metric is disabled by default).
@@ -116,7 +116,7 @@ Number of disk operations performed by the process. | |||
|
|||
Number of handles held by the process. | |||
|
|||
This metric is only available on Windows. | |||
This metric is only available on Windows. It requires Windows Management Interface to be enabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This metric is only available on Windows. It requires Windows Management Interface to be enabled. | |
This metric is only available on Windows. It requires [Windows Management Instrumentation (WMI)](https://learn.microsoft.com/en-us/windows/win32/wmisdk/wmi-start-page) to be enabled. |
processes := []Win32_Process{} | ||
// Based on reflection of Win32_Process type, this creates the following query: | ||
// `get-wmiobject -query "select ProcessId, ParentProcessId, HandleCount from Win32_Process"` | ||
q := wmi.CreateQuery(&processes, "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding to @andrzej-stencel point above: wmi.CreateQuery
uses reflection to create a textual representation for the query and that won't change after startup, even, without optimizing to collect only the configured information there is no need to create the query on every "scrape".
Looking at github.com/yusufpapurcu/wmi
it seems that you can keep the same type Win32_Process
with all the fields that you may want depending on the config, even if those fields are not included in the textual query. However, for that to not report an error we need to proper set the WMI client configuration, see https://github.com/yusufpapurcu/wmi/blob/master/wmi.go#L108-L113 - It will be interesting evaluating the cost of that, i.e.: data structure with fields not present in the query.
@@ -129,6 +130,7 @@ The following settings are optional: | |||
- `mute_process_cgroup_error` (default: false): mute the error encountered when trying to read the cgroup of a process the collector does not have permission to read | |||
- `mute_process_exe_error` (default: false): mute the error encountered when trying to read the executable path of a process the collector does not have permission to read (Linux only) | |||
- `mute_process_user_error` (default: false): mute the error encountered when trying to read a uid which doesn't exist on the system, eg. is owned by a user that only exists in a container. | |||
- `disable_wmi` (default: false) disable the use of Windows Management Interface for fetching process information. Please note that certain metrics cannot be collected without WMI disabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that it is a good idea to tie the data retrieved with the underlying technology to get it. Can't we instead rely on logging proper warning at startup if WMI is not available and use the enable/disable metrics to allow the user to suppress the warnings?
Thanks for the reviews @pjanotti and @andrzej-stencel, I'm acknowledging that I've seen them but can't address yet. Right after I asked for that review at the Collector SIG I had to be out for a bit so I wasn't able to get to it. Thanks for getting to it promptly even though I ended up being unable to answer. Will respond next week! |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
Closed as inactive. Feel free to reopen if this PR is still being worked on. |
@braydonk let's know if you would like some help to move this forward, it is a good issue to improve. |
Description:
Scraping process metrics on Windows only really has one truly expensive operation: getting the Parent Process ID. gopsutil does this with the CreateToolhelp32Snapshot Win32 API call. This takes a snapshot of a large amount of process details and ends up being very expensive. This is causing CPU performance issues for Windows users.
This PR does the following in an attempt to address this performance issue:
disable_wmi
to turn WMI off in the rare case where you can't use WMIprocess.ppid
resource attribute is enabled. This means users in very resource constrained environments can opt out of fetching the metric regardless.Link to tracking Issue: #32947
Testing:
I tested this by doing two builds of the collector and comparing the performance against each other. With the WMI feature flag enabled, there is a reduction in the CPU usage. (red = before, green = after)
I also compared the original way of getting parent process ID against disabling the resource attribute. The difference is even more drastic. (red = before, green = after)
Documentation:
Added the new configuration value to the README. Added documentation note to
process.handles
to note that it has a hard requirement on WMI.