CPU metrics are incorrect on Windows machines with more than 64 cores #40926
Comments
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
fyi @flexitrev
It is not immediately clear how to solve this, and the solution does not look simple, especially without being able to use the Task Manager implementation as a reference. I suspect we will need something like a thread or process per processor group, each with its processor group affinity set so that it gets scheduled into a different processor group.
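Any per-group approach first needs to know how many processor groups the machine has. A minimal sketch of that step, assuming Python with `ctypes` purely for illustration (Metricbeat itself would do this differently): it calls the Win32 functions `GetActiveProcessorGroupCount` and `GetActiveProcessorCount`, and the helper name `processor_group_sizes` is made up for this example. It has not been tried on a machine with more than 64 logical processors.

```python
import ctypes
import sys


def processor_group_sizes():
    """Return the number of active logical processors in each processor group.

    Returns None on non-Windows platforms, where processor groups do not exist.
    """
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    # WORD GetActiveProcessorGroupCount(void)
    kernel32.GetActiveProcessorGroupCount.restype = ctypes.c_ushort
    kernel32.GetActiveProcessorGroupCount.argtypes = []
    # DWORD GetActiveProcessorCount(WORD GroupNumber)
    kernel32.GetActiveProcessorCount.restype = ctypes.c_ulong
    kernel32.GetActiveProcessorCount.argtypes = [ctypes.c_ushort]
    group_count = kernel32.GetActiveProcessorGroupCount()
    return [kernel32.GetActiveProcessorCount(g) for g in range(group_count)]


if __name__ == "__main__":
    sizes = processor_group_sizes()
    if sizes is None:
        print("processor groups are a Windows-only concept")
    else:
        print(f"{len(sizes)} group(s), sizes: {sizes}")
```

A result with more than one entry would indicate a machine where a single `GetSystemTimes` call is expected to see only part of the system.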
Is this documented anywhere? I don't see it in the
My reading of this MS doc suggests our threads can switch between processor groups. Here's Python pseudocode to do this. I don't have a >64T system to test with.

# Module-level state: processor group number -> (idle, kernel, user) from the last call
last_system_times = {}

def get_cpu_time_deltas():
    idle_time_delta, kernel_time_delta, user_time_delta = 0, 0, 0
    # Backup original affinity
    GetThreadGroupAffinity(GetCurrentThread(), &saved_group_affinity)
    # Enumerate each NUMA node (the highest node number is inclusive, hence the +1)
    for node in range(GetNumaHighestNodeNumber() + 1):
        # Enumerate all the processor groups for this NUMA node
        for group_affinity in GetNumaNodeProcessorMask2(node):
            # Switch to this processor group
            SetThreadGroupAffinity(GetCurrentThread(), &group_affinity)
            # Retrieve metrics for this processor group
            GetSystemTimes(&idle, &kernel, &user)
            # Retrieve old values; a group seen for the first time has no baseline yet
            if group_affinity.group in last_system_times:
                last_idle, last_kernel, last_user = last_system_times[group_affinity.group]
                # Compute deltas
                idle_time_delta += (idle - last_idle)
                kernel_time_delta += (kernel - last_kernel)
                user_time_delta += (user - last_user)
            # Store latest values
            last_system_times[group_affinity.group] = (idle, kernel, user)
    # Restore original affinity
    SetThreadGroupAffinity(GetCurrentThread(), &saved_group_affinity)
    return idle_time_delta, kernel_time_delta, user_time_delta
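Since no >64T system was available, the per-group bookkeeping can at least be checked on any machine by stubbing out the Windows calls. The sketch below is an assumption-laden stand-in, not the proposed implementation: `PerGroupCpuDeltas` and the `read_group_times` callable are hypothetical names, and the callable takes the place of "switch affinity to this group, then call GetSystemTimes".

```python
from typing import Callable, Dict, Iterable, Tuple

Times = Tuple[int, int, int]  # cumulative (idle, kernel, user) counters


class PerGroupCpuDeltas:
    """Keep one baseline per processor group and sum the per-group deltas."""

    def __init__(self, read_group_times: Callable[[int], Times]):
        # Hypothetical reader: returns cumulative counters for one group.
        self._read = read_group_times
        self._last: Dict[int, Times] = {}

    def sample(self, groups: Iterable[int]) -> Times:
        totals = [0, 0, 0]
        for group in groups:
            current = self._read(group)
            previous = self._last.get(group)
            # A group seen for the first time contributes no delta yet.
            if previous is not None:
                for i in range(3):
                    totals[i] += current[i] - previous[i]
            self._last[group] = current
        return tuple(totals)


if __name__ == "__main__":
    # Made-up counters for two processor groups.
    fake = {0: (100, 50, 20), 1: (10, 5, 2)}
    tracker = PerGroupCpuDeltas(lambda g: fake[g])
    print(tracker.sample([0, 1]))  # no baseline yet: (0, 0, 0)
    fake[0] = (130, 60, 25)
    fake[1] = (15, 6, 4)
    print(tracker.sample([0, 1]))  # (35, 11, 7)
```

Because each group is only ever compared with its own previous reading, the summed deltas cannot go negative as long as the underlying counters are monotonic.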
On Windows, Metricbeat measures CPU use via the Windows API call GetSystemTimes. Each metrics interval, it fetches the CPU numbers and compares them to the previous measurement to determine CPU load during that interval. On most systems this covers CPU time "including all threads in all processes, on all processors". However, on systems with more than 64 cores, it returns only the data for the current processor group of up to 64 cores.
This has two consequences on high-core machines. First, the reported metrics reflect only the cores in the current processor group rather than the whole machine. Second, if Metricbeat is scheduled onto a different processor group between measurements, GetSystemTimes returns data from a different set of cores. If the new processor group has a lower CPU total than the previous one, Metricbeat will report negative numbers for some CPU metrics.
The most visible symptom is occasional negative numbers in CPU-related metrics, especially in pairs of adjacent data points.
This seems to apply to ~all Metricbeat versions, and all versions of Windows that support more than 64 CPU cores.
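The arithmetic behind the negative values can be reproduced with a small simulation. This is a sketch with invented numbers, not output from a real system: the two counter series and the `scheduled_on` sequence are hypothetical, and `interval_deltas` only mimics differencing against a single previous reading.

```python
def interval_deltas(samples):
    """Difference consecutive cumulative readings against a single baseline."""
    return [cur - prev for prev, cur in zip(samples, samples[1:])]


def observed_readings(counters, scheduled_on):
    """Pick, for each measurement, the counter of the group the thread is in."""
    return [counters[group][i] for i, group in enumerate(scheduled_on)]


# Hypothetical cumulative CPU-time counters, one series per processor group.
counters = {
    "a": [1000, 1010, 1020, 1030, 1040],  # group with the higher total
    "b": [400, 405, 410, 415, 420],       # group with the lower total
}

# The collecting thread lands in group "b" for the third measurement only.
scheduled_on = ["a", "a", "b", "a", "a"]

readings = observed_readings(counters, scheduled_on)
print(readings)                   # [1000, 1010, 410, 1030, 1040]
print(interval_deltas(readings))  # [10, -600, 620, 10]
```

In this toy run a single excursion to the other group distorts two adjacent intervals: a negative delta on the way in and an inflated one on the way back.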