
cpu

Monitor Type: cpu (Source)

Accepts Endpoints: No

Multiple Instances Allowed: No

Overview

This monitor reports CPU time and utilization metrics.

On Linux hosts, this monitor relies on the /proc filesystem. If the underlying host's /proc filesystem is mounted somewhere other than /proc, specify its path with the top-level procPath configuration option:

procPath: /proc
monitors:
 - type: cpu
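To illustrate what procPath changes, here is a minimal Python sketch that reads the CPU lines of the stat file under a configurable proc directory. It assumes, as is conventional on Linux, that CPU time counters come from <procPath>/stat; the helper name read_cpu_lines is hypothetical, and this is not the monitor's own implementation.

```python
import os


def read_cpu_lines(proc_path: str = "/proc") -> list[str]:
    """Return the CPU lines of <proc_path>/stat (hypothetical helper).

    The line labelled "cpu" aggregates the whole system; lines labelled
    "cpu0", "cpu1", ... describe individual cores.
    """
    stat_file = os.path.join(proc_path, "stat")
    with open(stat_file, encoding="ascii") as f:
        # Keep only the CPU time lines; /proc/stat also has intr, ctxt, etc.
        return [line.rstrip("\n") for line in f if line.startswith("cpu")]
```

With the host's /proc mounted into a container at, say, /hostfs/proc (a hypothetical path), the matching setting would be procPath: /hostfs/proc, and the sketch would be called as read_cpu_lines("/hostfs/proc").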

Configuration

To activate this monitor in the Smart Agent, add the following to your agent config:

monitors:  # All monitor config goes under this key
 - type: cpu
   ...  # Additional config

For a list of monitor options that are common to all monitors, see Common Configuration.

| Config option | Required | Type | Description |
| --- | --- | --- | --- |
| reportPerCPU | no | bool | If true, stats are generated for the system as a whole and for each individual CPU core, distinguished by the cpu dimension. If false, stats are generated only for the system as a whole, without a cpu dimension. (default: false) |
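To make the effect of reportPerCPU concrete, the following Python sketch turns /proc/stat CPU lines into (dimensions, counters) pairs. It is hypothetical: the helper name, the use of /proc/stat field names, and leaving the system-wide values without a cpu dimension even when per-core values are reported are assumptions, not taken from the monitor's source.

```python
# First eight fields of each "cpu" line in /proc/stat, in order (see proc(5)).
# Guest time is already counted in user/nice, so trailing guest fields are ignored.
STAT_FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def to_datapoints(cpu_lines, report_per_cpu=False):
    """Turn /proc/stat CPU lines into (dimensions, counters) pairs.

    Hypothetical helper illustrating reportPerCPU: the system-wide "cpu" line
    is always kept and carries no cpu dimension; per-core "cpuN" lines are
    kept only when report_per_cpu is true, tagged with cpu="N".
    """
    points = []
    for line in cpu_lines:
        label, *values = line.split()
        # zip stops after the eight known fields.
        counters = dict(zip(STAT_FIELDS, (int(v) for v in values)))
        if label == "cpu":
            points.append(({}, counters))
        elif report_per_cpu:
            points.append(({"cpu": label[len("cpu"):]}, counters))
    return points
```

For input lines labelled cpu, cpu0 and cpu1, to_datapoints(lines) returns a single system-wide entry, while to_datapoints(lines, report_per_cpu=True) returns three entries whose dimensions are {}, {"cpu": "0"} and {"cpu": "1"}.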

Metrics

These are the metrics available for this monitor. Metrics that are categorized as container/host (default) are in bold and italics in the list below.

  • cpu.idle (cumulative)
    CPU time spent not in any other state. In order to get a percentage this value must be compared against the sum of all CPU states.

  • cpu.interrupt (cumulative)
    CPU time spent while servicing hardware interrupts. A hardware interrupt happens at the physical layer. When this occurs, the CPU will stop whatever else it is doing and service the interrupt. This metric measures how many jiffies were spent handling these interrupts. In order to get a percentage this value must be compared against the sum of all CPU states. A sustained high value for this metric may be caused by faulty hardware such as a broken peripheral.

  • cpu.nice (cumulative)
    CPU time spent in userspace running niced (low-priority) processes. In order to get a percentage this value must be compared against the sum of all CPU states. A sustained high value for this metric may be caused by: 1) The server not having enough CPU capacity for a process, or 2) A programming error that causes a process to use an unexpected amount of CPU.

  • cpu.num_processors (gauge)
    The number of logical processors on the host.

  • cpu.softirq (cumulative)
    CPU time spent while servicing software interrupts. Unlike a hardware interrupt, a software interrupt is raised by the kernel itself, typically to finish work deferred from a hardware interrupt handler (for example, processing received network packets). This metric measures how many jiffies were spent by the CPU handling these interrupts. In order to get a percentage this value must be compared against the sum of all CPU states. A sustained high value for this metric may be caused by very high network or disk I/O rates, since much of that processing is done in software interrupts.

  • cpu.steal (cumulative)
    CPU time spent waiting for a hypervisor to service requests from other virtual machines. This metric is only present on virtual machines. This metric records how much time this virtual machine had to wait to have the hypervisor kernel service a request. In order to get a percentage this value must be compared against the sum of all CPU states. A sustained high value for this metric may be caused by: 1) Another VM on the same hypervisor using too many resources, or 2) An underpowered hypervisor.

  • cpu.system (cumulative)
    CPU time spent running in the kernel. This value reflects how often processes are calling into the kernel for services (e.g., to log to the console). In order to get a percentage this value must be compared against the sum of all CPU states. A sustained high value for this metric may be caused by: 1) A process that needs to be rewritten to use kernel resources more efficiently, or 2) A userspace driver that is broken.

  • cpu.user (cumulative)
    CPU time spent running in userspace. In order to get a percentage this value must be compared against the sum of all CPU states. A sustained high value for this metric may be caused by: 1) A process that requires more CPU than is available on the server, or 2) An application programming error that causes the CPU to be used unexpectedly.

  • cpu.utilization (gauge)
    Percent of CPU used on this host.

  • cpu.utilization_per_core (gauge)
    Percent of CPU used on each core.

  • cpu.wait (cumulative)
    Amount of total CPU time spent idle while waiting for an I/O operation to complete. In order to get a percentage this value must be compared against the sum of all CPU states. A high value for a sustained period may be caused by: 1) A slow hardware device that is taking too long to service requests, or 2) Too many requests being sent to an I/O device.
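Several of the cumulative metrics above say that a percentage requires comparing the value against the sum of all CPU states. Because these counters only grow, the comparison is normally made on the change between two samples rather than on the raw values, which would give an average since boot. The Python sketch below shows that arithmetic; the function name state_percentages and the sample numbers are hypothetical, and it is not necessarily how the agent computes cpu.utilization.

```python
def state_percentages(prev, curr):
    """Share of CPU time (in percent) spent in each state between two samples.

    prev and curr map state names to cumulative counters read from the same
    CPU (or the system-wide aggregate) at two points in time.
    """
    # Change in each counter over the interval between the two samples.
    deltas = {state: curr[state] - prev[state] for state in curr}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("counters did not advance between samples")
    return {state: 100.0 * delta / total for state, delta in deltas.items()}


# Illustrative numbers only; a real calculation should include every
# cumulative CPU state listed above (user, nice, system, idle, wait,
# interrupt, softirq, steal).
before = {"cpu.user": 100, "cpu.system": 50, "cpu.idle": 800, "cpu.wait": 50}
after = {"cpu.user": 130, "cpu.system": 60, "cpu.idle": 850, "cpu.wait": 60}
print(state_percentages(before, after))
# {'cpu.user': 30.0, 'cpu.system': 10.0, 'cpu.idle': 50.0, 'cpu.wait': 10.0}
```

The same function applies to per-core counters when reportPerCPU is enabled, as long as both samples come from the same cpu dimension value.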

Non-default metrics (version 4.7.0+)

To emit metrics that are not emitted by default, add them to the generic monitor-level extraMetrics config option. Metrics that are derived from specific configuration options and do not appear in the list above do not need to be added to extraMetrics.

To see a list of metrics that will be emitted, run agent-status monitors after configuring this monitor in a running agent instance.

Dimensions

The following dimensions may occur on metrics emitted by this monitor. Some dimensions may be specific to certain metrics.

| Name | Description |
| --- | --- |
| cpu | The number or ID of the CPU core on the system. Only present if reportPerCPU: true. |