All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
- Add memory read error.
- Add support for building NPD on MacOS.
- Add support for containerd health check.
- Add metric for `per-cpu`, `per-stage` timing.
- Fix an issue that kubelet may be restarted by NPD health checker unexpectedly. Make log pattern check configurable in health checker.
- Exit the process when there is a timeout in plugin daemon.
0.8.6 - 2021-01-22
- Windows build now supported.
- Added metrics to retrieve stats such as `procs_running` and `procs_blocked`.
- Added metrics to retrieve network stats.
- Added metric to retrieve guest OS features such as unknown modules, ktd, and kernel integrity.
- Print result's message when status is unknown.
- Fixed custom plugin command timeout when the command spawns a long running child process.
0.8.5 - 2020-11-18
- Added problem detection for buffer I/O error.
- Added CPU load average metrics support.
- Added kubelet apiserver connection check in health checker.
- Will now catch hung task with pattern like `tasks airflow scheduler: *`.
- Better handling to avoid duplicating disk bytes metrics used on fstype and mount types.
- Fixed the deployment yaml to prevent NPD from scheduling onto windows nodes.
- Fixed memory unit for `/proc/meminfo` metrics.
- Fixed OOMKilling detection for new linux kernel v5.1+.
0.8.4 - 2020-09-01
- Added `FSType` and `MountOption` as labels to the metric `disk_usage_bytes`.
- Added `DockerContainerStartupFailure` event in `docker-monitor.json` to detect docker issue docker/for-linux#647.
- Reduced log spam generated by the custom plugin monitor.
0.8.3 - 2020-06-30
- `health-checker` binary now included in the docker image.
- `--enable-repair=true` is now the default for docker and kubelet health checker.
- Custom plugin will now only generate status update log when the status has changed.
- Limit the size of custom plugin output to 4 KB; extra output is drained and discarded.
- Fix a race condition in which services may be killed periodically when `--enable-repair=true` and the systemd service restart time equals the health check period.
0.8.2 - 2020-05-28
- Added an `--event-namespace` flag to make event namespace configurable.
- Added `rhel` support in OS version.
- Added `health-checker` as a custom plugin. The `health-checker` can be used to monitor healthiness of kubelet, docker and CRI container runtimes (e.g. `containerd`, `cri-o`) and restart them if they are not healthy if `enable-repair` is turned on.
- #420 Added missing `lsblk` to the container image.
0.8.1 - 2020-02-25
- Added `host_uptime` metrics for CentOS.
- Now collecting a lot more useful CPU/disk/memory metrics.
- Improved `network_problem.sh` to support `nf_conntrack` and report error when conntrack table is 90% full.
- #366 Fixed building with `ENABLE_JOURNALD=0`.
- Fixed the first 0 value metrics reported for `disk_avg_queue_len`.
- Fix a few metric units for disk metrics and the calculation for `disk_avg_queue_len`.
0.8.0 - 2019-10-30
- Added Stackdriver exporter.
- Added a `k8s-exporter-heartbeat-period` flag to make the heart beat period of K8s exporter configurable.
- Changed the default heart beat period of K8s exporter from `1m` to `5m`.
- Addressed an issue with a panic caused by closing an already closed channel.
- Fixed several potential busy loops.
0.7.1 - 2019-08-27
- Added validation that permanent problems have a preset default condition.
- Empty LogPath will now use journald's default path.
- Systemd monitor now looks back 5 minutes.
- Bumped base image to `registry.k8s.io/debian-base-amd64:1.0.0`.
- Updated the detection method for docker overlay2 issues.
- Moved NPD into the kube-system namespace.
- #202 Fixed an issue that condition can't switch back to false for custom plugins.
0.7.0 - 2019-07-25
- A system stats monitor is added into NPD as a new problem daemon. It collects useful node problem related system stats with OpenCensus such as `disk/io_time`, `disk/weighted_io` and `disk/avg_queue_len`.
- Besides node condition and events, problems detected by existing problem daemons are also collected into OpenCensus as metrics: `problem_counter{reason="PROBLEM_REASON"} xxx` for events and `problem_gauge{reason="PROBLEM_REASON",type="PROBLEM_TYPE"} 1 or 0` for conditions.
- A Prometheus exporter is added to export all OpenCensus metrics collected by NPD through Prometheus.
- A plugin system for problem daemons is added. Problem daemons can be disabled at compile time with build tags, such as `disable_system_stats_monitor`, `disable_system_log_monitor` and `disable_custom_plugin_monitor`.
- A problem exporter interface is added. The original kubernetes problem reporting logic was moved into `k8sexporter`. Prometheus support is implemented as `prometheusexporter`.
0.6.6 - 2019-08-13
- Updated the detection method for docker overlay2 issues.
- #202 Fixed an issue that condition can't switch back to false for custom plugins.
0.6.5 - 2019-07-24
- #295 Added configurable timeout to wait for apiserver to be ready before starting problem detection.
0.6.4 - 2019-06-13
- Switch from godep to go modules resulting in bumping versions of many dependencies.
- Changed custom plugin handling to run immediately on startup.
- #269 Fixed an issue so that using `--version` does not require monitors to be specified.
0.6.3 - 2019-04-05
- Added better handling and reporting when missing required flags.
- Disabled glog writing to files for the log-counter plugin.
0.6.2 - 2019-01-07
- Added resource limits to NPD deployment.
- Added log-counter to dockerfile.
- Added `enable_message_change_based_condition_update` option to enable condition update when messages change for custom plugin.
- #232 Explicitly include libsystemd0 in the image.
0.6.1 - 2018-11-28
- Bumped base image to `registry.k8s.io/debian-base-amd64:0.4.0`.
0.6.0 - 2018-11-27
- Added ConfigMap for NPD config.
- Added readonly filesystem detection.
- Added frequent kubelet/docker restart detection.
- Added corrupt docker overlay2 issue detection.
- Bumped Kubernetes client version to 1.9.
- Updated OOMKilling pattern to support new kernel.
0.5.0 - 2018-06-22
- Added custom problem detector plugin interface.
- Added custom network plugin monitor.
- Added a kernel log counter custom problem detector to detect problems which have the same pattern.
- Changed default port from 10256 to 20256 to avoid conflict with kube-proxy.
- Bumped golang version from 1.8 to 1.9.
- Bumped base image to `registry.k8s.io/debian-base-amd64:0.3`.
- Fixed an error in the labels applied to the daemonset label selector.
0.4.1 - 2017-06-21
- Added docker image pull error detection.
0.4.0 - 2017-04-31
- Added "kernel log generator" container for test purposes.
- Added ABRT adaptor config.
0.3.0 - 2017-03-15
- Added look back support in kernel monitor. Kernel monitor will look back for specified amount of time to detect old problems during each start or restart.
- Added support for running node-problem-detector standalone.
- Added `-hostname-override` option to provide custom node name.
- Added `-port` option to provide custom listening port for service.
- Added `-address` option to define binding address.
- Added journald support.
- Added travis presubmit test.
- Added arbitrary system log support.
- Update kubernetes version to v1.4.0-beta.3
- Only change transition timestamp when condition has changed.
- #47 Don't report KernelDeadlock on `unregister_netdevice` event.
- #48 Use system boot time instead of "StartPattern".
0.2.0 - 2016-08-23
- Add support for some kernel oops detection.
- Change NPD to get node name from `NODE_NAME` env first before `os.Hostname`, and update the example to get node name from downward api and set `NODE_NAME`.
- Initial version of node problem detector.