This repository has been archived by the owner on Oct 16, 2020. It is now read-only.
In a bare-metal cluster of 24 desktop machines that boot CoreOS via PXE, machines randomly freeze from time to time. All machines have been affected over time, and the failure looks exactly the same on every machine. See below for a detailed description of the freeze.
Container Linux Version
$ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=2317.0.1
VERSION_ID=2317.0.1
BUILD_ID=2019-11-06-2121
PRETTY_NAME="Container Linux by CoreOS 2317.0.1 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"
Environment
The cluster is configured to host Kubernetes. All 24 machines have Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz with 32 GB memory and a 220 GB SSD. The machines are booted via PXE with the following ignition config:
Expected Behavior
No freezes.
Actual Behavior
When a machine freezes, it remains powered on. The screen shows fully static output (usually some log messages and the login screen), there is no network connectivity (although the NIC is powered and blinks), and the machine does not react to plugging in a keyboard or to keyboard input. Only after a reboot does the machine come back up and behave normally. journald logging stops completely at the moment of the freeze. There is no kernel core dump and no journal entry that indicates the problem.
Reproduction Steps
I don't know whether anyone else can reproduce the problem. However, I can reproduce the failure with roughly 95% probability by deploying an Apache Cassandra cluster on the Kubernetes cluster and running a standard batch data ingest of a few gigabytes, which takes about one hour. One or two machines usually fail as described during that ingest. Since I have no idea how to proceed, I need some help or guidance to find the problem.
Other Information
I have tried various configurations and options:
I have tried the following versions of CoreOS: 1520.9.0, 2079.3.0, 2247.5.0, 2303.0.0, 2317.0.1 (The problem is the same for all of them.)
I have tried different cgroup drivers for Docker: systemd and cgroupfs, as recommended by Kubernetes. (The failure occurs with both.)
I have tried different CNI plugins for Kubernetes: flannel and calico. (The failure occurs with both.)
I have added the debug boot option to get more information about the failure. However, I can't find any suspicious messages. A snippet of the journal log is attached here; the last message before the reboot marks roughly the time of the system freeze:
journalctl --since "2019-11-08 18:42:29" --lines 250
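For anyone trying to capture the same window on their own machines, the following is a minimal sketch of a helper that prints the tail of the journal from the boot before the current one, which is where the last messages before a freeze would end up after the reboot. The function name `prev_boot_tail` is hypothetical, and the sketch assumes the journal is persisted across reboots; on a PXE-booted machine with volatile journal storage the previous boot's entries will not be available.

```shell
# Hypothetical helper: print the last N journal lines of the previous boot.
# Assumption: journald keeps its data across reboots (persistent storage);
# with volatile storage there is no previous boot to read from.
prev_boot_tail() {
    lines="${1:-250}"   # default to 250 lines, matching the command above
    journalctl --boot=-1 --lines="$lines" --no-pager
}
```

Calling `prev_boot_tail 100` after the reboot would show the last 100 entries logged before the machine froze, without having to work out the `--since` timestamp by hand.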