feat: add checks for performance tuning settings (#89)
marcoppenheimer authored Mar 30, 2023
1 parent b734741 commit b126c3d
Showing 10 changed files with 545 additions and 187 deletions.
54 changes: 54 additions & 0 deletions README.md
@@ -225,6 +225,60 @@ After this is complete, Grafana will show two new dashboards: `Kafka Metrics` an
## Security
Security issues in the Charmed Kafka Operator can be reported through [LaunchPad](https://wiki.ubuntu.com/DebuggingSecurity#How%20to%20File). Please do not file GitHub issues about security issues.

## Performance Tuning
#### Virtual Memory Handling - Recommended
Kafka brokers make heavy use of the OS page cache to maintain performance. They do not normally issue an explicit command to ensure messages have been persisted to disk (`sync`), relying instead on the underlying OS to flush larger chunks (pages) of data from the page cache to disk when the OS deems it efficient and/or necessary to do so. As such, a range of runtime kernel parameters are recommended to be tuned on machines running Kafka to improve performance.

To configure these settings, append them to `/etc/sysctl.conf` and load them with `sudo sysctl -p`. Note that `sudo echo $SETTING >> /etc/sysctl.conf` will not work, as the redirection is performed by the unprivileged shell; use `echo $SETTING | sudo tee -a /etc/sysctl.conf` instead. The settings shown below are simply sensible defaults that may not apply to every workload:
```bash
# ensures low likelihood of memory being assigned to swap-space rather than drop pages from the page cache
vm.swappiness=1
# higher ratio results in less frequent disk flushes and better disk I/O performance
vm.dirty_ratio=80
vm.dirty_background_ratio=5
```
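As a hypothetical illustration (the helper names below are not part of the charm), the recommended values above can be compared against settings parsed in the same `key=value` format:

```python
# Recommended values from the section above.
RECOMMENDED = {
    "vm.swappiness": 1,
    "vm.dirty_ratio": 80,
    "vm.dirty_background_ratio": 5,
}


def parse_sysctl(text: str) -> dict:
    """Parse 'key=value' lines, skipping comments and blank lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = int(value.strip())
    return settings


def non_optimal(current: dict) -> list:
    """Return the recommended keys whose current value differs."""
    return [key for key, value in RECOMMENDED.items() if current.get(key) != value]
```

This mirrors in spirit the checks added in `src/health.py`, which read live values via `sysctl -n` instead of parsing the config file.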

#### Memory Maps - Recommended
Each Kafka log segment requires an `index` file and a `timeindex` file, each of which requires one map area. The OS default maximum number of memory map areas a process may use is set by `vm.max_map_count=65536`. Production deployments with a large number of partitions and log segments are likely to exceed this limit.

It is recommended to set the mmap limit sufficiently higher than the number of memory-mapped files. This can also be written to `/etc/sysctl.conf`:
```bash
vm.max_map_count=<new_mmap_value>
```
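As a rough sizing sketch (the function name is illustrative, and it assumes two map areas per log segment, one each for the `index` and `timeindex` files):

```python
import math


def required_map_areas(partitions: int, partition_size: int, segment_size: int) -> int:
    """Estimate the memory map areas needed: two per log segment per partition."""
    segments_per_partition = math.ceil(partition_size / segment_size)
    return partitions * segments_per_partition * 2


# e.g. 4000 partitions of 10 GiB with 1 GiB segments needs roughly
# 80000 map areas, already above the default vm.max_map_count of 65536
print(required_map_areas(4000, 10 * 2**30, 2**30))
```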

#### File Descriptors - Recommended
Kafka uses file descriptors for log segments and open connections. If a broker hosts many partitions, keep in mind that the broker requires **at least** `(number_of_partitions)*(partition_size/segment_size)` file descriptors to track all of its log segments, in addition to the descriptors needed for its open connections.

In order to configure those limits, update the values and add the following to `/etc/security/limits.d/root.conf`:
```bash
#<domain> <type> <item> <value>
root soft nofile 262144
root hard nofile 1024288
```
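The formula above can be sketched as a quick check; the 80% warning threshold below mirrors the `_check_file_descriptors` health check added in `src/health.py` by this commit (the function names here are illustrative):

```python
def minimum_fd_limit(total_partitions: int, average_partition_size: int, segment_size: int) -> float:
    """Lower bound on open file descriptors, per the formula above."""
    return total_partitions * (average_partition_size / segment_size)


def fd_limit_healthy(
    current_max_files: int,
    total_partitions: int,
    average_partition_size: int,
    segment_size: int,
) -> bool:
    """True while required descriptors stay under 80% of the current limit."""
    required = minimum_fd_limit(total_partitions, average_partition_size, segment_size)
    return required < current_max_files * 0.8


# 1000 partitions of 10 GiB with 1 GiB segments need at least 10000 descriptors,
# comfortably under a 262144 soft limit but far over a default-style 1024 limit
print(fd_limit_healthy(262144, 1000, 10 * 2**30, 2**30))
print(fd_limit_healthy(1024, 1000, 10 * 2**30, 2**30))
```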

#### Networking - Optional
If you are expecting a large amount of network traffic, kernel parameter tuning may help meet that expected demand. These can also be written to `/etc/sysctl.conf`:
```bash
# default send socket buffer size
net.core.wmem_default=
# default receive socket buffer size
net.core.rmem_default=
# maximum send socket buffer size
net.core.wmem_max=
# maximum receive socket buffer size
net.core.rmem_max=
# memory reserved for TCP send buffers
net.ipv4.tcp_wmem=
# memory reserved for TCP receive buffers
net.ipv4.tcp_rmem=
# TCP Window Scaling option
net.ipv4.tcp_window_scaling=
# maximum number of outstanding TCP connection requests
net.ipv4.tcp_max_syn_backlog=
# maximum number of queued packets on the kernel input side (useful for dealing with spikes of network requests)
net.core.netdev_max_backlog=
```

## Contributing

293 changes: 147 additions & 146 deletions poetry.lock

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions pyproject.toml
@@ -49,6 +49,7 @@ optional = true
[tool.poetry.group.fmt.dependencies]
black = "^22.3.0"
ruff = ">=0.0.157"
pyright = "^1.1.300"

[tool.poetry.group.lint]
optional = true
80 changes: 40 additions & 40 deletions requirements.txt
@@ -102,9 +102,9 @@ jsonschema==4.17.3 ; python_version >= "3.8" \
kazoo==2.9.0 ; python_version >= "3.8" \
--hash=sha256:800318c7f3dab648cdf616dfb25bdabeefab2a6837aa6951e0fa3ff80e656969 \
--hash=sha256:b73dd6829ae6b3a78cd37ba393749f9d30860dbe39debb27c36eff159949a14e
ops==2.1.1 ; python_version >= "3.8" \
--hash=sha256:b2802068557c64121bf50ccfe101b043ed08c02ac5f958e55752c476b09de63d \
--hash=sha256:bed9f14b785efaa83f89b9c76b87bd1b3b82b0350c453b41dce778b39c7410b4
ops==2.2.0 ; python_version >= "3.8" \
--hash=sha256:14d4c43f4a4dc0830af1b2cb21313ef93308d3c389f7bedecc14a70b69000b64 \
--hash=sha256:90fa55249f5c3a7bcb7cc73731227eac867f9bb8426100d10724fe461de142a8
pkgutil-resolve-name==1.3.10 ; python_version < "3.9" and python_version >= "3.8" \
--hash=sha256:357d6c9e6a755653cfd78893817c0853af365dd51ec97f3d358a819373bbd174 \
--hash=sha256:ca27cc078d25c5ad71a9de0a7a330146c4e014c2462d9af19c6b828280649c5e
@@ -114,43 +114,43 @@ pure-sasl==0.6.2 ; python_version >= "3.8" \
pycparser==2.21 ; python_version >= "3.8" \
--hash=sha256:8ee45429555515e1f6b185e78100aea234072576aa43ab53aefcae078162fca9 \
--hash=sha256:e644fdec12f7872f86c58ff790da456218b10f863970249516d60a5eaca77206
pydantic==1.10.6 ; python_version >= "3.8" \
--hash=sha256:012c99a9c0d18cfde7469aa1ebff922e24b0c706d03ead96940f5465f2c9cf62 \
--hash=sha256:0abd9c60eee6201b853b6c4be104edfba4f8f6c5f3623f8e1dba90634d63eb35 \
--hash=sha256:12e837fd320dd30bd625be1b101e3b62edc096a49835392dcf418f1a5ac2b832 \
--hash=sha256:163e79386c3547c49366e959d01e37fc30252285a70619ffc1b10ede4758250a \
--hash=sha256:189318051c3d57821f7233ecc94708767dd67687a614a4e8f92b4a020d4ffd06 \
--hash=sha256:1c84583b9df62522829cbc46e2b22e0ec11445625b5acd70c5681ce09c9b11c4 \
--hash=sha256:3091d2eaeda25391405e36c2fc2ed102b48bac4b384d42b2267310abae350ca6 \
--hash=sha256:32937835e525d92c98a1512218db4eed9ddc8f4ee2a78382d77f54341972c0e7 \
--hash=sha256:3a2be0a0f32c83265fd71a45027201e1278beaa82ea88ea5b345eea6afa9ac7f \
--hash=sha256:3ac1cd4deed871dfe0c5f63721e29debf03e2deefa41b3ed5eb5f5df287c7b70 \
--hash=sha256:3ce13a558b484c9ae48a6a7c184b1ba0e5588c5525482681db418268e5f86186 \
--hash=sha256:415a3f719ce518e95a92effc7ee30118a25c3d032455d13e121e3840985f2efd \
--hash=sha256:43cdeca8d30de9a897440e3fb8866f827c4c31f6c73838e3a01a14b03b067b1d \
--hash=sha256:476f6674303ae7965730a382a8e8d7fae18b8004b7b69a56c3d8fa93968aa21c \
--hash=sha256:4c19eb5163167489cb1e0161ae9220dadd4fc609a42649e7e84a8fa8fff7a80f \
--hash=sha256:4ca83739c1263a044ec8b79df4eefc34bbac87191f0a513d00dd47d46e307a65 \
--hash=sha256:528dcf7ec49fb5a84bf6fe346c1cc3c55b0e7603c2123881996ca3ad79db5bfc \
--hash=sha256:53de12b4608290992a943801d7756f18a37b7aee284b9ffa794ee8ea8153f8e2 \
--hash=sha256:587d92831d0115874d766b1f5fddcdde0c5b6c60f8c6111a394078ec227fca6d \
--hash=sha256:60184e80aac3b56933c71c48d6181e630b0fbc61ae455a63322a66a23c14731a \
--hash=sha256:6195ca908045054dd2d57eb9c39a5fe86409968b8040de8c2240186da0769da7 \
--hash=sha256:61f1f08adfaa9cc02e0cbc94f478140385cbd52d5b3c5a657c2fceb15de8d1fb \
--hash=sha256:72cb30894a34d3a7ab6d959b45a70abac8a2a93b6480fc5a7bfbd9c935bdc4fb \
--hash=sha256:751f008cd2afe812a781fd6aa2fb66c620ca2e1a13b6a2152b1ad51553cb4b77 \
--hash=sha256:89f15277d720aa57e173954d237628a8d304896364b9de745dcb722f584812c7 \
--hash=sha256:8c32b6bba301490d9bb2bf5f631907803135e8085b6aa3e5fe5a770d46dd0160 \
--hash=sha256:acc6783751ac9c9bc4680379edd6d286468a1dc8d7d9906cd6f1186ed682b2b0 \
--hash=sha256:b1eb6610330a1dfba9ce142ada792f26bbef1255b75f538196a39e9e90388bf4 \
--hash=sha256:b243b564cea2576725e77aeeda54e3e0229a168bc587d536cd69941e6797543d \
--hash=sha256:b41822064585fea56d0116aa431fbd5137ce69dfe837b599e310034171996084 \
--hash=sha256:bbd5c531b22928e63d0cb1868dee76123456e1de2f1cb45879e9e7a3f3f1779b \
--hash=sha256:cf95adb0d1671fc38d8c43dd921ad5814a735e7d9b4d9e437c088002863854fd \
--hash=sha256:e277bd18339177daa62a294256869bbe84df1fb592be2716ec62627bb8d7c81d \
--hash=sha256:ea4e2a7cb409951988e79a469f609bba998a576e6d7b9791ae5d1e0619e1c0f2 \
--hash=sha256:f9289065611c48147c1dd1fd344e9d57ab45f1d99b0fb26c51f1cf72cd9bcd31 \
--hash=sha256:fd9b9e98068fa1068edfc9eabde70a7132017bdd4f362f8b4fd0abed79c33083
pydantic==1.10.7 ; python_version >= "3.8" \
--hash=sha256:01aea3a42c13f2602b7ecbbea484a98169fb568ebd9e247593ea05f01b884b2e \
--hash=sha256:0cd181f1d0b1d00e2b705f1bf1ac7799a2d938cce3376b8007df62b29be3c2c6 \
--hash=sha256:10a86d8c8db68086f1e30a530f7d5f83eb0685e632e411dbbcf2d5c0150e8dcd \
--hash=sha256:193924c563fae6ddcb71d3f06fa153866423ac1b793a47936656e806b64e24ca \
--hash=sha256:464855a7ff7f2cc2cf537ecc421291b9132aa9c79aef44e917ad711b4a93163b \
--hash=sha256:516f1ed9bc2406a0467dd777afc636c7091d71f214d5e413d64fef45174cfc7a \
--hash=sha256:6434b49c0b03a51021ade5c4daa7d70c98f7a79e95b551201fff682fc1661245 \
--hash=sha256:64d34ab766fa056df49013bb6e79921a0265204c071984e75a09cbceacbbdd5d \
--hash=sha256:670bb4683ad1e48b0ecb06f0cfe2178dcf74ff27921cdf1606e527d2617a81ee \
--hash=sha256:68792151e174a4aa9e9fc1b4e653e65a354a2fa0fed169f7b3d09902ad2cb6f1 \
--hash=sha256:701daea9ffe9d26f97b52f1d157e0d4121644f0fcf80b443248434958fd03dc3 \
--hash=sha256:7d45fc99d64af9aaf7e308054a0067fdcd87ffe974f2442312372dfa66e1001d \
--hash=sha256:80b1fab4deb08a8292d15e43a6edccdffa5377a36a4597bb545b93e79c5ff0a5 \
--hash=sha256:82dffb306dd20bd5268fd6379bc4bfe75242a9c2b79fec58e1041fbbdb1f7914 \
--hash=sha256:8c7f51861d73e8b9ddcb9916ae7ac39fb52761d9ea0df41128e81e2ba42886cd \
--hash=sha256:950ce33857841f9a337ce07ddf46bc84e1c4946d2a3bba18f8280297157a3fd1 \
--hash=sha256:976cae77ba6a49d80f461fd8bba183ff7ba79f44aa5cfa82f1346b5626542f8e \
--hash=sha256:9f6f0fd68d73257ad6685419478c5aece46432f4bdd8d32c7345f1986496171e \
--hash=sha256:a7cd2251439988b413cb0a985c4ed82b6c6aac382dbaff53ae03c4b23a70e80a \
--hash=sha256:abfb7d4a7cd5cc4e1d1887c43503a7c5dd608eadf8bc615413fc498d3e4645cd \
--hash=sha256:ae150a63564929c675d7f2303008d88426a0add46efd76c3fc797cd71cb1b46f \
--hash=sha256:b0f85904f73161817b80781cc150f8b906d521fa11e3cdabae19a581c3606209 \
--hash=sha256:b4a849d10f211389502059c33332e91327bc154acc1845f375a99eca3afa802d \
--hash=sha256:c15582f9055fbc1bfe50266a19771bbbef33dd28c45e78afbe1996fd70966c2a \
--hash=sha256:c230c0d8a322276d6e7b88c3f7ce885f9ed16e0910354510e0bae84d54991143 \
--hash=sha256:cc1dde4e50a5fc1336ee0581c1612215bc64ed6d28d2c7c6f25d2fe3e7c3e918 \
--hash=sha256:cf135c46099ff3f919d2150a948ce94b9ce545598ef2c6c7bf55dca98a304b52 \
--hash=sha256:cfc83c0678b6ba51b0532bea66860617c4cd4251ecf76e9846fa5a9f3454e97e \
--hash=sha256:d2a5ebb48958754d386195fe9e9c5106f11275867051bf017a8059410e9abf1f \
--hash=sha256:d71e69699498b020ea198468e2480a2f1e7433e32a3a99760058c6520e2bea7e \
--hash=sha256:d75ae19d2a3dbb146b6f324031c24f8a3f52ff5d6a9f22f0683694b3afcb16fb \
--hash=sha256:dfe2507b8ef209da71b6fb5f4e597b50c5a34b78d7e857c4f8f3115effaef5fe \
--hash=sha256:e0cfe895a504c060e5d36b287ee696e2fdad02d89e0d895f83037245218a87fe \
--hash=sha256:e79e999e539872e903767c417c897e729e015872040e56b96e67968c3b918b2d \
--hash=sha256:ecbbc51391248116c0a055899e6c3e7ffbb11fb5e2a4cd6f2d0b93272118a209 \
--hash=sha256:f4a2b50e2b03d5776e7f21af73e2070e1b5c0d0df255a827e7c632962f8315af
pyrsistent==0.19.3 ; python_version >= "3.8" \
--hash=sha256:016ad1afadf318eb7911baa24b049909f7f3bb2c5b1ed7b6a8f21db21ea3faa8 \
--hash=sha256:1a2994773706bbb4995c31a97bc94f1418314923bd1048c6d964837040376440 \
6 changes: 6 additions & 0 deletions src/charm.py
@@ -26,6 +26,7 @@

from auth import KafkaAuth
from config import KafkaConfig
from health import KafkaHealth
from literals import (
ADMIN_USER,
CHARM_KEY,
@@ -58,6 +59,7 @@ def __init__(self, *args):
self.kafka_config = KafkaConfig(self)
self.tls = KafkaTLS(self)
self.provider = KafkaProvider(self)
self.health = KafkaHealth(self)
self.restart = RollingOpsManager(self, relation="restart", callback=self._restart)

self.framework.observe(getattr(self.on, "start"), self._on_start)
@@ -185,6 +187,10 @@ def _on_update_status(self, _: EventBase) -> None:
self._set_status(Status.ZK_NOT_CONNECTED)
return

if not self.health.machine_configured():
self._set_status(Status.SYSCONF_NOT_OPTIMAL)
return

self._set_status(Status.ACTIVE)

def _on_storage_attached(self, event: StorageAttachedEvent) -> None:
177 changes: 177 additions & 0 deletions src/health.py
@@ -0,0 +1,177 @@
#!/usr/bin/env python3
# Copyright 2023 Canonical Ltd.
# See LICENSE file for licensing details.

"""Manager for handling Kafka machine health."""

import json
import logging
import subprocess
from statistics import mean
from typing import TYPE_CHECKING, Tuple

from ops.framework import Object

if TYPE_CHECKING:
from charm import KafkaCharm

logger = logging.getLogger(__name__)


class KafkaHealth(Object):
"""Manager for handling Kafka machine health."""

def __init__(self, charm) -> None:
super().__init__(charm, "kafka_health")
self.charm: "KafkaCharm" = charm

@property
def _service_pid(self) -> int:
"""Gets most recent Kafka service pid from the snap logs."""
return self.charm.snap.get_service_pid()

def _get_current_memory_maps(self) -> int:
"""Gets the current number of memory maps for the Kafka process."""
return int(
subprocess.check_output(
f"cat /proc/{self._service_pid}/maps | wc -l",
shell=True,
stderr=subprocess.PIPE,
universal_newlines=True,
)
)

def _get_current_max_files(self) -> int:
"""Gets the current file descriptor limit for the Kafka process."""
return int(
subprocess.check_output(
rf"cat /proc/{self._service_pid}/limits | grep files | awk '{{print $5}}'",
shell=True,
stderr=subprocess.PIPE,
universal_newlines=True,
)
)

def _get_max_memory_maps(self) -> int:
"""Gets the current memory map limit for the machine."""
return int(
subprocess.check_output(
"sysctl -n vm.max_map_count",
shell=True,
stderr=subprocess.PIPE,
universal_newlines=True,
)
)

def _get_vm_swappiness(self) -> int:
"""Gets the current vm.swappiness configured for the machine."""
return int(
subprocess.check_output(
"sysctl -n vm.swappiness",
shell=True,
stderr=subprocess.PIPE,
universal_newlines=True,
)
)

def _get_partitions_size(self) -> Tuple[int, int]:
"""Gets the number of partitions and their average size from the log dirs."""
log_dirs_command = [
"--describe",
f"--bootstrap-server {','.join(self.charm.kafka_config.bootstrap_server)}",
f"--command-config {self.charm.kafka_config.client_properties_filepath}",
]
log_dirs = self.charm.snap.run_bin_command(
bin_keyword="log-dirs", bin_args=log_dirs_command
)

dirs = {}
for line in log_dirs.splitlines():
try:
# filters stdout to only relevant lines
dirs = json.loads(line)
break
except json.decoder.JSONDecodeError:
continue

if not dirs:
return (0, 0)

partitions = []
sizes = []
for broker in dirs["brokers"]:
for log_dir in broker["logDirs"]:
for partition in log_dir["partitions"]:
partitions.append(partition["partition"])
sizes.append(int(partition["size"]))

if not sizes or not partitions:
return (0, 0)

average_partition_size = mean(sizes)
total_partitions = len(partitions)

return (total_partitions, average_partition_size)

def _check_memory_maps(self) -> bool:
"""Checks that the number of used memory maps is not approaching threshold."""
max_maps = self._get_max_memory_maps()
current_maps = self._get_current_memory_maps()

# eyeballing warning if 80% used, can be changed
if max_maps * 0.8 <= current_maps:
logger.warning(
f"number of Kafka memory maps {current_maps} is approaching limit of {max_maps} - increase /etc/sysctl.conf vm.max_map_count limit and restart machine"
)
return False

return True

def _check_file_descriptors(self) -> bool:
"""Checks that the number of used file descriptors is not approaching threshold."""
if not self.charm.kafka_config.client_listeners:
return True

total_partitions, average_partition_size = self._get_partitions_size()
segment_size = int(self.charm.config["log_segment_bytes"])

minimum_fd_limit = total_partitions * (average_partition_size / segment_size)
current_max_files = self._get_current_max_files()

# eyeballing warning if 80% used, can be changed
if current_max_files * 0.8 <= minimum_fd_limit:
logger.warning(
f"number of required Kafka file descriptors {minimum_fd_limit} is approaching limit of {current_max_files} - increase /etc/security/limits.d/root.conf limit and restart machine"
)
return False

return True

def _check_vm_swappiness(self) -> bool:
"""Checks that vm.swappiness is configured correctly on the machine."""
vm_swappiness = self._get_vm_swappiness()

if vm_swappiness > 1:
logger.error(
f"machine vm.swappiness setting of {vm_swappiness} is higher than 1 - set /etc/sysctl.conf vm.swappiness=1 and restart machine"
)
return False

return True

def machine_configured(self) -> bool:
"""Checks machine configuration for healthy settings.

Returns:
True if settings are safely configured. Otherwise False.
"""
if not all(
[
self._check_memory_maps(),
self._check_file_descriptors(),
self._check_vm_swappiness(),
]
):
return False

return True
4 changes: 4 additions & 0 deletions src/literals.py
@@ -77,3 +77,7 @@ class Status(Enum):
WaitingStatus("internal broker credentials not yet added"), "INFO"
)
NO_CERT = StatusLevel(WaitingStatus("unit waiting for signed certificates"), "INFO")
SYSCONF_NOT_OPTIMAL = StatusLevel(
ActiveStatus("machine system settings are not optimal - see logs for info"),
"WARNING",
)
