DCGM snap tests #4

Merged 39 commits on Sep 16, 2024
Changes from 37 commits
Commits
e347385
Update structure to comply with bootstack template
Deezzir Sep 6, 2024
89b10bd
Add install instructions to the README
Deezzir Sep 6, 2024
b531437
Add test structure and snap setup
Deezzir Sep 6, 2024
74f7c35
Update .gitignore
Deezzir Sep 6, 2024
ab9f502
Make dcgm-exporter into a daemon
Deezzir Sep 6, 2024
931c25d
Add simple smoke test for dcgm-exporter endpoint
Deezzir Sep 6, 2024
b4f5877
Update README.md
Deezzir Sep 6, 2024
36fd757
Fix format
Deezzir Sep 6, 2024
e257ad7
Remove unnecessary envar for dcgm-exporter daemon
Deezzir Sep 6, 2024
5d5047b
Un-omit the test directory for the additional coverage
Deezzir Sep 11, 2024
5be4161
Remove redundant unit test commands
Deezzir Sep 11, 2024
349ee7a
Make dcgm-exporter listen address configurable
Deezzir Sep 11, 2024
f53e647
Merge branch 'main' into main
Deezzir Sep 11, 2024
48a5059
Remove unittest from Makefile
Deezzir Sep 11, 2024
818bd7a
Improve dcgm-exporter endpoint test
Deezzir Sep 11, 2024
b0e1114
Merge branch 'main' into main
Deezzir Sep 11, 2024
c201332
Add missing trailing whitespace
Deezzir Sep 11, 2024
36a913f
Add simple tests for other components
Deezzir Sep 11, 2024
558007e
Improve dcgmi test
Deezzir Sep 12, 2024
9e24944
Revert README.md
Deezzir Sep 12, 2024
2577dfb
Switch to snap services subcommands
Deezzir Sep 12, 2024
4dfd250
Rename dcgm-exporter snap config
Deezzir Sep 12, 2024
bfb452a
Simplify dcgm-exporter test
Deezzir Sep 12, 2024
6c32bfc
Remove makefile and rename.sh
Deezzir Sep 12, 2024
19faa11
Improve & simplify tests
Deezzir Sep 12, 2024
9774fde
Add func test to the CI
Deezzir Sep 12, 2024
8738bc7
Fix check.yaml
Deezzir Sep 12, 2024
bea3f71
Fix check.yaml
Deezzir Sep 12, 2024
55874a9
Merge build and func CI jobs
Deezzir Sep 12, 2024
b740fa0
Remove redundant step for test job
Deezzir Sep 12, 2024
3138f0a
Revert CI to separate jobs for build and func
Deezzir Sep 12, 2024
c2ea073
Fix artifact path
Deezzir Sep 12, 2024
439376e
Refine
Deezzir Sep 13, 2024
60bdf85
Refinements
Deezzir Sep 13, 2024
76a8eee
Merge branch 'main' into main
Deezzir Sep 13, 2024
6987442
Refine tests and comments
Deezzir Sep 13, 2024
2ea70fd
Align check.yaml with check workflows
Deezzir Sep 13, 2024
4059a6d
Update comments for default configs
Deezzir Sep 13, 2024
bd6f22e
Revert back the bind configs after the test
Deezzir Sep 16, 2024
50 changes: 48 additions & 2 deletions .github/workflows/check.yaml
@@ -27,16 +27,29 @@ jobs:
         with:
           fetch-depth: 0 # Complete git history is required to generate the version from git tags.

+      - name: Set up Python 3.10
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.10"
+
       - name: Install dependencies
         run: |
           sudo apt update
           sudo apt install -y yamllint
+          python -m pip install --upgrade pip
+          # pin tox to the current major version to avoid
+          # workflows breaking all at once when a new major version is released.
+          python -m pip install 'tox<5'

       - name: Lint yaml files
-        run: |
-          yamllint .yamllint snap/snapcraft.yaml
+        run: yamllint .yamllint snap/snapcraft.yaml
+
+      - name: Lint tests
+        run: tox -e lint

   build:
+    needs:
+      - lint
     runs-on: ubuntu-22.04
     steps:
       - uses: actions/checkout@v4
@@ -45,3 +58,36 @@ jobs:

       - name: Verify snap builds successfully
         uses: snapcore/action-build@v1
+
+      - name: Upload the built snap
+        uses: actions/upload-artifact@v4
+        with:
+          name: SNAP_FILE
+          path: dcgm*.snap
+
+  func:
+    needs:
+      - build
+    runs-on: ubuntu-22.04
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0 # Complete git history is required to generate the version from git tags.
+
+      - name: Download the built snap
+        uses: actions/download-artifact@v4
+        with:
+          name: SNAP_FILE
+
+      - name: Set up Python 3.10
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.10"
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          python -m pip install 'tox<5'
+
+      - name: Run functional tests
+        run: tox -e func
16 changes: 16 additions & 0 deletions .gitignore
@@ -1,8 +1,19 @@
# This is a template `.gitignore` file for snaps

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Tests files and dir
.pytest_cache/
.coverage
.tox
.venv
reports/
**/report/
htmlcov/
.mypy_cache

# Log files
*.log
@@ -17,6 +28,11 @@ reports/
# version data
repo-info

# Python builds
deb_dist/
dist/
*.egg-info/

# Snaps
*.snap

2 changes: 1 addition & 1 deletion README.md
@@ -9,4 +9,4 @@ You can build the snap locally by using the command:

```shell
snapcraft --use-lxd
```
62 changes: 62 additions & 0 deletions pyproject.toml
@@ -0,0 +1,62 @@
[tool.flake8]
ignore = ["C901", "D100", "D101", "D102", "D103", "W503", "W504"]
exclude = ['.eggs', '.git', '.tox', '.venv', '.build', 'build', 'report']
max-line-length = 99
max-complexity = 10

[tool.black]
line-length = 99
exclude = '''
/(
| .eggs
| .git
| .tox
| .venv
| .build
| build
| report
)/
'''

[tool.isort]
profile = "black"
skip_glob = [
".eggs",
".git",
".tox",
".venv",
".build",
"build",
"report"
]

[tool.pylint]
max-line-length = 99
ignore = ['.eggs', '.git', '.tox', '.venv', '.build', 'report', 'tests']

[tool.mypy]
warn_unused_ignores = true
warn_unused_configs = true
warn_unreachable = true
disallow_untyped_defs = true
exclude = ['.eggs', '.git', '.tox', '.venv', '.build', 'report', 'tests']

## Ignore unsupported imports
[[tool.mypy.overrides]]
ignore_missing_imports = true
module = ["setuptools"]

[tool.coverage.run]
relative_files = true
source = ["."]
omit = ["docs/**", "lib/**", "snap/**", "build/**", "setup.py"]

[tool.coverage.report]
fail_under = 100
show_missing = true

[tool.coverage.html]
directory = "tests/unit/report/html"

[tool.coverage.xml]
output = "tests/unit/report/coverage.xml"
10 changes: 9 additions & 1 deletion snap/hooks/configure
@@ -3,4 +3,12 @@
 # Register config options if unset,
 # so users can see available options by running
 # `sudo snap get dcgm`.
-[ -z "$(snapctl get nv-hostengine-port)" ] && snapctl set nv-hostengine-port=
+if [ -z "$(snapctl get nv-hostengine-port)" ]; then
+    # Set the option to the default port of the nv-hostengine binary (5555).
+    snapctl set nv-hostengine-port="5555"
+fi
+
+if [ -z "$(snapctl get dcgm-exporter-address)" ]; then
+    # Set the option to the default address of the dcgm-exporter binary (:9400).
+    snapctl set dcgm-exporter-address=":9400"
+fi
14 changes: 14 additions & 0 deletions snap/local/run_dcgm_exporter.sh
@@ -0,0 +1,14 @@
#!/bin/bash
set -euo pipefail

# Build the argument list for the dcgm-exporter command
args=()

# Add the dcgm-exporter-address option if it is set. Default: ":9400"
dcgm_exporter_address="$(snapctl get dcgm-exporter-address)"

if [ -n "$dcgm_exporter_address" ]; then
    args+=("-a" "$dcgm_exporter_address")
fi

exec "$SNAP/bin/dcgm-exporter" "${args[@]}"
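
Reviewer-style note: the wrapper's argument handling can be mirrored in Python to reason about it outside a snap environment. This is only an illustrative sketch; `build_exporter_args` is a hypothetical helper that is not part of this PR, and the only flag it uses is the `-a` option already passed by the script above.

```python
from typing import List


def build_exporter_args(address: str) -> List[str]:
    """Mirror run_dcgm_exporter.sh: pass -a only when an address is configured."""
    args: List[str] = []
    # Same condition as `[ -n "$dcgm_exporter_address" ]` in the wrapper script.
    if address:
        args += ["-a", address]
    return args


if __name__ == "__main__":
    print(build_exporter_args(":9400"))
    print(build_exporter_args(""))
```

With an empty option value the list stays empty, so the exporter binary would fall back to its own default address.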
8 changes: 7 additions & 1 deletion snap/snapcraft.yaml
@@ -21,10 +21,15 @@ title: NVIDIA DCGM

 apps:
   dcgm-exporter:
-    command: bin/dcgm-exporter
+    command: run_dcgm_exporter.sh
     plugs:
       - network-bind
       - opengl
+    daemon: simple
+    # As this is a dcgm snap, not the dcgm-exporter snap,
+    # the user might not be interested in running dcgm-exporter, so disable it by default
+    install-mode: disable
+    restart-condition: on-failure
   dcgmi:
     command: usr/bin/dcgmi
     plugs:
@@ -59,6 +64,7 @@ parts:
     override-build: |
       craftctl default
       chmod +x run_nv_hostengine.sh
+      chmod +x run_dcgm_exporter.sh
   dcgm-exporter:
     after:
       - wrapper
22 changes: 22 additions & 0 deletions tests/functional/conftest.py
@@ -0,0 +1,22 @@
import subprocess

import pytest


@pytest.fixture(scope="session", autouse=True)
def install_dcgm_snap():
    """Install the snap and enable dcgm-exporter service for testing."""
    snap_build_name = "dcgm_*.snap"

    subprocess.run(
        f"sudo snap install --dangerous {snap_build_name}",
        check=True,
        capture_output=True,
        shell=True,
    )

    subprocess.run("sudo snap start dcgm.dcgm-exporter".split(), check=True)

    yield

    subprocess.run("sudo snap remove --purge dcgm".split(), check=True)
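
Reviewer-style note: the install step uses `shell=True` so that the shell expands `dcgm_*.snap`. One possible alternative, sketched here under assumptions, resolves the glob in Python and passes an argument list instead. `find_snap` and `install_command` are hypothetical names that do not appear in this PR; only the `sudo snap install --dangerous` invocation is taken from the fixture above.

```python
from pathlib import Path
from typing import List


def find_snap(directory: str = ".", pattern: str = "dcgm_*.snap") -> Path:
    """Return the single snap file matching the pattern, or raise if ambiguous."""
    matches = sorted(Path(directory).glob(pattern))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one file matching {pattern!r}, found {len(matches)}"
        )
    return matches[0]


def install_command(snap_path: Path) -> List[str]:
    """Build the install command as an argument list, so no shell is needed."""
    return ["sudo", "snap", "install", "--dangerous", str(snap_path)]


if __name__ == "__main__":
    # Only prints the command; running it would require a built snap and sudo.
    try:
        print(install_command(find_snap()))
    except FileNotFoundError as error:
        print(error)
```

The resulting list could be passed to `subprocess.run(..., check=True)` without `shell=True`. Failing when zero or several files match is a design choice of this sketch, not behaviour of the original fixture.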
1 change: 1 addition & 0 deletions tests/functional/requirements.txt
@@ -0,0 +1 @@
tenacity
78 changes: 78 additions & 0 deletions tests/functional/test_snap_dcgm.py
@@ -0,0 +1,78 @@
import json
import subprocess
import urllib.request

import pytest
from tenacity import Retrying, retry, stop_after_delay, wait_fixed


@retry(wait=wait_fixed(5), stop=stop_after_delay(30))
def test_dcgm_exporter():
    """Test of the dcgm-exporter service and its endpoint."""
    dcgm_exporter_service = "snap.dcgm.dcgm-exporter"
    endpoint = "http://localhost:9400/metrics"

    assert 0 == subprocess.call(
        f"sudo systemctl is-active --quiet {dcgm_exporter_service}".split()
    ), f"{dcgm_exporter_service} is not running"

    # Check the exporter endpoint, will raise an exception if the endpoint is not reachable
    response = urllib.request.urlopen(endpoint)

    # The output of the exporter endpoint is not tested
    # as in a virtual environment it will not have any GPU metrics
    assert 200 == response.getcode(), "DCGM exporter endpoint returned an error"


def test_dcgm_nv_hostengine():
    """Check the dcgm-nv-hostengine service."""
    nv_hostengine_service = "snap.dcgm.nv-hostengine"
    nv_hostengine_port = 5555

    assert 0 == subprocess.call(
        f"sudo systemctl is-active --quiet {nv_hostengine_service}".split()
    ), f"{nv_hostengine_service} is not running"

    assert 0 == subprocess.call(
        f"nc -z localhost {nv_hostengine_port}".split()
    ), f"{nv_hostengine_service} is not listening on port {nv_hostengine_port}"


def test_dcgmi():
    """Test of the dcgmi command."""
    result = subprocess.run(
        "dcgm.dcgmi discovery -l".split(), check=True, capture_output=True, text=True
    )

    # Test if the command is working and outputs a table with the GPU ID
    # The table will be empty in a virtual environment, but the command should still work
    assert "GPU ID" in result.stdout.strip(), "DCGMI didn't produce the expected table"


bind_test_data = [
    ("dcgm.dcgm-exporter", "dcgm-exporter-address", ":9466"),
    ("dcgm.nv-hostengine", "nv-hostengine-port", "5566"),
]


@pytest.mark.parametrize("service, config, new_value", bind_test_data)
def test_dcgm_bind_config(service: str, config: str, new_value: str):
    """Test snap bind configuration."""
    result = subprocess.run(
        "sudo snap get dcgm -d".split(), check=True, capture_output=True, text=True
    )
    dcgm_snap_config = json.loads(result.stdout.strip())
    assert config in dcgm_snap_config, f"{config} is not in the snap configuration"

    assert 0 == subprocess.call(
        f"sudo snap set dcgm {config}={new_value}".split()
    ), f"Failed to set {config} to {new_value}"

    # restart the service to apply the new configuration
    subprocess.run(f"sudo snap restart {service}".split(), check=True)

    for attempt in Retrying(wait=wait_fixed(2), stop=stop_after_delay(10)):
        with attempt:
            assert 0 == subprocess.call(
                f"nc -z localhost {new_value.lstrip(':')}".split()
            ), f"{service} is not listening on {new_value}"
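
Reviewer-style note: the tests above shell out to `nc -z` and use `lstrip(':')` to turn an address such as `:9466` into a port. As a rough sketch only, the same check could be written with the standard library. `port_from_address` and `is_listening` are hypothetical helpers that are not part of this PR, and the sketch assumes values of the form `:port`, `host:port`, or a bare port.

```python
import socket


def port_from_address(address: str) -> int:
    """Extract the port from ':9466', 'host:9466', or a bare '5566'."""
    # rpartition keeps everything after the last colon, or the whole string
    # when there is no colon at all.
    _, _, port = address.rpartition(":")
    return int(port)


def is_listening(port: int, host: str = "localhost", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    for value in (":9466", "5566"):
        port = port_from_address(value)
        print(value, port, is_listening(port))
```

Inside the `Retrying` loop, an assertion on `is_listening(port_from_address(new_value))` would play the role of the `nc -z` call and would also accept an address that includes a host part.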
42 changes: 42 additions & 0 deletions tox.ini
@@ -0,0 +1,42 @@
[tox]
skipsdist=True
envlist = lint, unit, func
skip_missing_interpreters = True

[testenv]
basepython = python3
setenv = PYTHONPATH={toxinidir}

[testenv:lint]
commands =
    pflake8
    pylint --recursive=y .
    black --check --diff --color .
    isort --check --diff --color .
deps =
    black
    flake8
    pyproject-flake8
    flake8-docstrings
    pep8-naming
    flake8-colors
    colorama
    isort
    pylint
    {[testenv:func]deps}

[testenv:reformat]
envdir = {toxworkdir}/lint
deps = {[testenv:lint]deps}
commands =
    black .
    isort .

[testenv:func]
deps =
    pytest
    -r {toxinidir}/tests/functional/requirements.txt
passenv =
    TEST_*
commands =
    pytest {toxinidir}/tests/functional {posargs:-v}