[TEST] v1.4.1 release testing - One node upgrade with PCI device #1787

TachunLin opened this issue Jan 14, 2025 · 1 comment

TachunLin commented Jan 14, 2025

What's the test to develop? Please describe

Perform the one-node upgrade with a PCI device for v1.4.1 release testing

Prerequisite and dependency of test

  • PCI enabled
  • VLAN 1 network on mgmt and 1 network on other NICs
  • 2 virtual machines with data and md5sum computed (1 running, 1 stopped); see the checksum sketch after this list
  • Enable PCI passthrough in addons and enable passthrough on a PCI device (GPU or NIC)
  • 1 VM with a PCI device (GPU/NIC) assigned and running
  • Validate that the device is working
  • 2 VM backups and snapshots: 1 backup taken while the VM is running and 1 while it is stopped
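
A minimal sketch of how the test data and checksums can be seeded inside each VM for the later restore check (the file path and size are arbitrary examples):

```
# Inside each test VM: write some data and record its checksum
dd if=/dev/urandom of=/root/testdata.bin bs=1M count=100
md5sum /root/testdata.bin > /root/testdata.md5

# After restoring a backup, verify the data survived intact
md5sum -c /root/testdata.md5
```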

Upgrade Test Path:

  • H1.4.0 + R2.9.3 (K1.29) -> H1.4.1 + R2.9.5 (K1.30)

OS:

  • SLES 15 SP6 (sles@ipv4)
  • SLE Micro 6 [sles]

Upgrade Test Check

  1. Can correctly upgrade from v1.4.0 to v1.4.1-rc1 or rc2 (upgrade progress can also be watched from the CLI, as sketched below)
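
A sketch of watching the upgrade from the CLI, assuming kubectl access to the cluster and the Upgrade CRD used by current Harvester releases:

```
# Watch the Upgrade custom resource created when the upgrade starts
kubectl -n harvester-system get upgrades.harvesterhci.io -w

# Inspect the conditions if the upgrade appears stuck
kubectl -n harvester-system describe upgrades.harvesterhci.io
```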

Post Upgrade Check

  1. Dependencies Check

  2. Virtual machines should be stopped during the upgrade and should be accessible once restarted (CLI spot checks for items 2–4 are sketched after this list).

  3. Restore the backups, check the data

  4. Image and volume status

  5. Monitoring chart status

  6. VM operations are validated and working fine.

  7. Import Harvester into a Rancher cluster

  8. Add a node after the upgrade

  9. Create a guest cluster

  10. Upgrade the harvester-cloud-provider and harvester-csi-driver apps

  11. Reboot the Harvester node.
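
A hedged sketch of CLI spot checks for items 2–4; the resource names assume the KubeVirt and Harvester CRDs present on a stock installation:

```
# Item 2: VM and VMI status after restarting the machines
kubectl get vm,vmi -A

# Item 3: backup and restore objects
kubectl get virtualmachinebackups -A
kubectl get virtualmachinerestores -A

# Item 4: image and volume status
kubectl get virtualmachineimages -A
kubectl get pvc -A
```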


TachunLin commented Jan 17, 2025

One node upgrade with PCI device

Test environment

  • Harvester: v1.4.0 upgrade to v1.4.1-rc1
  • Rancher: v2.9.3
  • RKE2 version: v1.29.11
  • OS:
    • SLES-15 SP6
    • SLE Micro 6

Prerequisite and dependency setup

✅ Enable pcidevices-controller addon

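Besides the UI toggle, the addon can be enabled from the CLI; a sketch, assuming the addon name and namespace used by current Harvester releases:

```
# Enable the pcidevices-controller addon by setting spec.enabled
kubectl -n harvester-system patch addons.harvesterhci.io pcidevices-controller \
  --type merge -p '{"spec":{"enabled":true}}'

# Confirm the controller pods come up
kubectl -n harvester-system get pods | grep pcidevices
```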

✅ Enable GPU PCI device passthrough

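Enabling passthrough in the UI is backed by the pcidevices CRDs; a sketch of how to confirm the device and its claim from the CLI:

```
# List the PCI devices discovered by the controller and find the GPU
kubectl get pcidevices | grep -i nvidia

# Enabling passthrough creates a PCIDeviceClaim; it should list the GPU
kubectl get pcideviceclaims
```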

✅ VLAN 2011 network on other NICs

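For reference, a Harvester VM network is backed by a NetworkAttachmentDefinition; a rough sketch of what a VLAN 2011 network corresponds to (the object name, namespace, and cluster network/bridge below are illustrative, not taken from this setup):

```
# Inspect the network object the UI created
kubectl get network-attachment-definitions -A

# Roughly equivalent manifest for a VLAN 2011 network
kubectl apply -f - <<'EOF'
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vlan2011
  namespace: default
  labels:
    network.harvesterhci.io/clusternetwork: nic1
spec:
  config: '{"cniVersion":"0.3.1","type":"bridge","bridge":"nic1-br","promiscMode":true,"vlan":2011,"ipam":{}}'
EOF
```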

✅ Create RKE2 v1.29.11 cluster on SLES-15 SP6 and SLE Micro 6


❌ One VM with PCI GPU device assigned and running
✅ Retest: one VM with PCI GPU device assigned and running
  1. Followed the workaround steps in [BUG] VM with assigned PCI GPU can not start after v1.3.2 to v1.4.0-rc4 upgrade due to no /dev/vfio/xx path  harvester#6892 (comment)
  2. Disabled the GPU PCI device passthrough
  3. Disabled the pcidevices-controller addon
  4. Re-enabled the addon and the GPU PCI device passthrough
  5. This time the PCI GPU device could be correctly assigned to the VM; a quick host-side check is sketched below


Screen recording: vokoscreenNG-2025-01-18_23-18-01.mp4
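
A quick host-side check that the workaround took effect, per the /dev/vfio path issue referenced above (a sketch; run on the Harvester node):

```
# The GPU should be bound to the vfio-pci driver
lspci -nnk | grep -A 3 -i nvidia

# The vfio group device must exist before the VM can start
ls -l /dev/vfio/
```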
✅ Install Nvidia GPU driver and cuda-toolkit
  1. Complete the environment setup:
     # zypper install pciutils gcc git-core
     # zypper addrepo https://download.opensuse.org/repositories/home:pontostroy/openSUSE_Factory/home:pontostroy.repo
     # zypper refresh
     # zypper install libOpenCL1
     # zypper addrepo https://developer.download.nvidia.com/compute/cuda/repos/sles15/x86_64/cuda-sles15.repo
     # SUSEConnect --product PackageHub/15.6/x86_64
     # zypper refresh
     # zypper install nvidia-open-driver-G06-kmp-default nvidia-compute-utils-G06
     # reboot
  2. Check setup

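A sketch of the usual driver checks behind this step:

```
# Verify the kernel modules loaded and the driver can see the GPU
lsmod | grep nvidia
nvidia-smi
```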

  3. Check and run a CUDA sample
```
Install CUDA Toolkit
# zypper in cuda-toolkit

Run CUDA Sample
# git clone https://github.com/nvidia/cuda-samples
# cd cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm
# make
# ./cudaTensorCoreGemm
```

