Support and test "re-imageable" compute nodes via compute node metadata #518

Open
wants to merge 24 commits into
base: main
Changes from 6 commits
Commits
510cfd0
extend cookiecutter terraform config for compute init script
bertiethorpe Jan 6, 2025
a08f984
Merge branch 'main' into feat/compute-init-cookiecutter
bertiethorpe Jan 6, 2025
8290a31
define default compute init flags
bertiethorpe Jan 7, 2025
b820632
Merge branch 'main' into feat/compute-init-cookiecutter
bertiethorpe Jan 7, 2025
354ce1e
add CI tests for compute node rebuilds
bertiethorpe Jan 7, 2025
b903cdd
document metadata toggle flags and CI workflow
bertiethorpe Jan 7, 2025
2bea51c
review suggestions
bertiethorpe Jan 8, 2025
def6bc3
Merge branch 'main' into feat/compute-init-cookiecutter
bertiethorpe Jan 8, 2025
038ddf7
add delay for ansible-init to finish
bertiethorpe Jan 9, 2025
ed810f2
Merge branch 'main' into feat/compute-init-cookiecutter
bertiethorpe Jan 9, 2025
08eff97
Merge branch 'main' into feat/compute-init-cookiecutter
bertiethorpe Jan 9, 2025
7057c50
remove delay in compute node rebuild ci
bertiethorpe Jan 9, 2025
6d992bf
Merge branch 'main' into feat/compute-init-cookiecutter
bertiethorpe Jan 9, 2025
3faa813
fix compute init metadata flags
bertiethorpe Jan 9, 2025
bc16dba
bump image
bertiethorpe Jan 9, 2025
68561b4
Merge branch 'main' into feat/compute-init-cookiecutter
bertiethorpe Jan 9, 2025
5193ba2
Merge branch 'main' into feat/compute-init-cookiecutter
bertiethorpe Jan 9, 2025
d2e18d0
bump image
bertiethorpe Jan 10, 2025
438ed3a
adjust check_slurm logic to deal with idle* state
bertiethorpe Jan 10, 2025
fd5cbf9
pause in workflow to debug slurm state
bertiethorpe Jan 14, 2025
f661c7f
debug wait on failure
bertiethorpe Jan 14, 2025
81c316a
allow empty compute_init_enable list
bertiethorpe Jan 14, 2025
bccc88b
Merge branch 'main' into feat/compute-init-cookiecutter
bertiethorpe Jan 14, 2025
9897f29
bump images
bertiethorpe Jan 14, 2025
27 changes: 8 additions & 19 deletions .github/workflows/stackhpc.yml
@@ -170,33 +170,22 @@ jobs:
env:
TESTUSER_PASSWORD: ${{ secrets.TEST_USER_PASSWORD }}

# - name: Build environment-specific compute image
# id: packer_build
# run: |
# . venv/bin/activate
# . environments/.stackhpc/activate
# cd packer/
# packer init
# PACKER_LOG=1 packer build -except openstack.fatimage -on-error=ask -var-file=$PKR_VAR_environment_root/builder.pkrvars.hcl openstack.pkr.hcl
# ../dev/output_manifest.py packer-manifest.json # Sets NEW_COMPUTE_IMAGE_ID outputs

# - name: Test reimage of compute nodes to new environment-specific image (via slurm)
# run: |
# . venv/bin/activate
# . environments/.stackhpc/activate
# ansible login -v -a "sudo scontrol reboot ASAP nextstate=RESUME reason='rebuild image:${{ steps.packer_build.outputs.NEW_COMPUTE_IMAGE_ID }}' ${TF_VAR_cluster_name}-compute-[0-3]"
# ansible compute -m wait_for_connection -a 'delay=60 timeout=600' # delay allows node to go down
# ansible-playbook -v ansible/ci/check_slurm.yml

- name: Test reimage of login and control nodes (via rebuild adhoc)
run: |
. venv/bin/activate
. environments/.stackhpc/activate
ansible-playbook -v --limit control,login ansible/adhoc/rebuild.yml
ansible all -m wait_for_connection -a 'delay=60 timeout=600' # delay allows node to go down
ansible-playbook -v ansible/site.yml
ansible-playbook -v ansible/ci/check_slurm.yml

- name: Test reimage of compute nodes and compute-init (via rebuild adhoc)
run: |
. venv/bin/activate
. environments/.stackhpc/activate
ansible-playbook -v --limit compute ansible/adhoc/rebuild.yml
ansible all -m wait_for_connection -a 'delay=60 timeout=600' # delay allows node to go down
ansible-playbook -v ansible/ci/check_slurm.yml

- name: Check sacct state survived reimage
run: |
. venv/bin/activate
31 changes: 28 additions & 3 deletions ansible/roles/compute_init/README.md
@@ -42,10 +42,35 @@ The following roles/groups are currently fully functional:
node and all compute nodes.
- `openhpc`: all functionality

# Development/debugging
All of the above are defined in the skeleton cookiecutter config, and are
toggleable via a terraform compute_init autovar file. In the .stackhpc
environment, the compute init roles are set by default to:
- `enable_compute`: This encompasses the openhpc role functionality while being
a global toggle for the entire compute-init script.
- `etc_hosts`
- `nfs`
- `basic_users`
- `eessi`

# CI workflow

The compute node rebuild is tested in CI after the tests for rebuilding the
login and control nodes. The process is as follows:

1. Compute nodes are reimaged:

ansible-playbook -v --limit compute ansible/adhoc/rebuild.yml

To develop/debug this without actually having to build an image:
2. Ansible-init runs against newly reimaged compute nodes

3. Run sinfo and check nodes have expected slurm state

ansible-playbook -v ansible/ci/check_slurm.yml
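
The contents of `ansible/ci/check_slurm.yml` are not shown in this diff, so the following is only a minimal Python sketch of the kind of check step 3 describes. It assumes input lines of the form `<node> <state>`, such as might be produced by something like `sinfo --Node --noheader --format="%N %t"`, and it treats a trailing `*` (which Slurm appends when a node is not responding) as unhealthy. The function names and the sample node names are hypothetical.

```python
def parse_sinfo(output):
    """Parse lines of '<node> <state>' into {node: (base_state, responding)}."""
    nodes = {}
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        node, state = line.split()
        # A trailing '*' marks a node that is not responding, e.g. 'idle*'.
        responding = not state.endswith("*")
        nodes[node] = (state.rstrip("*"), responding)
    return nodes


def unhealthy_nodes(output, expected_state="idle"):
    """Return nodes that are not in the expected state or are not responding."""
    return sorted(
        node
        for node, (state, responding) in parse_sinfo(output).items()
        if state != expected_state or not responding
    )


# Hypothetical sample output: one healthy node, one 'idle*', one 'down'.
sample = """\
demo-compute-0 idle
demo-compute-1 idle*
demo-compute-2 down
"""
print(unhealthy_nodes(sample))  # ['demo-compute-1', 'demo-compute-2']
```

Whether `idle*` should fail the check outright or be retried until the node responds is a design choice for the CI workflow; the sketch only separates the two pieces of information so that either policy can be applied.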

# Development/debugging

To develop/debug changes to the compute script without actually having to build
a new image:

1. Deploy a cluster using tofu and ansible/site.yml as normal. This will
additionally configure the control node to export compute hostvars over NFS.
@@ -103,7 +128,7 @@ as in step 3.
available v the current approach:

```
[root@rl9-compute-0 rocky]# grep hostvars /mnt/cluster/hostvars/rl9-compute-0/hostvars.yml
[root@rl9-compute-0 rocky]# grep hostvars /mnt/cluster/hostvars/rl9-compute-0/hostvars.yml
"grafana_address": "{{ hostvars[groups['grafana'].0].api_address }}",
"grafana_api_address": "{{ hostvars[groups['grafana'].0].internal_address }}",
"mysql_host": "{{ hostvars[groups['mysql'] | first].api_address }}",
7 changes: 7 additions & 0 deletions environments/.stackhpc/terraform/compute_init.auto.tfvars
@@ -0,0 +1,7 @@
compute_init_enable = [
"compute",
"etc_hosts",
"nfs",
"basic_users",
"eessi"
]
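
As a rough illustration of how this list is consumed: the compute node resource in this PR sets one `enable_<name>` metadata key per feature using `contains(var.compute_init_enable, "<name>")`. The Python sketch below mirrors that mapping for the seven names used in the metadata block. The function and constant names are hypothetical and are not part of the repository.

```python
# Feature names taken from the metadata block of the compute node resource.
COMPUTE_INIT_FEATURES = (
    "compute",
    "resolv_conf",
    "etc_hosts",
    "nfs",
    "manila",
    "basic_users",
    "eessi",
)


def compute_init_metadata(compute_init_enable):
    """Mirror contains(var.compute_init_enable, "<name>") for each feature."""
    enabled = set(compute_init_enable)
    return {f"enable_{name}": name in enabled for name in COMPUTE_INIT_FEATURES}


# The .stackhpc defaults from compute_init.auto.tfvars.
stackhpc_defaults = ["compute", "etc_hosts", "nfs", "basic_users", "eessi"]
flags = compute_init_metadata(stackhpc_defaults)
```

With the `.stackhpc` defaults, `enable_resolv_conf` and `enable_manila` come out false and the other five flags true. An empty list, which is the module default, turns every flag off.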
5 changes: 5 additions & 0 deletions environments/.stackhpc/terraform/main.tf
@@ -58,6 +58,10 @@ variable "k3s_token" {
type = string
}

variable "compute_init_enable" {
type = list(string)
}

data "openstack_images_image_v2" "cluster" {
name = var.cluster_image[var.os_version]
most_recent = true
@@ -74,6 +78,7 @@ module "cluster" {
cluster_image_id = data.openstack_images_image_v2.cluster.id
control_node_flavor = var.control_node_flavor
k3s_token = var.k3s_token
compute_init_enable = var.compute_init_enable

login_nodes = {
login-0: var.other_node_flavor
Original file line number Diff line number Diff line change
@@ -18,4 +18,6 @@ module "compute" {
k3s_token = var.k3s_token
control_address = [for n in openstack_compute_instance_v2.control["control"].network: n.fixed_ip_v4 if n.access_network][0]
security_group_ids = [for o in data.openstack_networking_secgroup_v2.nonlogin: o.id]

compute_init_enable = var.compute_init_enable
}
Original file line number Diff line number Diff line change
@@ -45,9 +45,16 @@ resource "openstack_compute_instance_v2" "compute" {
}

metadata = {
environment_root = var.environment_root
k3s_token = var.k3s_token
control_address = var.control_address
environment_root = var.environment_root
k3s_token = var.k3s_token
control_address = var.control_address
enable_compute = contains(var.compute_init_enable, "compute")
enable_resolv_conf = contains(var.compute_init_enable, "resolv_conf")
enable_etc_hosts = contains(var.compute_init_enable, "etc_hosts")
enable_nfs = contains(var.compute_init_enable, "nfs")
enable_manila = contains(var.compute_init_enable, "manila")
enable_basic_users = contains(var.compute_init_enable, "basic_users")
enable_eessi = contains(var.compute_init_enable, "eessi")
}

user_data = <<-EOF
Original file line number Diff line number Diff line change
@@ -76,3 +76,9 @@ variable "control_address" {
description = "Name/address of control node"
type = string
}

variable "compute_init_enable" {
type = list(string)
description = "Groups to activate for ansible-init compute rebuilds"
default = []
}
Original file line number Diff line number Diff line change
@@ -52,6 +52,7 @@ variable "compute" {
image_id: Overrides variable cluster_image_id
vnic_type: Overrides variable vnic_type
vnic_profile: Overrides variable vnic_profile
compute_init_enable: Toggles ansible-init rebuild
EOF
}

@@ -136,3 +137,9 @@ variable "k3s_token" {
description = "K3s cluster authentication token, set automatically by Ansible"
type = string
}

variable "compute_init_enable" {
type = list(string)
description = "Groups to activate for ansible-init compute rebuilds"
default = []
}
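
Because `compute_init_enable` is declared as a plain `list(string)`, a misspelt entry such as `etc_host` would match none of the `contains(...)` checks and so would leave that feature disabled without any error. The following Python sketch shows one way such entries could be caught before applying. It is an illustration only: the helper name is hypothetical, and the set of known names is simply the seven names that appear in the metadata block of this PR.

```python
# Names recognised by the metadata block in this PR's compute node resource.
KNOWN_COMPUTE_INIT_NAMES = frozenset(
    {"compute", "resolv_conf", "etc_hosts", "nfs", "manila", "basic_users", "eessi"}
)


def unknown_compute_init_names(compute_init_enable):
    """Return entries that would match none of the contains(...) checks."""
    return sorted(set(compute_init_enable) - KNOWN_COMPUTE_INIT_NAMES)


# 'etc_host' is a deliberate typo for 'etc_hosts'.
print(unknown_compute_init_names(["compute", "etc_host", "nfs"]))  # ['etc_host']
```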