Rocky 9.4 Base OS Confluent/Slurm Edition for Linux #2002

Open: tkucherera-lenovo wants to merge 1 commit into base 3.x from the confluent_slurm branch

Conversation

tkucherera-lenovo

This is a recipe that uses confluent for cluster provisioning.

Assumptions

  1. DNS is set up
  2. There is at least one SSH key on the SMS (the key is used for passwordless login on the nodes)

Note

  1. The makerepo.sh file does not check for Rocky Linux, so it had to be modified to detect when the OS is Rocky.
  2. The OpenHPC and EPEL repository directories live in /var/lib/confluent/public on the SMS, whereas the compute nodes reach them via the confluent-public web root.

@adrianreber
Member

Thanks, this is great. I will try it out on our CI systems.

@adrianreber
Member

1. DNS is set up

We have /etc/hosts. I hope that is enough.

2. There is at least one SSH key on the SMS (the key is used for passwordless login on the nodes)

That is also needed for all other recipes. So, no problem.

@adrianreber
Member

The resulting RPMs can be found in the GitHub Actions for the next 24 hours.


github-actions bot commented Aug 7, 2024

Test Results

14 files  ±0  14 suites  ±0   6s ⏱️ ±0s
41 tests ±0  41 ✅ ±0  0 💤 ±0  0 ❌ ±0 
54 runs  ±0  54 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit 8f82c3e. ± Comparison against base commit 8e59d8c.

♻️ This comment has been updated with latest results.

@tkucherera-lenovo
Author

Yes, /etc/hosts should be enough.

\input{common/install_ohpc_components_intro}

\subsection{Enable \OHPC{} repository for local use} \label{sec:enable_repo}
\input{common/enable_local_ohpc_repo_confluent}
Member

I am not aware of the history behind this line from the xcat recipe. In all other recipes we enable the OpenHPC repository by installing the OpenHPC release RPM, which enables a dnf repository pointing at the OpenHPC repository server. Hardcoding the download of the repository tar files feels unnecessary, especially as we do not do it in any of our current testing. Please try to work with the online repository if that works for you.

If you need it for your testing we should put it behind some variable, so that it can be disabled.

Is this strictly necessary for you or can you work with the online repository?

Author

Sure, noted. I will look to work with the online repo.

\subsubsection{Build initial BOS image} \label{sec:assemble_bos}
The following steps illustrate the process to build a minimal, default image for use with \Confluent{}. To begin, you will
first need to have a local copy of the ISO image available for the underlying OS. In this recipe, the relevant ISO image
is \texttt{Rocky-9.4-x86\_64-dvd1.iso} (available from the Rocky
Member

The image I downloaded does not have a "1" in the file name. The filename should be a variable so that it can be easily updated.

@adrianreber
Member

The main point that is currently not clear to me is whether Confluent comes with a DHCP server. I ran the script a couple of times and the two compute nodes were always waiting for DHCP answers in the PXE boot step of the firmware.

@adrianreber
Member

@tkucherera-lenovo Should I try again? Is there now a DHCP server configured, somehow?

For the final merge you can squash the commits. For the main repository it makes no sense to keep your development history with fixups. If you want, you can do separate commits for the docs/ part and the components/ part; that would make sense to me.

Please also add a Signed-off-by to your commit messages as described in https://github.com/openhpc/ohpc/blob/3.x/CONTRIBUTING.md. git commit -s usually does that automatically.
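For example, the whole flow could look like this (assuming the confluent_slurm branch from this PR and origin/3.x as the base):

# squash the fixup commits into a single commit on top of the 3.x base
git fetch origin
git rebase -i origin/3.x

# add the Signed-off-by trailer to the final commit
git commit --amend -s --no-edit

# update the pull request branch
git push --force-with-lease origin confluent_slurm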

@tkucherera-lenovo
Author

Yes, you can try again. Confluent does have its own DHCP server and by default it will respond to DHCP requests. If an environment has its own DHCP server, it is possible to configure confluent not to respond. In this case, though, I believe there was a bug where the setting that allows deployment via PXE was not being applied because the required variable was missing from the input.local file; I have added a fix for that now.
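For reference, the setting in question ends up as a node attribute roughly like the one below; treat the exact attribute name as illustrative on my part and confirm it against the confluent documentation and the generated input.local:

# illustrative only: allow PXE/HTTP network boot for the compute group
nodegroupattrib compute deployment.useinsecureprotocols=firmware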

Going forward I will squash all commits and also add the Signed-off-by to commits.

@adrianreber
Member

@tkucherera-lenovo Is there an easy way to reset the host machine without reinstalling? Where does confluent store its state? Is there a directory I can delete to start from scratch?

@tkucherera-lenovo
Author

The state is stored in /etc/confluent/*, so stopping confluent and running rm -rf /etc/confluent/* resets it. I would also recommend removing the OS profile under the /var/lib/confluent/public/os directory.
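Put together, roughly (assuming the systemd unit is simply named confluent):

# stop confluent and wipe its state
systemctl stop confluent
rm -rf /etc/confluent/*

# also remove any previously built OS profiles
rm -rf /var/lib/confluent/public/os/*

# start fresh and re-run the osdeploy initialize step from the recipe
systemctl start confluent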

@adrianreber
Member

Now I see that the compute nodes are trying to boot:

==> audit <==
Aug 10 10:43:07 {"operation": "update", "target": "/noderange/compute/boot/nextdevice", "allowed": true}
Aug 10 10:43:12 {"operation": "update", "target": "/noderange/compute/power/state", "allowed": true}

==> events <==
Aug 10 10:46:16 {"info": "Offering PXE boot with static address 10.241.58.133 to c2"}
Aug 10 10:46:18 {"info": "Offering PXE boot with static address 10.241.58.132 to c1"}
Aug 10 10:46:25 {"info": "Offering PXE boot with static address 10.241.58.133 to c2"}
Aug 10 10:46:28 {"info": "Offering PXE boot with static address 10.241.58.132 to c1"}

==> /var/log/httpd/access_log <==
10.241.58.133 - - [10/Aug/2024:10:46:32 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot.ipxe HTTP/1.1" 200 227 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.133 - - [10/Aug/2024:10:46:32 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/kernel HTTP/1.1" 200 13605704 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.133 - - [10/Aug/2024:10:46:32 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/initramfs/addons.cpio HTTP/1.1" 200 97792 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.133 - - [10/Aug/2024:10:46:32 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/initramfs/site.cpio HTTP/1.1" 200 3072 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.133 - - [10/Aug/2024:10:46:32 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/initramfs/distribution HTTP/1.1" 200 106800744 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.132 - - [10/Aug/2024:10:46:34 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot.ipxe HTTP/1.1" 200 227 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.132 - - [10/Aug/2024:10:46:34 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/kernel HTTP/1.1" 200 13605704 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.132 - - [10/Aug/2024:10:46:35 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/initramfs/addons.cpio HTTP/1.1" 200 97792 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.132 - - [10/Aug/2024:10:46:35 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/initramfs/site.cpio HTTP/1.1" 200 3072 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.132 - - [10/Aug/2024:10:46:35 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/initramfs/distribution HTTP/1.1" 200 106800744 "-" "iPXE/1.21.1 (g988d2)"

But after that nothing seems to happen. On the console I see:

[console screenshot]

Any recommendations on how to continue?

Also, there seems to be no point during the installation where the script waits for the compute nodes to be ready, so most commands are run while the compute nodes are not yet available. All the customization steps fail with:

+ nodeshell compute echo '"10.241.58.134:/home' /home nfs nfsvers=3,nodev,nosuid 0 '0"' '>>' /etc/fstab
c1: ssh: connect to host c1 port 22: No route to host
c2: ssh: connect to host c2 port 22: No route to host

@adrianreber
Member

Now the installation is working, but it fails in the post-installation scripts. I see the following error on the server:

Aug 11 09:04:06 Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 193, in sync_list_to_node
    sshutil.prep_ssh_key('/etc/confluent/ssh/automation')
  File "/opt/confluent/lib/python/confluent/sshutil.py", line 139, in prep_ssh_key
    subprocess.check_output(['ssh-add', keyname], stdin=devnull, stderr=devnull)
  File "/usr/lib64/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib64/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ssh-add', '/etc/confluent/ssh/automation']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 612, in resourcehandler
    for rsp in resourcehandler_backend(env, start_response):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 635, in resourcehandler_backend
    for res in selfservice.handle_request(env, start_response):
  File "/opt/confluent/lib/python/confluent/selfservice.py", line 520, in handle_request
    result = syncfiles.start_syncfiles(
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 321, in start_syncfiles
    syncrunners[nodename].wait()
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 181, in wait
    return self._exit_event.wait()
  File "/usr/lib/python3.9/site-packages/eventlet/event.py", line 132, in wait
    current.throw(*self._exc)
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 221, in main
    result = function(*args, **kwargs)
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 215, in sync_list_to_node
    raise Exception("Syncing failed due to unreadable files: " + ','.join(unreadablefiles))
Exception: Syncing failed due to unreadable files: /tmp/tmpf19vlfb1/etc/shadow
Aug 11 09:04:07 Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 193, in sync_list_to_node
    sshutil.prep_ssh_key('/etc/confluent/ssh/automation')
  File "/opt/confluent/lib/python/confluent/sshutil.py", line 139, in prep_ssh_key
    subprocess.check_output(['ssh-add', keyname], stdin=devnull, stderr=devnull)
  File "/usr/lib64/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib64/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ssh-add', '/etc/confluent/ssh/automation']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 612, in resourcehandler
    for rsp in resourcehandler_backend(env, start_response):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 635, in resourcehandler_backend
    for res in selfservice.handle_request(env, start_response):
  File "/opt/confluent/lib/python/confluent/selfservice.py", line 520, in handle_request
    result = syncfiles.start_syncfiles(
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 321, in start_syncfiles
    syncrunners[nodename].wait()
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 181, in wait
    return self._exit_event.wait()
  File "/usr/lib/python3.9/site-packages/eventlet/event.py", line 132, in wait
    current.throw(*self._exc)
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 221, in main
    result = function(*args, **kwargs)
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 215, in sync_list_to_node
    raise Exception("Syncing failed due to unreadable files: " + ','.join(unreadablefiles))
Exception: Syncing failed due to unreadable files: /tmp/tmp2qhhryvt/etc/shadow

@tkucherera-lenovo
Author

Hi Adrian, I don't know what state the management server and cluster are in, but usually the error you are seeing happens when the automation SSH key is missing from the /etc/confluent/ssh directory. This key should have been created during the osdeploy initialize step. The input.local file should have an initialize_options variable with the value usklpta, where the a option creates the key in question.

Additionally, to help me debug, could you run the following command:

  1. confluent_selfcheck -n <nodename>

The output is often helpful for debugging. Thanks.

[sms](*\#*) mkdir -p $epel_repo_dir_confluent
[sms](*\#*) (*\install*) dnf-plugins-core createrepo
# Download required EPEL packages
[sms](*\#*) dnf download --destdir $epel_repo_dir_confluent fping libconfuse libunwind
Member

This seems strange; why don't we just enable EPEL on the compute nodes?

@adrianreber
Member

Hi Adrian, I don't know what state the management server and cluster are in, but usually the error you are seeing happens when the automation SSH key is missing from the /etc/confluent/ssh directory. This key should have been created during the osdeploy initialize step. The input.local file should have an initialize_options variable with the value usklpta, where the a option creates the key in question.

I just copied usklpt without the a. Retrying with the additional a now.

@adrianreber
Member

Now the compute nodes are provisioned, but I cannot log in:

# confluent_selfcheck -n c1
OS Deployment: Initialized
Confluent UUID: Consistent
Web Server: Running
Web Certificate: OK
Checking web download: Failed to download /confluent-public/site/confluent_uuid
Checking web API access: Failed access, if selinux is enabled, `setsebool -P httpd_can_network_connect=1`, otherwise check web proxy configuration
TFTP Status: OK
SSH root user public key: OK
Checking SSH Certificate authority: OK
Checking confluent SSH automation key: OK
Checking for blocked insecure boot: OK
Checking IPv6 enablement: OK
Performing node checks for 'c1'
Checking node attributes in confluent...
Checking network configuration for c1
c1 appears to have network configuration suitable for IPv4 deployment via: ens2f0
No issues detected with attributes of c1
Checking name resolution: OK

With Warewulf 3 provisioning, the SSH keys from /root/.ssh automatically end up on the compute nodes and SSH works. Can confluent also use one of those existing keys and add it to the compute nodes?

Also, the current recipe does not wait until the compute nodes are provisioned. It immediately continues and all commands like nodeshell fail, because the provisioning is not finished.

@adrianreber
Member

Ah, so the problem is that I have SSH keys in different formats and the last one in the list uses an unsupported algorithm.

In /opt/confluent/lib/python/confluent/sshutil.py all SSH keys are copied to the provisioning image, but instead of overwriting the previous key it would probably make more sense to append all keys.

Following code change seems to work for me:

--- /opt/confluent/lib/python/confluent/sshutil.py	2023-11-15 16:30:46.000000000 +0000
+++ /opt/confluent/lib/python/confluent/sshutil.py.new	2024-08-12 09:10:48.601474767 +0000
@@ -214,10 +214,14 @@
     else:
         suffix = 'rootpubkey'
     for auth in authorized:
-        shutil.copy(
-            auth,
+        local_key = open(auth, 'r')
+        dest = open(
             '/var/lib/confluent/public/site/ssh/{0}.{1}'.format(
-                    myname, suffix))
+                    myname, suffix), 'a')
+        dest.write(local_key.read())
+    if os.path.exists(
+            '/var/lib/confluent/public/site/ssh/{0}.{1}'.format(
+                myname, suffix)):
         os.chmod('/var/lib/confluent/public/site/ssh/{0}.{1}'.format(
                 myname, suffix), 0o644)
         os.chown('/var/lib/confluent/public/site/ssh/{0}.{1}'.format(

Instead of copying all the files and overwriting everything with the last file, this appends all public keys.

@adrianreber
Member

Now SSH works, but provisioning fails again:

Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 612, in resourcehandler
    for rsp in resourcehandler_backend(env, start_response):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 635, in resourcehandler_backend
    for res in selfservice.handle_request(env, start_response):
  File "/opt/confluent/lib/python/confluent/selfservice.py", line 526, in handle_request
    status, output = syncfiles.get_syncresult(nodename)
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 356, in get_syncresult
    result = syncrunners[nodename].wait()
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 181, in wait
    return self._exit_event.wait()
  File "/usr/lib/python3.9/site-packages/eventlet/event.py", line 132, in wait
    current.throw(*self._exc)
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 221, in main
    result = function(*args, **kwargs)
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 215, in sync_list_to_node
    raise Exception("Syncing failed due to unreadable files: " + ','.join(unreadablefiles))
Exception: Syncing failed due to unreadable files: /tmp/tmp9t4o6x20/etc/shadow
Aug 12 10:48:09 Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 197, in sync_list_to_node
    output, stderr = util.run(
  File "/opt/confluent/lib/python/confluent/util.py", line 48, in run
    raise subprocess.CalledProcessError(retcode, process.args, output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['rsync', '-rvLD', '/tmp/tmpszubn5dq.synctoc2/', 'root@[10.241.58.133]:/']' returned non-zero exit status 23.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 612, in resourcehandler
    for rsp in resourcehandler_backend(env, start_response):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 635, in resourcehandler_backend
    for res in selfservice.handle_request(env, start_response):
  File "/opt/confluent/lib/python/confluent/selfservice.py", line 526, in handle_request
    status, output = syncfiles.get_syncresult(nodename)
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 356, in get_syncresult
    result = syncrunners[nodename].wait()
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 181, in wait
    return self._exit_event.wait()
  File "/usr/lib/python3.9/site-packages/eventlet/event.py", line 132, in wait
    current.throw(*self._exc)
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 221, in main
    result = function(*args, **kwargs)
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 215, in sync_list_to_node
    raise Exception("Syncing failed due to unreadable files: " + ','.join(unreadablefiles))
Exception: Syncing failed due to unreadable files: /tmp/tmpw077yd0x/etc/shadow

It makes some sense, because /tmp/tmpw077yd0x/etc/shadow is indeed mode 000, but I am not sure what is going on; running the same rsync command as root works without errors.

Currently I am again stuck in provisioning:

# nodedeploy compute
c1: pending: rocky-9.4-x86_64-default
c2: pending: rocky-9.4-x86_64-default
# confluent_selfcheck -n c1
OS Deployment: Initialized
Confluent UUID: Consistent
Web Server: Running
Web Certificate: OK
Checking web download: Failed to download /confluent-public/site/confluent_uuid
Checking web API access: Failed access, if selinux is enabled, `setsebool -P httpd_can_network_connect=1`, otherwise check web proxy configuration
TFTP Status: OK
SSH root user public key: OK
Checking SSH Certificate authority: OK
Checking confluent SSH automation key: OK
Checking for blocked insecure boot: OK
Checking IPv6 enablement: OK
Performing node checks for 'c1'
Checking node attributes in confluent...
Checking network configuration for c1
c1 appears to have network configuration suitable for IPv4 deployment via: ens2f0
No issues detected with attributes of c1
Checking name resolution: OK

@jjohnson42

Following code change seems to work for me:

Pull request is welcome for that one. It has come up but we didn't quite get around to appending keys when dealing with multiple /root/.ssh/*.pub keys. https://github.com/xcat2/confluent/pulls

@jjohnson42

On the /etc/shadow issue: this is a consequence of confluent not being allowed to run as root, so for files like /etc/shadow you would need a copy readable by the confluent user if that is desired. As an option, we frequently support syncing /etc/passwd and 'stubbing out' shadow so that such accounts are password-disabled.

@adrianreber
Member

On the /etc/shadow issue: this is a consequence of confluent not being allowed to run as root, so for files like /etc/shadow you would need a copy readable by the confluent user if that is desired. As an option, we frequently support syncing /etc/passwd and 'stubbing out' shadow so that such accounts are password-disabled.

How could this be best automated in a recipe like we are trying to build here? Any recommendations?

@jjohnson42

I'd probably offer some example choices:
- Use 'Merge' support of /etc/passwd, do not include shadow. This will produce 'password disabled' instances of the users from passwd, for ssh key based access only
- Give confluent read access to /etc/shadow
- Make a blessed /etc/shadow copy for confluent to distribute
- Use a separate mechanism or invocation to push out /etc/shadow (e.g. nodersync manually run as the root user can do it).

I think we were imagining the first option, that sync targets aren't interested in the passwords.

Note that root password is a node attribute and can be set in the confluent db. The default is to disable root password unless specified. If set during deploy, it will get that root password into shadow (though before syncfiles run).
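For the 'give confluent read access' option, an ACL is usually nicer than loosening the file mode for everyone; a sketch, assuming the service account is named confluent:

# grant only the confluent service user read access to /etc/shadow
setfacl -m u:confluent:r /etc/shadow
# verify the resulting ACL
getfacl /etc/shadow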

@adrianreber
Member

Following code change seems to work for me:

Pull request is welcome for that one. It has come up but we didn't quite get around to appending keys when dealing with multiple /root/.ssh/*.pub keys. https://github.com/xcat2/confluent/pulls

xcat2/confluent#159

@adrianreber
Member

I'd probably offer some example choices:

  • Use 'Merge' support of /etc/passwd, do not include shadow. This will produce 'password disabled' instances of the users from passwd, for ssh key based access only
  • Give confluent read access to /etc/shadow
  • Make a blessed /etc/shadow copy for confluent to distribute
  • Use a separate mechanism or invocation to push out /etc/shadow (e.g. nodersync manually run as the root user can do it).

I think we were imagining the first option, that sync targets aren't interested in the passwords.

Note that root password is a node attribute and can be set in the confluent db. The default is to disable root password unless specified. If set during deploy, it will get that root password into shadow (though before syncfiles run).

As this recipe is contributed by you (upstream confluent), I would let you decide how to design and implement it, with the proper warnings in the documentation. Whatever makes most sense for you: if the recipe results in a working cluster, we are happy to include it. Maybe the merge support makes sense, as we hardly use passwords anyway, or the blessed copy. I defer to you and your experience on what makes most sense.

@adrianreber
Member

With a chmod 644 /etc/shadow I have a workaround. We should still have a proper solution in the recipe to handle /etc/shadow.

The following things need to be fixed at this point:

  • the recipe needs to wait until the compute nodes are ready
  • epel-release needs to be installed on the compute nodes
  • ohpc-release needs to be installed on the compute nodes

For warewulf we do:

export CHROOT=/opt/ohpc/admin/images/rocky9.3
wwmkchroot -v rocky-9 $CHROOT
dnf -y --installroot $CHROOT install epel-release
cp -p /etc/yum.repos.d/OpenHPC*.repo $CHROOT/etc/yum.repos.d

As confluent first does the installation and then changes the running compute node, this approach will not work.
For Rocky and AlmaLinux something like this will work:

# nodeshell compute dnf -y  install epel-release
# nodeshell compute dnf -y  install http://repos.openhpc.community/OpenHPC/3/EL_9/x86_64/ohpc-release-3-1.el9.x86_64.rpm

The following commands are unnecessary or do not work:

# nodeshell compute dnf -y  install ntp
# nodeshell compute dnf -y  install  --enablerepo=powertools lmod-ohpc #powertools does not exist, it is called crb and already enabled earlier
# nodeshell compute systemctl restart nfs
c1: Failed to restart nfs.service: Unit nfs.service not found.
c2: Failed to restart nfs.service: Unit nfs.service not found.

This is needed: nodeshell compute dnf -y install nfs-utils

The existing /etc/hosts from the SMS is not synced to the compute nodes.

Besides the items mentioned here we seem to be able to get a cluster with two compute nodes running.

The nice thing for OpenHPC is that with this recipe we would finally have a stateful provisioning recipe again.

When we still had an xCAT stateful recipe, it was explicitly marked as stateful; I am not sure how you want to handle this. Do you want one recipe which can do either stateful or stateless provisioning, or two recipes?

@jjohnson42

So if I'm understanding correctly, we need to wait for nodedeploy to show:

 # nodedeploy r3u23
r3u23: completed: alma-9.4-x86_64-default

Changes to syncfiles to include:
/etc/hosts
/etc/yum.repos.d/OpenHPC*.repo $CHROOT/etc/yum.repos.d

And in post.d, install epel-release.

For nfs-utils, we could add it to the pkglist, or add a 'dnf -y install nfs-utils' as a 'post.d' script.

For diskless, maybe a different recipe. It will be more 'warewulf' like, with 'imgutil build' and 'imgutil exec'. There's also been a suggestion to make the 'installimage' script work for those instead of just clones.
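For the waiting part, a small polling loop in the recipe should be enough, e.g. (a sketch, not something the current recipe does):

# block until every node in the compute group reports "completed"
while nodedeploy compute | grep -qv completed; do
    sleep 30
done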

@adrianreber
Member

/etc/yum.repos.d/OpenHPC*.repo $CHROOT/etc/yum.repos.d

Either install the repo file, which requires also copying the keys, or install the ohpc-release RPM via dnf.

@jjohnson42

@adrianreber To go back, did you want to do a pull request for the ssh key handling change, or did you want it done on your behalf? I kind of like the idea of the pull request to keep it clear who did what, but can just work it from your comment if preferred.

@adrianreber
Member

@adrianreber To go back, did you want to do a pull request for the ssh key handling change, or did you want it done on your behalf? I kind of like the idea of the pull request to keep it clear who did what, but can just work it from your comment if preferred.

I already did at xcat2/confluent#159

@jjohnson42

Thanks, sorry for not noticing sooner. I accepted and amended it just a tad (to empty out the file before writing, and using 'with' to manage open/close of the files).

@jjohnson42

@adrianreber FYI, confluent 3.11.0 has been released including your change for ssh pubkey handling.

@tkucherera-lenovo
Author

@adrianreber Since the compute nodes are provisioned without internet access, running commands like nodeshell compute dnf -y install http://repos.openhpc.community/OpenHPC/3/EL_9/x86_64/ohpc-release-3-1.el9.x86_64.rpm would fail. Do you advise that we set up a NAT gateway on the master node to give the computes internet access, or should we follow what the xcat recipe was doing, which is setting up a local copy of the ohpc repo and then configuring a repo that the computes can reach via the web root xcat sets up? See here:

# Add OpenHPC repo mirror hosted on SMS
[sms](*\#*) psh compute dnf config-manager --add-repo=http://$sms_ip/$ohpc_repo_dir/OpenHPC.local.repo
# Replace local path with SMS URL
[sms](*\#*) psh compute "perl -pi -e 's/file:\/\/\@PATH\@/http:\/\/$sms_ip\/"${ohpc_repo_dir//\//"\/"}"/s' \
        /etc/yum.repos.d/OpenHPC.local.repo"

@adrianreber
Member

@adrianreber Since the compute nodes are provisioned without internet access, running commands like nodeshell compute dnf -y install http://repos.openhpc.community/OpenHPC/3/EL_9/x86_64/ohpc-release-3-1.el9.x86_64.rpm would fail. Do you advise that we set up a NAT gateway on the master node to give the computes internet access, or should we follow what the xcat recipe was doing, which is setting up a local copy of the ohpc repo and then configuring a repo that the computes can reach via the web root xcat sets up? See here:

# Add OpenHPC repo mirror hosted on SMS
[sms](*\#*) psh compute dnf config-manager --add-repo=http://$sms_ip/$ohpc_repo_dir/OpenHPC.local.repo
# Replace local path with SMS URL
[sms](*\#*) psh compute "perl -pi -e 's/file:\/\/\@PATH\@/http:\/\/$sms_ip\/"${ohpc_repo_dir//\//"\/"}"/s' \
        /etc/yum.repos.d/OpenHPC.local.repo"

Hmm, I see. In our test setup all nodes have internet access; that is why I didn't really think about it.

I would say we mention somewhere in the documentation that the nodes need internet access for all the steps, and leave it to the user to configure NAT, a proxy, or whatever. That would be the easiest solution and is acceptable to me, since we do not talk about network setup or about securing the nodes or the head node anyway.

What do you think?

For our testing we actually set up a proxy server to reduce re-downloading of RPMs, so even with internet access we already change the network setup slightly.
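If it helps the documentation, a NAT setup on the SMS could be sketched roughly like this with firewalld (interface names are placeholders for the public and cluster-internal interfaces):

# enable IPv4 forwarding on the SMS
echo 'net.ipv4.ip_forward = 1' > /etc/sysctl.d/90-cluster-nat.conf
sysctl --system

# masquerade traffic from the cluster network out of the public interface
firewall-cmd --permanent --zone=public --change-interface=eth0
firewall-cmd --permanent --zone=internal --change-interface=eth1
firewall-cmd --permanent --zone=public --add-masquerade
firewall-cmd --reload

The compute nodes then only need their default route pointed at the SMS address on the cluster network.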

@tkucherera-lenovo
Author

Having the nodes set up to access the internet also works for me.

@tkucherera-lenovo force-pushed the confluent_slurm branch 2 times, most recently from 83c0b50 to 64c8ca5 on September 18, 2024 at 13:29
@tkucherera-lenovo
Author

@adrianreber I have made some changes to incorporate the points discussed:

  1. adding epel-release and the ohpc repo to the nodes
  2. installing nfs-utils on the computes
  3. syncing /etc/hosts
  4. fixing documentation bugs

Note: the nfs.service not found error you were getting could be because NFS is not installed on the master node. According to section 1.2 of the ohpc install guide, NFS is hosted on the master node, but I do not see where it is installed in either the Warewulf or the xcat guide. Is it assumed to already be installed? Please advise.
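If it turns out NFS does have to be set up explicitly, what I would add on the SMS side is roughly the following (export options modeled on the other OpenHPC guides, adjust as needed):

# install and enable the NFS server on the SMS
dnf -y install nfs-utils
systemctl enable --now nfs-server

# export the directories shared with the compute nodes
echo "/home *(rw,no_subtree_check,fsid=10,no_root_squash)" >> /etc/exports
echo "/opt/ohpc/pub *(ro,no_subtree_check,fsid=11)" >> /etc/exports
exportfs -a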
