Rocky 9.4 Base OS Confluent/Slurm Edition for Linux #2002

Open: tkucherera-lenovo wants to merge 1 commit into base 3.x from the confluent_slurm branch

Conversation

tkucherera-lenovo

This is a recipe that uses confluent for cluster provisioning.

Assumptions

  1. DNS is set up
  2. There is at least one SSH key on the SMS (the key is used for passwordless login on the nodes)

Note

  1. The makerepo.sh file does not check for Rocky Linux, so it had to be modified to detect when the OS is Rocky.
  2. The OpenHPC and EPEL repository directories live in /var/lib/confluent/public on the SMS, whereas the compute nodes reach them via the confluent-public web root.

@adrianreber
Member

Thanks, this is great. I will try it out on our CI systems.

@adrianreber
Member

1. DNS is set up

We have /etc/hosts. I hope that is enough.

2. There is at least one SSH key on the SMS (the key is used for passwordless login on the nodes)

That is also needed for all other recipes. So, no problem.

@adrianreber
Member

The resulting RPMs can be found in the GitHub Actions for the next 24 hours.


github-actions bot commented Aug 7, 2024

Test Results

14 files  ±0  14 suites  ±0   6s ⏱️ ±0s
41 tests ±0  41 ✅ ±0  0 💤 ±0  0 ❌ ±0 
54 runs  ±0  54 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit 8f82c3e. ± Comparison against base commit 8e59d8c.

♻️ This comment has been updated with latest results.

@tkucherera-lenovo
Author

Yes, /etc/hosts should be enough.

\input{common/install_ohpc_components_intro}

\subsection{Enable \OHPC{} repository for local use} \label{sec:enable_repo}
\input{common/enable_local_ohpc_repo_confluent}
Member

I am not aware of the history behind this line from the xcat recipe. In all other recipes we enable the OpenHPC repository by installing the OpenHPC release RPM, which enables a dnf repository pointing at the OpenHPC repository server. Hardcoding the download of the repository tar files feels unnecessary, especially as we do not do it in any of our current testing. Please try to work with the online repository if that works for you.

If you need it for your testing we should put it behind some variable, so that it can be disabled.

Is this strictly necessary for you or can you work with the online repository?

Author

Sure, noted. I will look to work with the online repo.

\subsubsection{Build initial BOS image} \label{sec:assemble_bos}
The following steps illustrate the process to build a minimal, default image for use with \Confluent{}. To begin, you will
first need to have a local copy of the ISO image available for the underlying OS. In this recipe, the relevant ISO image
is \texttt{Rocky-9.4-x86\_64-dvd1.iso} (available from the Rocky
Member

The image I downloaded does not have a "1" in the file name. The filename should be a variable so that it can be easily updated.

@adrianreber
Member

The main point that is currently not clear to me is whether Confluent comes with a DHCP server. I ran the script a couple of times and the two compute nodes were always waiting for DHCP answers in the PXE boot step of the firmware.

@adrianreber
Member

@tkucherera-lenovo Should I try again? Is there now a DHCP server configured, somehow?

For the final merge you can squash the commits. For the main repository it makes no sense to keep your development history with fixups. If you want, you can do separate commits for the docs/ part and the components/ part; that would make sense to me.

Please also add a Signed-off-by to your commit messages as described in https://github.com/openhpc/ohpc/blob/3.x/CONTRIBUTING.md. git commit -s usually does that automatically.
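For example, the whole flow could look like this (assuming the confluent_slurm branch from this PR and origin/3.x as the base):

# squash the fixup commits into a single commit on top of the 3.x base
git fetch origin
git rebase -i origin/3.x

# add the Signed-off-by trailer to the final commit
git commit --amend -s --no-edit

# update the pull request branch
git push --force-with-lease origin confluent_slurm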

@tkucherera-lenovo
Author

Yes, you can try again. Confluent does have its own DHCP server and by default it will respond to DHCP requests. If an environment has its own DHCP server, it is possible to configure confluent not to respond. In this case, though, I believe there was a bug where the setting that allows deployment via PXE was not being applied because the required variable was missing from the input.local file; I have added a fix for that now.
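For reference, the setting in question ends up as a node attribute roughly like the one below; treat the exact attribute name as illustrative on my part and confirm it against the confluent documentation and the generated input.local:

# illustrative only: allow PXE/HTTP network boot for the compute group
nodegroupattrib compute deployment.useinsecureprotocols=firmware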

Going forward I will squash all commits and also add the Signed-off-by to commits.

@adrianreber
Member

@tkucherera-lenovo Is there an easy way to reset the host machine without reinstalling? Where does confluent store its state? Is there a directory I can delete to start from scratch?

@tkucherera-lenovo
Author

The state is stored in /etc/confluent/*, so stopping confluent and running rm -rf /etc/confluent/* resets it. I would also recommend removing the OS profile under the /var/lib/confluent/public/os directory.
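Put together, roughly (assuming the systemd unit is simply named confluent):

# stop confluent and wipe its state
systemctl stop confluent
rm -rf /etc/confluent/*

# also remove any previously built OS profiles
rm -rf /var/lib/confluent/public/os/*

# start fresh and re-run the osdeploy initialize step from the recipe
systemctl start confluent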

@adrianreber
Member

Now I see that the compute nodes are trying to boot:

==> audit <==
Aug 10 10:43:07 {"operation": "update", "target": "/noderange/compute/boot/nextdevice", "allowed": true}
Aug 10 10:43:12 {"operation": "update", "target": "/noderange/compute/power/state", "allowed": true}

==> events <==
Aug 10 10:46:16 {"info": "Offering PXE boot with static address 10.241.58.133 to c2"}
Aug 10 10:46:18 {"info": "Offering PXE boot with static address 10.241.58.132 to c1"}
Aug 10 10:46:25 {"info": "Offering PXE boot with static address 10.241.58.133 to c2"}
Aug 10 10:46:28 {"info": "Offering PXE boot with static address 10.241.58.132 to c1"}

==> /var/log/httpd/access_log <==
10.241.58.133 - - [10/Aug/2024:10:46:32 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot.ipxe HTTP/1.1" 200 227 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.133 - - [10/Aug/2024:10:46:32 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/kernel HTTP/1.1" 200 13605704 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.133 - - [10/Aug/2024:10:46:32 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/initramfs/addons.cpio HTTP/1.1" 200 97792 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.133 - - [10/Aug/2024:10:46:32 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/initramfs/site.cpio HTTP/1.1" 200 3072 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.133 - - [10/Aug/2024:10:46:32 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/initramfs/distribution HTTP/1.1" 200 106800744 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.132 - - [10/Aug/2024:10:46:34 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot.ipxe HTTP/1.1" 200 227 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.132 - - [10/Aug/2024:10:46:34 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/kernel HTTP/1.1" 200 13605704 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.132 - - [10/Aug/2024:10:46:35 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/initramfs/addons.cpio HTTP/1.1" 200 97792 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.132 - - [10/Aug/2024:10:46:35 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/initramfs/site.cpio HTTP/1.1" 200 3072 "-" "iPXE/1.21.1 (g988d2)"
10.241.58.132 - - [10/Aug/2024:10:46:35 +0000] "GET /confluent-public/os/rocky-9.4-x86_64-default/boot/initramfs/distribution HTTP/1.1" 200 106800744 "-" "iPXE/1.21.1 (g988d2)"

But after that nothing seems to happen. On the console I see:

[console screenshot]

Any recommendations on how to continue?

Also, there seems to be no point during the installation where the script waits for the compute nodes to be ready, so most commands are run while the compute nodes are not yet available. All the customization steps fail with:

+ nodeshell compute echo '"10.241.58.134:/home' /home nfs nfsvers=3,nodev,nosuid 0 '0"' '>>' /etc/fstab
c1: ssh: connect to host c1 port 22: No route to host
c2: ssh: connect to host c2 port 22: No route to host

@adrianreber
Member

Now the installation is working, but it fails in the post-installation scripts. I see the following error on the server:

Aug 11 09:04:06 Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 193, in sync_list_to_node
    sshutil.prep_ssh_key('/etc/confluent/ssh/automation')
  File "/opt/confluent/lib/python/confluent/sshutil.py", line 139, in prep_ssh_key
    subprocess.check_output(['ssh-add', keyname], stdin=devnull, stderr=devnull)
  File "/usr/lib64/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib64/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ssh-add', '/etc/confluent/ssh/automation']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 612, in resourcehandler
    for rsp in resourcehandler_backend(env, start_response):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 635, in resourcehandler_backend
    for res in selfservice.handle_request(env, start_response):
  File "/opt/confluent/lib/python/confluent/selfservice.py", line 520, in handle_request
    result = syncfiles.start_syncfiles(
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 321, in start_syncfiles
    syncrunners[nodename].wait()
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 181, in wait
    return self._exit_event.wait()
  File "/usr/lib/python3.9/site-packages/eventlet/event.py", line 132, in wait
    current.throw(*self._exc)
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 221, in main
    result = function(*args, **kwargs)
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 215, in sync_list_to_node
    raise Exception("Syncing failed due to unreadable files: " + ','.join(unreadablefiles))
Exception: Syncing failed due to unreadable files: /tmp/tmpf19vlfb1/etc/shadow
Aug 11 09:04:07 Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 193, in sync_list_to_node
    sshutil.prep_ssh_key('/etc/confluent/ssh/automation')
  File "/opt/confluent/lib/python/confluent/sshutil.py", line 139, in prep_ssh_key
    subprocess.check_output(['ssh-add', keyname], stdin=devnull, stderr=devnull)
  File "/usr/lib64/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib64/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ssh-add', '/etc/confluent/ssh/automation']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 612, in resourcehandler
    for rsp in resourcehandler_backend(env, start_response):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 635, in resourcehandler_backend
    for res in selfservice.handle_request(env, start_response):
  File "/opt/confluent/lib/python/confluent/selfservice.py", line 520, in handle_request
    result = syncfiles.start_syncfiles(
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 321, in start_syncfiles
    syncrunners[nodename].wait()
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 181, in wait
    return self._exit_event.wait()
  File "/usr/lib/python3.9/site-packages/eventlet/event.py", line 132, in wait
    current.throw(*self._exc)
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 221, in main
    result = function(*args, **kwargs)
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 215, in sync_list_to_node
    raise Exception("Syncing failed due to unreadable files: " + ','.join(unreadablefiles))
Exception: Syncing failed due to unreadable files: /tmp/tmp2qhhryvt/etc/shadow

@tkucherera-lenovo
Author

Hi Adrian, I don't know what state the management server and cluster are in, but usually the error you are seeing happens when the automation SSH key is missing from the /etc/confluent/ssh directory. This key should have been created during the osdeploy initialize step. The input.local file should have an initialize_options variable with the value usklpta, where the a option creates the key in question.

Additionally, to help me debug, could you run the following command:

  1. confluent_selfcheck -n <nodename>

The output is often helpful for debugging. Thanks.

[sms](*\#*) mkdir -p $epel_repo_dir_confluent
[sms](*\#*) (*\install*) dnf-plugins-core createrepo
# Download required EPEL packages
[sms](*\#*) dnf download --destdir $epel_repo_dir_confluent fping libconfuse libunwind
Member

This seems strange; why don't we just enable EPEL on the compute nodes?

@adrianreber
Member

Hi Adrian, I don't know what state the management server and cluster are in, but usually the error you are seeing happens when the automation SSH key is missing from the /etc/confluent/ssh directory. This key should have been created during the osdeploy initialize step. The input.local file should have an initialize_options variable with the value usklpta, where the a option creates the key in question.

I just copied usklpt without the a. Retrying with the additional a now.

@adrianreber
Member

Now the compute nodes are provisioned, but I cannot log in:

# confluent_selfcheck -n c1
OS Deployment: Initialized
Confluent UUID: Consistent
Web Server: Running
Web Certificate: OK
Checking web download: Failed to download /confluent-public/site/confluent_uuid
Checking web API access: Failed access, if selinux is enabled, `setsebool -P httpd_can_network_connect=1`, otherwise check web proxy configuration
TFTP Status: OK
SSH root user public key: OK
Checking SSH Certificate authority: OK
Checking confluent SSH automation key: OK
Checking for blocked insecure boot: OK
Checking IPv6 enablement: OK
Performing node checks for 'c1'
Checking node attributes in confluent...
Checking network configuration for c1
c1 appears to have network configuration suitable for IPv4 deployment via: ens2f0
No issues detected with attributes of c1
Checking name resolution: OK

With Warewulf 3 provisioning, the SSH keys from /root/.ssh automatically end up on the compute nodes and SSH works. Can confluent also use one of those existing keys and add it to the compute nodes?

Also, the current recipe does not wait until the compute nodes are provisioned. It immediately continues and all commands like nodeshell fail, because the provisioning is not finished.

@adrianreber
Member

Ah, so the problem is that I have SSH keys in different formats and the last one in the list uses an unsupported algorithm.

In /opt/confluent/lib/python/confluent/sshutil.py all SSH keys are copied to the provisioning image, but instead of overwriting the previous key it would probably make more sense to append all keys.

Following code change seems to work for me:

--- /opt/confluent/lib/python/confluent/sshutil.py	2023-11-15 16:30:46.000000000 +0000
+++ /opt/confluent/lib/python/confluent/sshutil.py.new	2024-08-12 09:10:48.601474767 +0000
@@ -214,10 +214,14 @@
     else:
         suffix = 'rootpubkey'
     for auth in authorized:
-        shutil.copy(
-            auth,
+        local_key = open(auth, 'r')
+        dest = open(
             '/var/lib/confluent/public/site/ssh/{0}.{1}'.format(
-                    myname, suffix))
+                    myname, suffix), 'a')
+        dest.write(local_key.read())
+    if os.path.exists(
+            '/var/lib/confluent/public/site/ssh/{0}.{1}'.format(
+                myname, suffix)):
         os.chmod('/var/lib/confluent/public/site/ssh/{0}.{1}'.format(
                 myname, suffix), 0o644)
         os.chown('/var/lib/confluent/public/site/ssh/{0}.{1}'.format(

Instead of copying all the files and overwriting everything with the last file, this appends all public keys.

@adrianreber
Member

Now SSH works, but provisioning fails again:

Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 612, in resourcehandler
    for rsp in resourcehandler_backend(env, start_response):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 635, in resourcehandler_backend
    for res in selfservice.handle_request(env, start_response):
  File "/opt/confluent/lib/python/confluent/selfservice.py", line 526, in handle_request
    status, output = syncfiles.get_syncresult(nodename)
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 356, in get_syncresult
    result = syncrunners[nodename].wait()
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 181, in wait
    return self._exit_event.wait()
  File "/usr/lib/python3.9/site-packages/eventlet/event.py", line 132, in wait
    current.throw(*self._exc)
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 221, in main
    result = function(*args, **kwargs)
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 215, in sync_list_to_node
    raise Exception("Syncing failed due to unreadable files: " + ','.join(unreadablefiles))
Exception: Syncing failed due to unreadable files: /tmp/tmp9t4o6x20/etc/shadow
Aug 12 10:48:09 Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 197, in sync_list_to_node
    output, stderr = util.run(
  File "/opt/confluent/lib/python/confluent/util.py", line 48, in run
    raise subprocess.CalledProcessError(retcode, process.args, output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['rsync', '-rvLD', '/tmp/tmpszubn5dq.synctoc2/', 'root@[10.241.58.133]:/']' returned non-zero exit status 23.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 612, in resourcehandler
    for rsp in resourcehandler_backend(env, start_response):
  File "/opt/confluent/lib/python/confluent/httpapi.py", line 635, in resourcehandler_backend
    for res in selfservice.handle_request(env, start_response):
  File "/opt/confluent/lib/python/confluent/selfservice.py", line 526, in handle_request
    status, output = syncfiles.get_syncresult(nodename)
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 356, in get_syncresult
    result = syncrunners[nodename].wait()
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 181, in wait
    return self._exit_event.wait()
  File "/usr/lib/python3.9/site-packages/eventlet/event.py", line 132, in wait
    current.throw(*self._exc)
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 221, in main
    result = function(*args, **kwargs)
  File "/opt/confluent/lib/python/confluent/syncfiles.py", line 215, in sync_list_to_node
    raise Exception("Syncing failed due to unreadable files: " + ','.join(unreadablefiles))
Exception: Syncing failed due to unreadable files: /tmp/tmpw077yd0x/etc/shadow

It makes some sense, because /tmp/tmpw077yd0x/etc/shadow is indeed mode 000, but I am not sure what is going on; running the same rsync command as root works without errors.

Currently I am again stuck in provisioning:

# nodedeploy compute
c1: pending: rocky-9.4-x86_64-default
c2: pending: rocky-9.4-x86_64-default
# confluent_selfcheck -n c1
OS Deployment: Initialized
Confluent UUID: Consistent
Web Server: Running
Web Certificate: OK
Checking web download: Failed to download /confluent-public/site/confluent_uuid
Checking web API access: Failed access, if selinux is enabled, `setsebool -P httpd_can_network_connect=1`, otherwise check web proxy configuration
TFTP Status: OK
SSH root user public key: OK
Checking SSH Certificate authority: OK
Checking confluent SSH automation key: OK
Checking for blocked insecure boot: OK
Checking IPv6 enablement: OK
Performing node checks for 'c1'
Checking node attributes in confluent...
Checking network configuration for c1
c1 appears to have network configuration suitable for IPv4 deployment via: ens2f0
No issues detected with attributes of c1
Checking name resolution: OK

@jjohnson42

Following code change seems to work for me:

Pull request is welcome for that one. It has come up but we didn't quite get around to appending keys when dealing with multiple /root/.ssh/*.pub keys. https://github.com/xcat2/confluent/pulls

@jjohnson42

On the /etc/shadow issue: this is a consequence of confluent not being allowed to run as root, so for files like /etc/shadow you would need a copy readable by the confluent user if that is desired. As an option, we frequently support syncing /etc/passwd and 'stubbing out' shadow so that such accounts are password-disabled.

@adrianreber
Member

On the /etc/shadow issue: this is a consequence of confluent not being allowed to run as root, so for files like /etc/shadow you would need a copy readable by the confluent user if that is desired. As an option, we frequently support syncing /etc/passwd and 'stubbing out' shadow so that such accounts are password-disabled.

How could this be best automated in a recipe like we are trying to build here? Any recommendations?

@jjohnson42

I'd probably offer some example choices:
- Use 'Merge' support of /etc/passwd, do not include shadow. This will produce 'password disabled' instances of the users from passwd, for ssh key based access only
- Give confluent read access to /etc/shadow
- Make a blessed /etc/shadow copy for confluent to distribute
- Use a separate mechanism or invocation to push out /etc/shadow (e.g. nodersync manually run as the root user can do it).

I think we were imagining the first option, that sync targets aren't interested in the passwords.

Note that root password is a node attribute and can be set in the confluent db. The default is to disable root password unless specified. If set during deploy, it will get that root password into shadow (though before syncfiles run).
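For the 'give confluent read access' option, an ACL is usually nicer than loosening the file mode for everyone; a sketch, assuming the service account is named confluent:

# grant only the confluent service user read access to /etc/shadow
setfacl -m u:confluent:r /etc/shadow
# verify the resulting ACL
getfacl /etc/shadow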

@adrianreber
Member

Following code change seems to work for me:

Pull request is welcome for that one. It has come up but we didn't quite get around to appending keys when dealing with multiple /root/.ssh/*.pub keys. https://github.com/xcat2/confluent/pulls

xcat2/confluent#159

@adrianreber
Member

I'd probably offer some example choices:

  • Use 'Merge' support of /etc/passwd, do not include shadow. This will produce 'password disabled' instances of the users from passwd, for ssh key based access only
  • Give confluent read access to /etc/shadow
  • Make a blessed /etc/shadow copy for confluent to distribute
  • Use a separate mechanism or invocation to push out /etc/shadow (e.g. nodersync manually run as the root user can do it).

I think we were imagining the first option, that sync targets aren't interested in the passwords.

Note that root password is a node attribute and can be set in the confluent db. The default is to disable root password unless specified. If set during deploy, it will get that root password into shadow (though before syncfiles run).

As this recipe is contributed by you (upstream confluent), I would let you decide how to design and implement it, with the proper warnings in the documentation. Whatever makes most sense for you: if the recipe results in a working cluster, we are happy to include it. Maybe the merge support makes sense, as we hardly use passwords anyway, or the blessed copy. I defer to you and your experience on what makes most sense.

@adrianreber
Member

With a chmod 644 /etc/shadow I have a workaround. We should still have a proper solution in the recipe to handle /etc/shadow.

The following things need to be fixed at this point:

  • the recipe needs to wait until the compute nodes are ready
  • epel-release needs to be installed on the compute nodes
  • ohpc-release needs to be installed on the compute nodes

For warewulf we do:

export CHROOT=/opt/ohpc/admin/images/rocky9.3
wwmkchroot -v rocky-9 $CHROOT
dnf -y --installroot $CHROOT install epel-release
cp -p /etc/yum.repos.d/OpenHPC*.repo $CHROOT/etc/yum.repos.d

As confluent first does the installation and then changes the running compute node, this approach will not work.
For Rocky and AlmaLinux something like this will work:

# nodeshell compute dnf -y  install epel-release
# nodeshell compute dnf -y  install http://repos.openhpc.community/OpenHPC/3/EL_9/x86_64/ohpc-release-3-1.el9.x86_64.rpm

The following commands are unnecessary or do not work:

# nodeshell compute dnf -y  install ntp
# nodeshell compute dnf -y  install  --enablerepo=powertools lmod-ohpc #powertools does not exist, it is called crb and already enabled earlier
# nodeshell compute systemctl restart nfs
c1: Failed to restart nfs.service: Unit nfs.service not found.
c2: Failed to restart nfs.service: Unit nfs.service not found.

This is needed: nodeshell compute dnf -y install nfs-utils

The existing /etc/hosts from the SMS is not synced to the compute nodes.

Besides the items mentioned here we seem to be able to get a cluster with two compute nodes running.

The nice thing for OpenHPC is that with this recipe we would finally have a stateful provisioning recipe again.

When we still had an xCAT stateful recipe, it was explicitly marked as stateful; I am not sure how you want to handle this. Do you want one recipe which can do either stateful or stateless provisioning, or two recipes?

@jjohnson42

So if I'm understanding correctly, we need to wait for nodedeploy to show:

 # nodedeploy r3u23
r3u23: completed: alma-9.4-x86_64-default

Changes to syncfiles to include:
/etc/hosts
/etc/yum.repos.d/OpenHPC*.repo $CHROOT/etc/yum.repos.d

And in post.d, install epel-release.

For nfs-utils, we could add it to the pkglist, or add a 'dnf -y install nfs-utils' as a 'post.d' script.

For diskless, maybe a different recipe. It will be more 'warewulf' like, with 'imgutil build' and 'imgutil exec'. There's also been a suggestion to make the 'installimage' script work for those instead of just clones.
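For the waiting part, a small polling loop in the recipe should be enough, e.g. (a sketch, not something the current recipe does):

# block until every node in the compute group reports "completed"
while nodedeploy compute | grep -qv completed; do
    sleep 30
done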

@adrianreber
Member

/etc/yum.repos.d/OpenHPC*.repo $CHROOT/etc/yum.repos.d

Either install the repo file, which requires also copying the keys, or install the ohpc-release RPM via dnf.

@jjohnson42

@adrianreber To go back, did you want to do a pull request for the ssh key handling change, or did you want it done on your behalf? I kind of like the idea of the pull request to keep it clear who did what, but can just work it from your comment if preferred.

@adrianreber
Member

@adrianreber To go back, did you want to do a pull request for the ssh key handling change, or did you want it done on your behalf? I kind of like the idea of the pull request to keep it clear who did what, but can just work it from your comment if preferred.

I already did at xcat2/confluent#159

@jjohnson42

Thanks, sorry for not noticing sooner. I accepted and amended it just a tad (to empty out the file before writing, and using 'with' to manage open/close of the files).

@jjohnson42

@adrianreber FYI, confluent 3.11.0 has been released including your change for ssh pubkey handling.

@tkucherera-lenovo
Author

@adrianreber Since the compute nodes are provisioned without internet access, running commands like nodeshell compute dnf -y install http://repos.openhpc.community/OpenHPC/3/EL_9/x86_64/ohpc-release-3-1.el9.x86_64.rpm would fail. Do you advise that we set up a NAT gateway on the master node to give the computes internet access, or should we follow what the xcat recipe was doing, which is setting up a local copy of the ohpc repo and then configuring a repo that the computes can reach via the web root xcat sets up? See here:

# Add OpenHPC repo mirror hosted on SMS
[sms](*\#*) psh compute dnf config-manager --add-repo=http://$sms_ip/$ohpc_repo_dir/OpenHPC.local.repo
# Replace local path with SMS URL
[sms](*\#*) psh compute "perl -pi -e 's/file:\/\/\@PATH\@/http:\/\/$sms_ip\/"${ohpc_repo_dir//\//"\/"}"/s' \
        /etc/yum.repos.d/OpenHPC.local.repo"

@adrianreber
Member

@adrianreber Since the compute nodes are provisioned without internet access, running commands like nodeshell compute dnf -y install http://repos.openhpc.community/OpenHPC/3/EL_9/x86_64/ohpc-release-3-1.el9.x86_64.rpm would fail. Do you advise that we set up a NAT gateway on the master node to give the computes internet access, or should we follow what the xcat recipe was doing, which is setting up a local copy of the ohpc repo and then configuring a repo that the computes can reach via the web root xcat sets up? See here:

# Add OpenHPC repo mirror hosted on SMS
[sms](*\#*) psh compute dnf config-manager --add-repo=http://$sms_ip/$ohpc_repo_dir/OpenHPC.local.repo
# Replace local path with SMS URL
[sms](*\#*) psh compute "perl -pi -e 's/file:\/\/\@PATH\@/http:\/\/$sms_ip\/"${ohpc_repo_dir//\//"\/"}"/s' \
        /etc/yum.repos.d/OpenHPC.local.repo"

Hmm, I see. In our test setup all nodes have internet access; that is why I didn't really think about it.

I would say we mention somewhere in the documentation that the nodes need internet access for all the steps, and leave it to the user to configure NAT, a proxy, or whatever. That would be the easiest solution and is acceptable to me, since we do not talk about network setup or about securing the nodes or the head node anyway.

What do you think?

For our testing we actually set up a proxy server to reduce re-downloading of RPMs, so even with internet access we already change the network setup slightly.
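If it helps the documentation, a NAT setup on the SMS could be sketched roughly like this with firewalld (interface names are placeholders for the public and cluster-internal interfaces):

# enable IPv4 forwarding on the SMS
echo 'net.ipv4.ip_forward = 1' > /etc/sysctl.d/90-cluster-nat.conf
sysctl --system

# masquerade traffic from the cluster network out of the public interface
firewall-cmd --permanent --zone=public --change-interface=eth0
firewall-cmd --permanent --zone=internal --change-interface=eth1
firewall-cmd --permanent --zone=public --add-masquerade
firewall-cmd --reload

The compute nodes then only need their default route pointed at the SMS address on the cluster network.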

@tkucherera-lenovo
Author

Having the nodes set up to access the internet also works for me.

@tkucherera-lenovo force-pushed the confluent_slurm branch 2 times, most recently from 83c0b50 to 64c8ca5 on September 18, 2024 at 13:29
@tkucherera-lenovo
Author

@adrianreber I have made some changes to incorporate the points discussed:

  1. adding epel-release and the ohpc repo to the nodes
  2. installing nfs-utils on the computes
  3. syncing /etc/hosts
  4. fixing documentation bugs

Note: the nfs.service not found error you were getting could be because NFS is not installed on the master node. According to section 1.2 of the ohpc install guide, NFS is hosted on the master node, but I do not see where it is installed in either the Warewulf or the xcat guide. Is it assumed to already be installed? Please advise.
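If it turns out NFS does have to be set up explicitly, what I would add on the SMS side is roughly the following (export options modeled on the other OpenHPC guides, adjust as needed):

# install and enable the NFS server on the SMS
dnf -y install nfs-utils
systemctl enable --now nfs-server

# export the directories shared with the compute nodes
echo "/home *(rw,no_subtree_check,fsid=10,no_root_squash)" >> /etc/exports
echo "/opt/ohpc/pub *(ro,no_subtree_check,fsid=11)" >> /etc/exports
exportfs -a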
