Provide troubleshooting guidance when "TASK [core/cluster : cluster | Create new cluster]" hangs #581

troppens · 2021-11-20T20:19:16Z

Describe the bug
I provisioned three VMs on virtual infrastructure and tried to create a three-node Spectrum Scale cluster.

The following step was hanging for an hour or so:

TASK [core/cluster : cluster | Create new cluster] *********************************************

I added a debug message to the core/cluster.yml:

    - debug:
        msg: "/usr/lpp/mmfs/bin/mmcrcluster -N /var/mmfs/tmp/NodeFile -C {{ scale_cluster_clustername }} {{ profile_type }} {{ extra_option }}"

    - name: cluster | Create new cluster
      command: /usr/lpp/mmfs/bin/mmcrcluster -N /var/mmfs/tmp/NodeFile -C {{ scale_cluster_clustername }} {{ profile_type }} {{ extra_option }}
      notify: accept-licenses
      register: mmcrcluster_results

In the next run of the playbook it gave me a hint:

TASK [core/cluster : debug] ******************************************************************************************************************************************************************
ok: [sc1-n1 -> sc1-n1] => {
    "msg": "/usr/lpp/mmfs/bin/mmcrcluster -N /var/mmfs/tmp/NodeFile -C gpfs1.local  "
}

So I tried this command without Ansible:

[root@sc1-n1 ~]# /usr/lpp/mmfs/bin/mmcrcluster -N /var/mmfs/tmp/NodeFile -C gpfs1.local
mmcrcluster: Performing preliminary node verification ...
mmcrcluster: Processing quorum and other critical nodes ...
The authenticity of host 'sc1-n3.fyre.ibm.com (10.11.22.161)' can't be established.
ECDSA key fingerprint is SHA256:2J35XBfRLzv5RUqHYH9rGCGA+jS1KR/Lw1f1+n0JbSU.
The authenticity of host 'sc1-n1.fyre.ibm.com (10.11.17.240)' can't be established.
ECDSA key fingerprint is SHA256:2J35XBfRLzv5RUqHYH9rGCGA+jS1KR/Lw1f1+n0JbSU.
The authenticity of host 'sc1-n2.fyre.ibm.com (10.11.22.160)' can't be established.
ECDSA key fingerprint is SHA256:2J35XBfRLzv5RUqHYH9rGCGA+jS1KR/Lw1f1+n0JbSU.
Are you sure you want to continue connecting (yes/no/[fingerprint])?

Ah. SSH is not set up properly. For a new user this is not easy to determine, although this is mentioned in the README.

To improve usability, it would be good to have an additional check in the role for ssh connectivity to make the troubleshooting easier for new users.

I also considered adding a Troubleshooting section to the README, though a check in the role would be preferred.

To Reproduce
Steps to reproduce the behavior:

1. Provision three VMs
2. Follow instructions in README and run ansible-playbook -i hosts playbook.yml

Expected behavior
Described above.

Environment
Please run the following and paste your output here:

Spectrum Scale 5.1.2.1
Ansible from EPEL
Current version of this project

rajan-mis · 2021-11-21T04:26:21Z

Thanks @troppens, agreed. We have a pre-check in the CLI toolkit through mmnetverify. We were discussing whether we could validate the Ansible host inventory through mmnetverify, but mmnetverify is an internal tool that we can't use here because it is not open source.

We can add this info to the README so that users can run mmnetverify manually before starting the Scale cluster creation.

I will also check whether we can add an Ansible SSH module to validate the host inventory. Thanks

troppens · 2021-11-22T08:37:33Z

mmnetverify is externally available:
https://www.ibm.com/docs/en/spectrum-scale/5.1.2?topic=reference-mmnetverify-command

Regarding ssh, I believe that this is required from the admin node to all other nodes, but not from each node to all other nodes. From an ease-of-use perspective it might be desirable to have ssh from any node to any other node. This would be good for new users just getting started with Spectrum Scale, e.g. for evaluation or demo. From a production perspective it might be desirable to restrict ssh to improve security.

acch · 2021-11-24T16:04:06Z

A few thoughts on this one:

I like the idea of printing stdout/stderr of the underlying mm* commands in human-readable format in case of errors. This is especially true for essential commands like mmcrcluster, mmcrnsd, mmcrfs, etc.
I've repeatedly seen people stumble over missing SSH keys, fingerprints, etc. Esp. for new users getting started with Scale this is always confusing. We have all the necessary logic (in core_prepare) to configure SSH and exchange keys... but it's all disabled by default. Most users might not even be aware of these vars:
- scale_prepare_enable_ssh_login
- scale_prepare_exchange_keys

@rajan-mis: Wouldn't it make sense to change the default of these to true so that it "just works", esp. for new users just getting started?
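For readers who want the opt-in behavior today, a minimal sketch of enabling both variables explicitly could look like the following. The variable names are the ones listed above; the file location is only an assumption, and the exact effect of each variable should be checked against the role documentation.

```yaml
# group_vars/all.yml
# (file location is an assumption; any place where playbook variables
# are defined should work)

# Both variables are named in this thread and are reported to be
# disabled by default. Enabling them changes SSH settings on the nodes,
# so review the security impact before using this outside a test setup.
scale_prepare_enable_ssh_login: true
scale_prepare_exchange_keys: true
```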

troppens · 2021-11-24T18:00:41Z

I missed the prep roles too, although I know that ssh must be configured. I was curious to see what breaks when I work with the OS and the role defaults ;-)

I would not change ssh settings per default because they impact security.

I believe it is sufficient to do a quick check on the node which executes mmcrcluster. From there just do an ssh to each Spectrum Scale node used in the mmcrcluster. Add a timer so that the command fails after a few seconds or one minute at most. This provides fast feedback to users of the Spectrum Scale roles and it implicitly teaches Spectrum Scale newbies to plan for ssh prerequisites.
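A rough sketch of such a check, written as Ansible tasks that would run before "cluster | Create new cluster", might look like this. It is not taken from the role: the list variable `scale_ssh_check_nodes` is hypothetical, and the tasks would have to run on (or be delegated to) the same node that executes mmcrcluster. `BatchMode=yes` makes ssh fail instead of prompting for a password or a host key confirmation, and `ConnectTimeout` bounds the wait per node.

```yaml
# Sketch only. Assumptions:
# - scale_ssh_check_nodes is a hypothetical list holding the same host
#   names that end up in /var/mmfs/tmp/NodeFile.
# - These tasks run on the node that later executes mmcrcluster.

- name: cluster | Check non-interactive SSH to all cluster nodes
  command: >
    ssh -o BatchMode=yes -o ConnectTimeout=10 {{ item }} true
  loop: "{{ scale_ssh_check_nodes }}"
  register: ssh_check
  changed_when: false   # read-only probe, never reports a change
  failed_when: false    # collect all results first, fail in the next task

- name: cluster | Fail early if SSH is not set up
  fail:
    msg: >-
      Non-interactive SSH failed for:
      {{ ssh_check.results | selectattr('rc', 'ne', 0)
         | map(attribute='item') | join(', ') }}.
      Set up SSH keys and known_hosts entries as described in the README.
  when: ssh_check.results | selectattr('rc', 'ne', 0) | list | length > 0
```

With a check along these lines, the missing host key confirmation seen in the manual mmcrcluster run above would surface as a failed task after a few seconds rather than as a hanging playbook.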

acch added the labels Component: Core and Type: Enhancement on Nov 24, 2021
