Ansible task to create the Slurm cluster fails
The Ansible playbook that creates the ohpc master exits with a non-zero status when creating the Slurm cluster. This happens after the compute node is provisioned and operational. The playbook continues after its 150-second pause waiting for the compute node to come up, then runs `sacctmgr create cluster xcbc-example -i`, which returns 1 and reports that a cluster by that name already exists.
ohpc: TASK [compute_build_nodes : add nodes via wwnodescan - BOOT NODES NOW, IN ORDER] ***
ohpc: changed: [ohpc]
ohpc:
ohpc: TASK [compute_build_nodes : set files to provision] ****************************
ohpc: changed: [ohpc]
ohpc:
ohpc: TASK [compute_build_nodes : sync files] ****************************************
ohpc: changed: [ohpc]
ohpc:
ohpc: TASK [compute_build_nodes : restart dhcp] **************************************
ohpc: changed: [ohpc]
ohpc:
ohpc: TASK [compute_build_nodes : update pxeconfig to force node to boot from pxe] ***
ohpc: changed: [ohpc]
ohpc:
ohpc: TASK [compute_build_nodes : update pxeconfig to let node boot from local disk] ***
ohpc: skipping: [ohpc]
ohpc:
ohpc: TASK [compute_build_nodes : wwsh pxe update] ***********************************
ohpc: changed: [ohpc]
ohpc:
ohpc: TASK [nodes_vivify : Waiting for the compute node to bootup] *******************
ohpc: Pausing for 150 seconds
ohpc: (ctrl+C then 'C' = continue early, ctrl+C then 'A' = abort)
ohpc: ok: [ohpc]
ohpc:
ohpc: TASK [nodes_vivify : create cluster using sacctmgr] ****************************
ohpc: fatal: [ohpc]: FAILED! => {"changed": true, "cmd": "sacctmgr create cluster xcbc-example -i", "delta": "0:00:00.204317", "end": "2018-09-20 22:44:01.418461", "msg": "non-zero return code", "rc": 1, "start": "2018-09-20 22:44:01.214144", "stderr": "", "stderr_lines": [], "stdout": " This cluster xcbc-example already exists. Not adding.", "stdout_lines": [" This cluster xcbc-example already exists. Not adding."]}
ohpc: to retry, use: --limit @/vagrant/CRI_XCBC/site.retry
ohpc:
ohpc: PLAY RECAP *********************************************************************
ohpc: ohpc : ok=86 changed=78 unreachable=0 failed=1
The SSH command responded with a non-zero exit status. Vagrant
assumes that this means the command failed. The output for this command
should be in the log above. Please read the output to determine what
went wrong.
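
One possible workaround, sketched here as an assumption rather than as the role's actual code, is to make the `create cluster using sacctmgr` task tolerate the "already exists" message. The task name, command, and cluster name are taken from the log above; the registered variable `sacctmgr_result` is a hypothetical name, and the real role may template the cluster name from a variable instead of hard-coding it.

```yaml
# Sketch only: treat "already exists" from sacctmgr as success so reruns stay idempotent.
# `sacctmgr_result` is a hypothetical variable name, not one taken from the role.
- name: create cluster using sacctmgr
  command: sacctmgr create cluster xcbc-example -i
  register: sacctmgr_result
  # List entries are ANDed: fail only on a non-zero rc that is NOT the "already exists" case.
  failed_when:
    - sacctmgr_result.rc != 0
    - "'already exists' not in sacctmgr_result.stdout"
  # Report "changed" only when the cluster was actually added.
  changed_when: sacctmgr_result.rc == 0
```

With a guard like this, a second provisioning run against a master whose Slurm accounting database already holds `xcbc-example` should report the task as `ok` instead of `fatal`. It does not explain why the cluster already exists at this point in a fresh run.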