
Fix duplication of nodes in NodeName #172

Open · wants to merge 1 commit into base: master

Conversation


@jovial jovial commented Nov 8, 2024

Prior to this change, slurmctld would fail to start when adding nodes to multiple partitions, e.g.:

openhpc_slurm_partitions:
    - name: allnodes
      groups:
        - name: alfa
        - name: beta
      partition_params:
        Priority: 50
      default: YES
    - name: alfa
      partition_params:
        Priority: 100
    - name: bravo
      partition_params:
        Priority: 100

This change tracks all nodes that already have a NodeName entry and will not add them again.
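
For reference, the dedup can be sketched in Jinja2 roughly as follows. This is illustrative only, not the role's actual slurm.conf template; the groups/partition fallback follows the role's documented semantics quoted in the comment below, and the node attributes are elided:

{# Remember which groups have already emitted a NodeName line, so a group #}
{# appearing in several partitions is only templated once. #}
{% set seen = namespace(nodes=[]) %}
{% for partition in openhpc_slurm_partitions %}
{%   for group in partition.groups | default([partition]) %}
{%     if group.name not in seen.nodes %}
{%       set seen.nodes = seen.nodes + [group.name] %}
NodeName={{ group.name }} ...
{%     endif %}
{%   endfor %}
{% endfor %}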

@jovial jovial requested a review from a team as a code owner November 8, 2024 15:04

jovial commented Nov 8, 2024

This does not require using filters outside of the role, unlike the workaround we have previously used:

# As per the docs:
# - A partition entry may contain groups. If it doesn't the partition itself is the group.
# - Nodes can only be in one group (else slurmctld startup fails with "fatal: Duplicated NodeName"
#   due to the templating approach. However $group.partition_params.Nodes can be used to
#   override the automatic definition of Nodes=.
# - To avoid having to explicitly list the nodes, we can use the hostlist_expression
#   filter provided by the stackhpc.openhpc role. This works as this expression is
#   evaluated inside the role. NB: this filter returns a list.
# - Have to handle the GPU nodes not existing in staging.
_general_nodes:
  - "{{ groups[openhpc_cluster_name + '_general'] | hostlist_expression | join(',') }}"
  - "{{ groups.get(openhpc_cluster_name + '_a40_1024', []) | hostlist_expression | join(',') }}"
  - "{{ groups.get(openhpc_cluster_name + '_a40_512', []) | hostlist_expression | join(',') }}"

openhpc_slurm_partitions:
  - name: general # all nodes except grid - this templates the actual ${cluster_name}-general nodes ...
    partition_params:
      Nodes: "{{ _general_nodes | join(',') }}" # .. this adds the other nodes
    default: YES
  - name: gpu # all GPU nodes
    groups:
      - name: a40_1024
        gres:
          - conf: gpu:A40:2
            file: /dev/nvidia[0-1]
      - name: a40_512
        gres:
          - conf: gpu:A40:2
            file: /dev/nvidia[0-1]
    default: NO
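
With this change merged, the same layout could presumably be written directly, with the GPU groups simply listed in both partitions and no external hostlist_expression filtering. A sketch reusing the group names above (untested):

openhpc_slurm_partitions:
  - name: general
    groups:
      - name: general
      - name: a40_1024
      - name: a40_512
    default: YES
  - name: gpu
    groups:
      - name: a40_1024
        gres:
          - conf: gpu:A40:2
            file: /dev/nvidia[0-1]
      - name: a40_512
        gres:
          - conf: gpu:A40:2
            file: /dev/nvidia[0-1]
    default: NO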


sjpb commented Nov 8, 2024

Can someone add a molecule test for this specific case please?


@sjpb sjpb left a comment


Can we get a) a molecule test and b) docs changes to match up with the code behaviour changes, please?


@jianguo-ukaea jianguo-ukaea left a comment


I have written a molecule test named test15 and have pushed the revision to my forked repo: https://github.com/jianguo-ukaea/ansible-role-openhpc.git
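
For illustration, the converge play for such a scenario might look roughly like this. This is a hypothetical sketch based on the example in the PR description, not the actual test15; required role variables such as openhpc_enable and openhpc_slurm_control_host are omitted for brevity:

# molecule/test15/converge.yml (hypothetical sketch)
- name: Converge
  hosts: all
  tasks:
    - name: Configure slurm with a group appearing in two partitions
      include_role:
        name: stackhpc.openhpc
      vars:
        openhpc_cluster_name: testohpc  # assumed, matching other scenarios
        openhpc_slurm_partitions:
          - name: allnodes      # partition reusing the 'compute' group ...
            groups:
              - name: compute
            default: YES
          - name: compute       # ... which is also a partition of its own
            default: NO

Before this fix, such a configuration would hit the "fatal: Duplicated NodeName" failure at slurmctld startup; after it, the daemon should start cleanly.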
