
Flexible SLURM configuration #339

Merged (18 commits into GEOS-ESM:develop, Jun 1, 2024)
Conversation

@ashiklom (Collaborator) commented May 9, 2024

  • Refactor SLURM directives
  • Support model-specific SLURM directives
  • Add SLURM unit tests

Description

Support passing arbitrary SLURM configurations via a new option: `-s <slurmfile.yaml>` (e.g., `swell create 3dvar -s myslurm.yaml`).

The general idea is that more specific and explicit configurations always override generic and implicit (hard-coded) values.
Any values set in the `<slurmfile.yaml>` override the corresponding hard-coded defaults in the source code, so the `<slurmfile.yaml>` can be used to exert complete control over every possible SLURM directive.

The structure of the `<slurmfile.yaml>` SLURM directives is as follows:

# slurmfile.yaml

# Global user-specified default values. These apply to all tasks, and override
# all hard-coded values (including task-specific ones).
slurm_directives_global:
  account: x1234
  nodes: 1

# Task-specific settings. These always override the globals (above) and any
# hard-coded values.
slurm_directives_tasks:
  RunGEOSExecutable:
    # Values specific to this task for *all* models.
    all:
      nodes: 2   # <-- This implies nodes=2 for RunGEOSExecutable-geos_ocean
    # Overrides specifically for `RunGEOSExecutable-geos_atmosphere`
    geos_atmosphere:
      nodes: 4

Note: this also allows any valid (or invalid!) SLURM directive to be passed through to the configuration.

Inheritance logic and default values are embedded in the src/swell/utilities/slurm.py file.
The current way to add new default configurations is to modify that file.
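
For illustration, here is a minimal sketch of the precedence rules described above; the function and argument names are hypothetical, and the actual logic in src/swell/utilities/slurm.py may differ in detail:

```python
# A minimal sketch of the directive precedence described above; names are
# hypothetical and the real logic in src/swell/utilities/slurm.py may differ.
def resolve_directives(hardcoded_global, hardcoded_task,
                       user_global, user_task_all, user_task_model):
    """Merge directive layers from least to most specific; later layers win."""
    merged = {}
    for layer in (hardcoded_global,   # hard-coded defaults
                  hardcoded_task,     # hard-coded task-specific values
                  user_global,        # slurm_directives_global
                  user_task_all,      # slurm_directives_tasks.<Task>.all
                  user_task_model):   # slurm_directives_tasks.<Task>.<model>
        merged.update(layer)
    return merged

# With the slurmfile.yaml above, RunGEOSExecutable-geos_atmosphere resolves to:
print(resolve_directives({}, {}, {'account': 'x1234', 'nodes': 1},
                         {'nodes': 2}, {'nodes': 4}))
# {'account': 'x1234', 'nodes': 4}
```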

Dependencies

None

Impact

Closes #308 and #325.

@ashiklom requested review from Dooruk and rtodling May 9, 2024 14:47
@ashiklom self-assigned this May 9, 2024
@ashiklom marked this pull request as draft May 9, 2024 14:57
@Dooruk (Collaborator) requested changes May 9, 2024

Not sure if it was "swell create"-ready, but it seems SLURM needs all of the original attributes. After running swell create 3dvar, I received the following error:

jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'execution_time_limit'

I really like the utility here; this will be a major update to the current setup!

Is the idea to host some of these default dictionaries under Swell/platforms (Discover SLES15 vs. SLES12, or even AWS, etc.), or just host everything externally under swell-slurm.yaml and keep the hard-coded ones in slurm.py?

@ashiklom (Collaborator, Author) commented May 9, 2024

Sorry, this PR was a bit premature. There are still a few bugs I need to fix.

I'll drop the "Draft" tag once this is ready for formal review (though comments are welcome at any time!).

Is the idea to host some of these default dictionaries under Swell/platforms (Discover SLES15 vs. SLES12, or even AWS, etc.), or just host everything externally under swell-slurm.yaml and keep the hard-coded ones in slurm.py?

We should think this through for future development. The most obvious next step is to add Questionary questions for SLURM customization, and then maybe some Questionary presets for common cases. Now that all of the SLURM logic is in Python, we can do lots of sophisticated stuff: heuristics based on the experiment configuration, automatic detection of the current system (and maybe even customization based on current system load), etc.

@ashiklom marked this pull request as ready for review May 10, 2024 20:12
@ashiklom (Collaborator, Author) commented

OK, I think this is ready for review. NOTE that this initial pass is a little bit hacky in its implementation:

  1. I added a new -s / --slurm somefile.yaml option to swell create that reads the SLURM config from a special YAML file. (This part of the implementation I'm OK with).
  2. Currently, the SLURM information from (1) is appended to the experiment_dict (and ultimately, to experiment.yaml) as a separate step after all the logic producing experiment.yaml. That might be OK, at least for now, but it feels kludgy. We should either
    a. Commit to storing SLURM stuff in experiment_dict / experiment.yaml and add some Questionary stuff around it; or...
    b. Commit to SLURM stuff being a completely separate file/object and pass it around separately in the SWELL code.

I'm slightly inclined towards (b) here, not least because formulating this in terms of Questionary prompts sounds really complicated, and because SLURM settings seem more like a user- and platform-specific configuration that people might want to manage separately (e.g., gitignore) from the more shareable experiment.yaml/overrides.yaml. But you all have to use this --- what do you think?

@Dooruk self-requested a review May 13, 2024 17:54
@Dooruk dismissed their stale review May 13, 2024 17:54 ("out of date comments")

@Dooruk (Collaborator) left a comment

Thanks for working on this @ashiklom! This is a good step that would allow us to implement layout and model resolution dependencies and set defaults for them.

I'm voting for 2.b) Commit to SLURM stuff being a completely separate file/object and pass it around separately in the SWELL code.

I think Questionary (for JEDI config) and slurm.yamls require different logic.

A minor comment from testing: using swell create without the -s flag created SLURM directives missing some important components that are not defined in the global defaults (nodes and ntasks_per_node), so some SLURM tasks failed as a result. I guess the fix will be contingent on how we want to proceed with storing slurm.yaml.

Another very minor comment: --job-name used to include model names, e.g., EvaObservations-geos_ocean; it would be nice to keep that if possible. Not super critical though.

[Resolved review thread on src/swell/utilities/slurm.py]
@ashiklom (Collaborator, Author) commented

I'm voting for 2.b) Commit to SLURM stuff being a completely separate file/object and pass it around separately in the SWELL code.

Agreed!

A minor comment from testing: using swell create without the -s flag created SLURM directives missing some important components that are not defined in the global defaults (nodes and ntasks_per_node), so some SLURM tasks failed as a result. I guess the fix will be contingent on how we want to proceed with storing slurm.yaml.

Thanks for catching this. I'll look for where these were defined previously and try to add them back in. The -s slurm.yaml should be completely optional and this PR shouldn't break old workflows, so I'll need to fix that before this gets merged.

Can you point me to which specific tasks need this?

Another very minor comment: --job-name used to include model names, e.g., EvaObservations-geos_ocean; it would be nice to keep that if possible. Not super critical though.

This should work (https://github.com/GEOS-ESM/swell/pull/339/files#diff-68c8167035a32267716d03f8e6a0ce35fe12d2368732e07fabf202cc008a9e74R118-R120). Not sure why that's being ignored. I'll investigate.

@Dooruk (Collaborator) commented May 13, 2024

Can you point me to which specific tasks need this?

As far as I'm aware, every SLURM task needs node information. Tasks per node might be optional, but that goes hand in hand with the model layout. If MOM6 uses a 24x30 layout, I dedicate 20 nodes and 36 tasks per node.
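
(For reference, the arithmetic behind those numbers: a 24x30 layout is 24 × 30 = 720 MPI tasks, and 720 tasks / 36 tasks per node = 20 nodes.)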

This should work (https://github.com/GEOS-ESM/swell/pull/339/files#diff-68c8167035a32267716d03f8e6a0ce35fe12d2368732e07fabf202cc008a9e74R118-R120). Not sure why that's being ignored. I'll investigate.

Hmm, you are right. I wonder if - has a specific meaning in templating?

@ashiklom (Collaborator, Author) commented

When I run swell create 3dvar, I get flow.cylc files that set the job-name correctly (with both task name and model component); for example:

    [[EvaObservations-geos_ocean]]
        script = "swell task EvaObservations $config -d $datetime -m geos_ocean"
        platform = nccs_discover
        execution time limit = PT30M
        [[[directives]]]
            --job-name = EvaObservations-geos_ocean
            --account = g0613
            --qos = allnccs
            --partition = None
            --nodes = 1
            --ntasks_per_node = 24
            --constraint = cas|sky

Can you confirm that you're not getting this, and if so, send me the command you're running so I can debug?

ashiklom added 12 commits May 13, 2024 19:17
- Read global and task-specific directives from experiment YAML
- Replace hard-coded [[[directives]]] field with Jinja loop for more
  flexibility

NOTE: Custom interactions between SWELL tasks and models (e.g.,
EvaObservations-geos_atmosphere) are still unresolved. Will need to think
about this design.

See GEOS-ESM#308 and GEOS-ESM#325
Add support for model-specific task directives (e.g.,
`EvaObservations-geos_atmosphere`).

The new `experiment.yaml` structure is as follows:

```yaml
slurm_directives_globals:
  account: x1234
  nodes: 1
slurm_directives_tasks:
  RunGEOSExecutable:
    # Defaults for this task for all models
    all:
      nodes: 2
    # Overrides for `RunGEOSExecutable-geos_atmosphere`
    geos_atmosphere:
      nodes: 4
```

See GEOS-ESM#308 and GEOS-ESM#325.
Additional option `-s somefile.yaml` for specifying additional SLURM
directives.

Note that the implementation here adds the SLURM overrides to the
internal `experiment_dict` (and, ultimately, the `experiment.yaml`) with
a bit of error checking to make sure we _only_ use this for SLURM
directives. **This is a bit hacky** --- we should probably either keep this
information completely separate from the `experiment_dict`, or make it a
proper first-class citizen, with Questionary support, etc.
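
The "Jinja loop" mentioned in the first commit message above can be illustrated with a small self-contained snippet; this is a sketch of the idea (rendered here via Python's jinja2 library), not the actual swell template:

```python
# Render a [[[directives]]] block from an arbitrary directives dict,
# translating underscore keys into dashed SLURM flag names.
import jinja2

template = jinja2.Template(
    "[[[directives]]]"
    "{% for key, value in directives.items() %}\n"
    "    --{{ key | replace('_', '-') }} = {{ value }}"
    "{% endfor %}"
)
print(template.render(directives={"nodes": 1, "ntasks_per_node": 24}))
# [[[directives]]]
#     --nodes = 1
#     --ntasks-per-node = 24
```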
@Dooruk (Collaborator) commented May 14, 2024

OK, I tested it again just now; the SLURM override functionality works. And the default configuration produced:

    [[GenerateBClimatology-geos_ocean]]
        script = "swell task GenerateBClimatology $config -d $datetime -m geos_ocean"
        platform = nccs_discover
        execution time limit = PT1H
        [[[directives]]]
            --job-name = GenerateBClimatology-geos_ocean
            --account = g0613
            --qos = advda
            --partition = None
            --nodes = 1
            --ntasks_per_node = 24
            --constraint = cas|sky

    [[RunJediVariationalExecutable-geos_ocean]]
        script = "swell task RunJediVariationalExecutable $config -d $datetime -m geos_ocean"
        platform = nccs_discover
        execution time limit = PT1H
        [[[directives]]]
            --job-name = RunJediVariationalExecutable-geos_ocean
            --account = g0613
            --qos = advda
            --partition = None
            --nodes = 3
            --ntasks_per_node = 36
            --constraint = cas|sky

@Dooruk (Collaborator) left a comment

Ok, I caught two things:

  1. It is ntasks-per-node, not ntasks_per_node.
  2. It looks like partition needs special handling. My SLURM submission failed because None is an invalid value. Below is from the original code: partition is only written into the flow.cylc if it exists. Perhaps we can add a final check before returning target_dict and erase partition keys if they are None? I don't know if there is a better solution.

{% if scheduling["BuildGeos"]["partition"] %}
            --partition={{scheduling["BuildGeos"]["partition"]}}
{% endif %}
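
A minimal sketch of that suggested final check, assuming target_dict is a flat mapping of directive names to values (the real structure in slurm.py may differ):

```python
def drop_none_directives(target_dict):
    """Erase keys (e.g. 'partition') whose value is None, so lines like
    `--partition = None` never reach the flow.cylc file."""
    return {key: value for key, value in target_dict.items()
            if value is not None}

print(drop_none_directives({'partition': None, 'nodes': 1}))
# {'nodes': 1}
```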

@Dooruk (Collaborator) commented May 15, 2024

I had to make changes to ntasks-per-node locally during my runs on another branch, and I just noticed that the original swell_create_experiment.py also has it as ntasks_per_node. Apparently, that gets rendered as ntasks-per-node through the jinja2 handling.

From the original code:

            --ntasks-per-node={{scheduling["BuildJedi"]["ntasks_per_node"]}}
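
To be precise about that handling, and assuming standard Jinja2 semantics: the dashed flag name is written literally by the template, and only the dictionary lookup uses the underscore form, so no automatic conversion is involved. A self-contained check:

```python
# The dash form is literal template text; the underscore form is the dict key.
import jinja2

scheduling = {"BuildJedi": {"ntasks_per_node": 24}}
line = jinja2.Template(
    '--ntasks-per-node={{scheduling["BuildJedi"]["ntasks_per_node"]}}'
).render(scheduling=scheduling)
assert line == "--ntasks-per-node=24"
```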

@ashiklom (Collaborator, Author) commented May 16, 2024

@Dooruk Thanks for catching all of that! Sorry for the sloppiness. I'll try to test things more thoroughly before submitting next time. You will have to show me your preferred approach to running some quick SWELL tests without launching massive GEOS simulations, so I can catch some of these things myself before submitting PRs.

The partition issue is addressed in c670067 (I just removed it from the global defaults), ntasks-per-node is fixed in 1890a54, and I added a validation step in a4d9a73.

@Dooruk (Collaborator) commented May 16, 2024

@Dooruk Thanks for catching all of that!

Happy to help. I ran your branch (after making a local copy) through the Tier 1 tests and some failed; the swell code tests are also failing. Let's stay on after the CI meeting tomorrow and I can briefly walk you through what I do.

https://github.com/GEOS-ESM/swell/actions/runs/9118250195

@Dooruk (Collaborator) commented May 28, 2024

Hi @ashiklom, I just ran this branch again and the Tier 1 tests worked this time. I didn't touch anything, so perhaps something was messed up with the GitHub runners the previous time?

The code tests are still failing, though; since this is on your personal repo, you'll need to fix that, if you don't mind. Thanks!

@ashiklom (Collaborator, Author) commented May 30, 2024

the Tier 1 tests worked this time

I think this is because the previous tests were broken by a JEDI update (hence #345) that has since been resolved.

The code tests are failing though

My unit tests worked as expected; I had just forgotten to update them. But it's good to know they throw errors in situations where there should be an error! Should be fixed now.

@Dooruk (Collaborator) left a comment

Ok, Tier 1 tests work; we are really close.

model_components is not in every experiment dict, so that is the final issue before we merge this; my mod should fix it.

[Resolved review thread (outdated) on src/swell/utilities/slurm.py]
@Dooruk self-requested a review June 1, 2024 18:39

@Dooruk (Collaborator) commented Jun 1, 2024

Thanks, all looks good!

@Dooruk merged commit b3995db into GEOS-ESM:develop Jun 1, 2024
22 checks passed
Successfully merging this pull request may close these issues:

  • SE: SLURM Nodes/Tasks wired in SWELL