
Flexible SLURM configuration #339

Merged (18 commits into GEOS-ESM:develop, Jun 1, 2024)
Conversation

@ashiklom (Collaborator) commented May 9, 2024

  • Refactor SLURM directives
  • Support model-specific SLURM directives
  • Add SLURM unit tests

Description

Support passing arbitrary SLURM configurations via a new option: `-s <slurmfile.yaml>` (e.g., `swell create 3dvar -s myslurm.yaml`).

The general idea is that more specific and explicit configurations always override generic and implicit (hard-coded) values.
Any values set in the `<slurmfile.yaml>` override the corresponding hard-coded defaults in the source code, so the `<slurmfile.yaml>` can be used to exert complete control over every possible SLURM directive.

The structure of the `<slurmfile.yaml>` SLURM directives is as follows:

# slurmfile.yaml

# Global user-specified default values. These apply to all tasks, and override
# all hard-coded values (including task-specific ones).
slurm_directives_global:
  account: x1234
  nodes: 1

# Task-specific settings. These always override the globals (above) and any
# hard-coded values.
slurm_directives_tasks:
  RunGEOSExecutable:
    # Values specific to this task for *all* models.
    all:
      nodes: 2   # <-- This implies nodes=2 for RunGEOSExecutable-geos_ocean
    # Overrides specifically for `RunGEOSExecutable-geos_atmosphere`
    geos_atmosphere:
      nodes: 4

Note: this also allows any valid (or invalid!) SLURM directive to be passed through to the configuration.

Inheritance logic and default values are embedded in the src/swell/utilities/slurm.py file.
The current way to add new default configurations is to modify that file.
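
For illustration, here is a minimal sketch of the precedence rules described above; the function and argument names are hypothetical, and the actual logic in src/swell/utilities/slurm.py may differ in detail:

```python
# A minimal sketch of the directive precedence described above; names are
# hypothetical and the real logic in src/swell/utilities/slurm.py may differ.
def resolve_directives(hardcoded_global, hardcoded_task,
                       user_global, user_task_all, user_task_model):
    """Merge directive layers from least to most specific; later layers win."""
    merged = {}
    for layer in (hardcoded_global,   # hard-coded defaults
                  hardcoded_task,     # hard-coded task-specific values
                  user_global,        # slurm_directives_global
                  user_task_all,      # slurm_directives_tasks.<Task>.all
                  user_task_model):   # slurm_directives_tasks.<Task>.<model>
        merged.update(layer)
    return merged

# With the slurmfile.yaml above, RunGEOSExecutable-geos_atmosphere resolves to:
print(resolve_directives({}, {}, {'account': 'x1234', 'nodes': 1},
                         {'nodes': 2}, {'nodes': 4}))
# {'account': 'x1234', 'nodes': 4}
```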

Dependencies

None

Impact

Closes #308 and #325.

@ashiklom requested review from Dooruk and rtodling May 9, 2024 14:47
@ashiklom self-assigned this May 9, 2024
@ashiklom marked this pull request as draft May 9, 2024 14:57
@Dooruk (Collaborator) requested changes May 9, 2024

Not sure if it was "swell create"-ready, but it seems SLURM needs all of the original attributes. After running swell create 3dvar, I received the following error:

jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'execution_time_limit'

I really like the utility here; this will be a major update to the current setup!

Is the idea to host some of these default dictionaries under Swell/platforms (Discover SLES15 vs. SLES12, or even AWS, etc.), or just host everything externally under swell-slurm.yaml and keep the hard-coded ones in slurm.py?

@ashiklom (Collaborator, Author) commented May 9, 2024

Sorry, this PR was a bit premature. There are still a few bugs I need to fix.

I'll drop the "Draft" tag once this is ready for formal review (though comments are welcome at any time!).

Is the idea to host some of these default dictionaries under Swell/platforms (Discover SLES15 vs. SLES12, or even AWS, etc.), or just host everything externally under swell-slurm.yaml and keep the hard-coded ones in slurm.py?

We should think this through for future development. The most obvious next step is to add Questionary questions for SLURM customization, and then maybe some Questionary presets for common cases. Now that all of the SLURM logic is in Python, we can do lots of sophisticated stuff: heuristics based on the experiment configuration, automatic detection of the current system (and maybe even customization based on current system load), etc.

@ashiklom marked this pull request as ready for review May 10, 2024 20:12
@ashiklom (Collaborator, Author) commented

OK, I think this is ready for review. NOTE that this initial pass is a little bit hacky in its implementation:

  1. I added a new -s / --slurm somefile.yaml option to swell create that reads the SLURM config from a special YAML file. (This part of the implementation I'm OK with).
  2. Currently, the SLURM information from (1) is appended to the experiment_dict (and ultimately, to experiment.yaml) as a separate step after all the logic producing experiment.yaml. That might be OK, at least for now, but it feels kludgy. We should either
    a. Commit to storing SLURM stuff in experiment_dict / experiment.yaml and add some Questionary stuff around it; or...
    b. Commit to SLURM stuff being a completely separate file/object and pass it around separately in the SWELL code.

I'm slightly inclined towards (b) here, not least because formulating this in terms of Questionary prompts sounds really complicated, and because SLURM settings seem more like a user- and platform-specific configuration that people might want to manage separately (e.g., gitignore) from the more shareable experiment.yaml/overrides.yaml. But you all have to use this --- what do you think?

@Dooruk self-requested a review May 13, 2024 17:54
@Dooruk dismissed their stale review May 13, 2024 17:54 ("out of date comments")

@Dooruk (Collaborator) left a comment

Thanks for working on this @ashiklom! This is a good step that would allow us to implement layout and model resolution dependencies and set defaults for them.

I'm voting for 2.b) Commit to SLURM stuff being a completely separate file/object and pass it around separately in the SWELL code.

I think Questionary (for JEDI config) and slurm.yamls require different logic.

A minor comment from testing: using swell create without the -s flag created SLURM directives missing some important components that are not defined in the global defaults (nodes and ntasks_per_node), so some SLURM tasks failed as a result. I guess the fix will be contingent on how we want to proceed with storing slurm.yaml.

Another very minor comment: --job-name used to include model names, e.g., EvaObservations-geos_ocean; it would be nice to keep that if possible. Not super critical though.

[Resolved review thread on src/swell/utilities/slurm.py]
@ashiklom (Collaborator, Author) commented

I'm voting for 2.b) Commit to SLURM stuff being a completely separate file/object and pass it around separately in the SWELL code.

Agreed!

A minor comment from testing: using swell create without the -s flag created SLURM directives missing some important components that are not defined in the global defaults (nodes and ntasks_per_node), so some SLURM tasks failed as a result. I guess the fix will be contingent on how we want to proceed with storing slurm.yaml.

Thanks for catching this. I'll look for where these were defined previously and try to add them back in. The -s slurm.yaml should be completely optional and this PR shouldn't break old workflows, so I'll need to fix that before this gets merged.

Can you point me to which specific tasks need this?

Another very minor comment: --job-name used to include model names, e.g., EvaObservations-geos_ocean; it would be nice to keep that if possible. Not super critical though.

This should work (https://github.com/GEOS-ESM/swell/pull/339/files#diff-68c8167035a32267716d03f8e6a0ce35fe12d2368732e07fabf202cc008a9e74R118-R120). Not sure why that's being ignored. I'll investigate.

@Dooruk (Collaborator) commented May 13, 2024

Can you point me to which specific tasks need this?

As far as I'm aware, every SLURM task needs node information. Tasks per node might be optional, but that goes hand in hand with the model layout. If MOM6 uses a 24x30 layout, I dedicate 20 nodes and 36 tasks per node.
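
(For reference, the arithmetic behind those numbers: a 24x30 layout is 24 × 30 = 720 MPI tasks, and 720 tasks / 36 tasks per node = 20 nodes.)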

This should work (https://github.com/GEOS-ESM/swell/pull/339/files#diff-68c8167035a32267716d03f8e6a0ce35fe12d2368732e07fabf202cc008a9e74R118-R120). Not sure why that's being ignored. I'll investigate.

Hmm, you are right. I wonder if - has a specific meaning in templating?

@ashiklom (Collaborator, Author) commented

When I run swell create 3dvar, I get flow.cylc files that set the job-name correctly (with both task name and model component); for example:

    [[EvaObservations-geos_ocean]]
        script = "swell task EvaObservations $config -d $datetime -m geos_ocean"
        platform = nccs_discover
        execution time limit = PT30M
        [[[directives]]]
            --job-name = EvaObservations-geos_ocean
            --account = g0613
            --qos = allnccs
            --partition = None
            --nodes = 1
            --ntasks_per_node = 24
            --constraint = cas|sky

Can you confirm that you're not getting this, and if so, send me the command you're running so I can debug?

ashiklom added 12 commits May 13, 2024 19:17
- Read global and task-specific directives from experiment YAML
- Replace hard-coded [[[directives]]] field with Jinja loop for more
  flexibility

NOTE: Custom interactions between SWELL tasks and models (e.g.,
EvaObservations-geos_atmosphere) are still unresolved. Will need to think
about this design.

See GEOS-ESM#308 and GEOS-ESM#325
Add support for model-specific task directives (e.g.,
`EvaObservations-geos_atmosphere`).

The new `experiment.yaml` structure is as follows:

```yaml
slurm_directives_globals:
  account: x1234
  nodes: 1
slurm_directives_tasks:
  RunGEOSExecutable:
    # Defaults for this task for all models
    all:
      nodes: 2
    # Overrides for `RunGEOSExecutable-geos_atmosphere`
    geos_atmosphere:
      nodes: 4
```

See GEOS-ESM#308 and GEOS-ESM#325.
Additional option `-s somefile.yaml` for specifying additional SLURM
directives.

Note that the implementation here adds the SLURM overrides to the
internal `experiment_dict` (and, ultimately, the `experiment.yaml`) with
a bit of error checking to make sure we _only_ use this for SLURM
directives. **This is a bit hacky** --- we should probably either keep this
information completely separate from the `experiment_dict`, or make it a
proper first-class citizen, with Questionary support, etc.
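
The "Jinja loop" mentioned in the first commit message above can be illustrated with a small self-contained snippet; this is a sketch of the idea (rendered here via Python's jinja2 library), not the actual swell template:

```python
# Render a [[[directives]]] block from an arbitrary directives dict,
# translating underscore keys into dashed SLURM flag names.
import jinja2

template = jinja2.Template(
    "[[[directives]]]"
    "{% for key, value in directives.items() %}\n"
    "    --{{ key | replace('_', '-') }} = {{ value }}"
    "{% endfor %}"
)
print(template.render(directives={"nodes": 1, "ntasks_per_node": 24}))
# [[[directives]]]
#     --nodes = 1
#     --ntasks-per-node = 24
```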
@Dooruk (Collaborator) commented May 14, 2024

OK, I tested it again just now; the SLURM override functionality works. And the default configuration produced:

    [[GenerateBClimatology-geos_ocean]]
        script = "swell task GenerateBClimatology $config -d $datetime -m geos_ocean"
        platform = nccs_discover
        execution time limit = PT1H
        [[[directives]]]
            --job-name = GenerateBClimatology-geos_ocean
            --account = g0613
            --qos = advda
            --partition = None
            --nodes = 1
            --ntasks_per_node = 24
            --constraint = cas|sky

    [[RunJediVariationalExecutable-geos_ocean]]
        script = "swell task RunJediVariationalExecutable $config -d $datetime -m geos_ocean"
        platform = nccs_discover
        execution time limit = PT1H
        [[[directives]]]
            --job-name = RunJediVariationalExecutable-geos_ocean
            --account = g0613
            --qos = advda
            --partition = None
            --nodes = 3
            --ntasks_per_node = 36
            --constraint = cas|sky

@Dooruk (Collaborator) left a comment

Ok, I caught two things:

  1. It is ntasks-per-node, not ntasks_per_node.
  2. It looks like partition needs special handling. My SLURM submission failed because None is an invalid value. Below is from the original code: partition is only written into the flow.cylc if it exists. Perhaps we can add a final check before returning target_dict and erase partition keys if they are None? I don't know if there is a better solution.

{% if scheduling["BuildGeos"]["partition"] %}
            --partition={{scheduling["BuildGeos"]["partition"]}}
{% endif %}
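
A minimal sketch of that suggested final check, assuming target_dict is a flat mapping of directive names to values (the real structure in slurm.py may differ):

```python
def drop_none_directives(target_dict):
    """Erase keys (e.g. 'partition') whose value is None, so lines like
    `--partition = None` never reach the flow.cylc file."""
    return {key: value for key, value in target_dict.items()
            if value is not None}

print(drop_none_directives({'partition': None, 'nodes': 1}))
# {'nodes': 1}
```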

@Dooruk (Collaborator) commented May 15, 2024

I had to make changes to ntasks-per-node locally during my runs on another branch, and I just noticed that the original swell_create_experiment.py also has it as ntasks_per_node. Apparently, that gets rendered as ntasks-per-node through the jinja2 handling.

From the original code:

            --ntasks-per-node={{scheduling["BuildJedi"]["ntasks_per_node"]}}
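
To be precise about that handling, and assuming standard Jinja2 semantics: the dashed flag name is written literally by the template, and only the dictionary lookup uses the underscore form, so no automatic conversion is involved. A self-contained check:

```python
# The dash form is literal template text; the underscore form is the dict key.
import jinja2

scheduling = {"BuildJedi": {"ntasks_per_node": 24}}
line = jinja2.Template(
    '--ntasks-per-node={{scheduling["BuildJedi"]["ntasks_per_node"]}}'
).render(scheduling=scheduling)
assert line == "--ntasks-per-node=24"
```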

@ashiklom (Collaborator, Author) commented May 16, 2024

@Dooruk Thanks for catching all of that! Sorry for the sloppiness. I'll try to test things more thoroughly before submitting next time. You will have to show me your preferred approach to running some quick SWELL tests without launching massive GEOS simulations, so I can catch some of these things myself before submitting PRs.

The partition issue is addressed in c670067 (I just removed it from the global defaults), ntasks-per-node is fixed in 1890a54, and I added a validation step in a4d9a73.

@Dooruk (Collaborator) commented May 16, 2024

@Dooruk Thanks for catching all of that!

Happy to help. I ran your branch (after making a local copy) through the Tier 1 tests and some failed; the swell code tests are also failing. Let's stay on after the CI meeting tomorrow and I can briefly walk you through what I do.

https://github.com/GEOS-ESM/swell/actions/runs/9118250195

@Dooruk (Collaborator) commented May 28, 2024

Hi @ashiklom, I just ran this branch again and the Tier 1 tests worked this time. I didn't touch anything, so perhaps something was messed up with the GitHub runners the previous time?

The code tests are still failing, though; since this is on your personal repo, you'll need to fix that, if you don't mind. Thanks!

@ashiklom (Collaborator, Author) commented May 30, 2024

the Tier 1 tests worked this time

I think this is because the previous tests were broken by a JEDI update (hence #345) that has since been resolved.

The code tests are failing though

My unit tests worked as expected; I had just forgotten to update them. But it's good to know they throw errors in situations where there should be an error! Should be fixed now.

@Dooruk (Collaborator) left a comment

Ok, Tier 1 tests work; we are really close.

model_components is not in every experiment dict, so that is the final issue before we merge this; my mod should fix it.

[Resolved review thread (outdated) on src/swell/utilities/slurm.py]
@Dooruk self-requested a review June 1, 2024 18:39

@Dooruk (Collaborator) commented Jun 1, 2024

Thanks, all looks good!

@Dooruk merged commit b3995db into GEOS-ESM:develop Jun 1, 2024
22 checks passed
Successfully merging this pull request may close these issues:

  • SE: SLURM Nodes/Tasks wired in SWELL