Flexible SLURM configuration #339
Conversation
Not sure if it was `swell create`-ready, but it seems SLURM needs all of the original attributes. After running `swell create 3dvar`, I received the following error:
`jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'execution_time_limit'`
I really like the utility here; this will be a major update to the current setup! Is the idea to host some of these default dictionaries under Swell/platforms (Discover SLES15 vs. SLES12, or even AWS, etc.), or just host everything under an external `swell-slurm.yaml` and keep the hard-coded ones in `slurm.py`?
Sorry, this PR was a bit premature. There are still a few bugs I need to fix. I'll drop the "Draft" tag once this is ready for formal review (though comments are welcome at any time!).
We should think through this for future development. The most obvious next step is to add Questionary questions for SLURM customization, and then maybe add some Questionary presets for common cases. Now that all of the SLURM logic is in Python, we can do lots of sophisticated stuff: heuristics based on the experiment configuration, automatic detection of the current system (and maybe even customization based on current system load), etc.
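For instance, a minimal sketch of what such a heuristic layer could look like (the function name and experiment fields here are hypothetical, not part of the current swell code):

```python
# Hypothetical sketch of configuration-driven SLURM heuristics; the
# function name and experiment fields are assumptions for illustration.
import platform


def heuristic_directives(experiment: dict) -> dict:
    """Guess reasonable SLURM defaults from the experiment configuration."""
    directives = {"nodes": 1, "ntasks-per-node": 1}
    # Larger model resolutions could map to more nodes (thresholds assumed).
    if experiment.get("horizontal_resolution") in ("C180", "C360"):
        directives["nodes"] = 4
    # Hostname-based detection could pick a platform-appropriate partition.
    if "discover" in platform.node():
        directives["partition"] = "compute"
    return directives
```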
OK, I think this is ready for review. Note that this initial pass is a little bit hacky in its implementation (see the commit messages below).
I'm slightly inclined towards (2) here, not least because trying to formulate this in terms of Questionary prompts sounds really complicated, and because SLURM settings seem more like a user- and platform-specific configuration that people might want to manage separately (e.g., …).
Thanks for working on this @ashiklom! This is a good step that would allow us to implement layout and model resolution dependencies and set defaults for them.
I'm voting for 2.b) Commit to SLURM stuff being a completely separate file/object and pass it around separately in the SWELL code.
I think Questionary (for the JEDI config) and the SLURM YAML files require different logic.
A minor comment while testing: using `swell create` without the `-s` flag created SLURM directives missing some of the important components not defined in the global defaults (`nodes` and `ntasks_per_node`), so some SLURM tasks failed as a result. I guess the fix will be contingent upon how we want to proceed with storing `slurm.yaml`.
Another very minor comment: `--job-name` used to have model names, e.g., `EvaObservations-geos_ocean`; it would be nice to keep that if possible. Not super critical though.
Agreed!
Thanks for catching this. I'll look for where these were defined previously and try to add them back in. Can you point me to which specific tasks need this?
This should work (https://github.com/GEOS-ESM/swell/pull/339/files#diff-68c8167035a32267716d03f8e6a0ce35fe12d2368732e07fabf202cc008a9e74R118-R120). Not sure why that's being ignored. I'll investigate.
As far as I'm aware, every SLURM task needs node information. Tasks per node might be optional, but that goes hand in hand with the model layout: if MOM6 uses a 24x30 layout, I dedicate 20 nodes with 36 tasks per node.
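To spell out that arithmetic, a hypothetical helper (not part of swell) might compute the node count from the layout:

```python
# A 24x30 MOM6 layout needs 24 * 30 = 720 MPI tasks; at 36 tasks
# per node that is ceil(720 / 36) = 20 nodes.
import math


def nodes_for_layout(layout_x: int, layout_y: int, ntasks_per_node: int) -> int:
    return math.ceil((layout_x * layout_y) / ntasks_per_node)


assert nodes_for_layout(24, 30, 36) == 20
```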
Hmm, you are right. Wonder if …
When I run … Can you confirm that you're not getting this, and if so, send me the command you're running so I can debug?
- Read global and task-specific directives from experiment YAML
- Replace hard-coded `[[[directives]]]` field with Jinja loop for more flexibility

NOTE: Custom interactions between SWELL tasks and models (e.g., `EvaObservations-geos_atmosphere`) are still unresolved. Will need to think about this design.

See GEOS-ESM#308 and GEOS-ESM#325
Add support for model-specific task directives (e.g., `EvaObservations-geos_atmosphere`). The new `experiment.yaml` structure is as follows:

```yaml
slurm_directives_globals:
  account: x1234
  nodes: 1
slurm_directives_tasks:
  RunGEOSExecutable:
    # Defaults for this task for all models
    all:
      nodes: 2
    # Overrides for `RunGEOSExecutable-geos_atmosphere`
    geos_atmosphere:
      nodes: 4
```

See GEOS-ESM#308 and GEOS-ESM#325.
Additional option `-s somefile.yaml` for specifying additional SLURM directives. Note that the implementation here adds the SLURM overrides to the internal `experiment_dict` (and, ultimately, the `experiment.yaml`) with a bit of error checking to make sure we _only_ use this for SLURM directives. **This is a bit hacky** --- we should probably either keep this information completely separate from the `experiment_dict`, or make it a proper first-class citizen, with Questionary support, etc.
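A minimal sketch of the kind of error checking that commit describes (the key names follow the `experiment.yaml` structure above, but the function itself is an assumption, not swell's actual implementation):

```python
# Hypothetical guard ensuring the -s override file only touches SLURM keys.
SLURM_KEYS = {"slurm_directives_globals", "slurm_directives_tasks"}


def merge_slurm_overrides(experiment_dict: dict, overrides: dict) -> dict:
    unexpected = set(overrides) - SLURM_KEYS
    if unexpected:
        raise ValueError(f"Non-SLURM keys in override file: {sorted(unexpected)}")
    experiment_dict.update(overrides)
    return experiment_dict
```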
Ok, I tested it just now again; the SLURM override functionality works. And the default produced: …
Ok, I caught two things:

- It is `ntasks-per-node`, not `ntasks_per_node`.
- So it looks like `partition` needs special handling. My SLURM submission failed because `None` is invalid. Below is from the original code: `partition` is only written in the `flow.cylc` if it exists. Perhaps we can add a final check before returning `target_dict` and erase `partition` keys if they are `None`? Idk if there is a better solution.
{% if scheduling["BuildGeos"]["partition"] %}
    --partition={{scheduling["BuildGeos"]["partition"]}}
{% endif %}
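A minimal sketch of that suggested final check (names assumed; not the actual swell code):

```python
# Drop any directive whose value is None before returning, so e.g.
# `--partition=None` never reaches the generated flow.cylc.
def drop_none_directives(target_dict: dict) -> dict:
    return {key: value for key, value in target_dict.items() if value is not None}
```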
I had to make changes to … From the original code: …
@Dooruk Thanks for catching all of that! Sorry for the sloppiness. I'll try to test things more thoroughly before submitting next time. You will have to show me your preferred approach to running some quick SWELL tests without launching massive GEOS simulations, so I can catch some of these things myself before submitting PRs.
Happy to help. I ran your branch (after making a local copy) via the Tier 1 tests and some failed; the swell code tests are also failing. Let's stay on after the CI meeting tomorrow and I can briefly walk you through what I do. https://github.com/GEOS-ESM/swell/actions/runs/9118250195
Hi @ashiklom, I just ran this branch again and the Tier 1 tests worked this time. I didn't touch anything, so perhaps something was messed up with the GitHub runners the previous time? The code tests are still failing, though; since this is on your personal repo, you'll need to fix that, if you don't mind. Thanks!
I think this is because the previous tests were broken by a JEDI update (hence #345) that has since been resolved.
My unit tests worked as expected! I had just forgotten to update them. But good to know they throw errors in situations where there should be an error! Should be fixed now.
Ok, Tier 1 tests work; we are really close. `model_components` is not in every experiment dict, so that is the final issue before we merge this. My mod should fix it.
Thanks, all looks good!
Description
Support passing arbitrary SLURM configurations via ~~`experiment.yaml`~~ a new option `-s <slurmfile.yaml>` (e.g., `swell create 3dvar -s myslurm.yaml`).

The general idea is that more specific and explicit configurations always override generic and implicit (hard-coded) values.
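As a toy illustration of that precedence (the account value is borrowed from the example below; the merge shown is plain dict layering, not swell's actual code):

```python
# More specific sources are applied last, so they win.
hard_coded = {"account": "x1234", "nodes": 1}  # generic, implicit defaults
user_file = {"nodes": 2}                       # from -s <slurmfile.yaml>
merged = {**hard_coded, **user_file}
assert merged == {"account": "x1234", "nodes": 2}
```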
Any values set in the ~~`experiment.yaml`~~ `<slurmfile.yaml>` override corresponding hard-coded defaults in the source code, so the ~~`experiment.yaml`~~ `<slurmfile.yaml>` can be used to exert complete control over every possible SLURM directive.

Note also: this allows any valid (or invalid!) SLURM directive to be passed to the configuration.

The structure of the ~~`experiment.yaml`~~ `<slurmfile.yaml>` SLURM directives is as follows:
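(This mirrors the example from the commit message above; `x1234` and the node counts are placeholder values.)

```yaml
slurm_directives_globals:
  account: x1234
  nodes: 1
slurm_directives_tasks:
  RunGEOSExecutable:
    # Defaults for this task for all models
    all:
      nodes: 2
    # Overrides for `RunGEOSExecutable-geos_atmosphere`
    geos_atmosphere:
      nodes: 4
```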
Inheritance logic and default values are embedded in the `src/swell/utilities/slurm.py` file. The current way to add new default configurations is to modify that file.
Dependencies
None
Impact
Closes #308 and #325.