Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feat] Add new command line option to distribute single node jobs on multiple cluster nodes #2458

Merged
merged 53 commits into from
Apr 13, 2022

Conversation

ekouts
Copy link
Contributor

@ekouts ekouts commented Mar 3, 2022

In this PRs there two new features for the slurm and local partitions:

1. The job object includes the pin_nodes attribute.

class Foo(rfm.RegressionTest):
    ...
    # Pin nodes for compilation jobs
    @run_before('compile')
    def pin_compilation_node(self):
        if self.local or self.build_locally:
            return

        self.build_job.pin_nodes = ['nid004']

    # Pin nodes for run jobs
    @run_before('run')
    def pin_run_node(self):
        if self.local or self.build_locally:
            return

        self.job.pin_nodes = ['nid001', 'nid002']

The attribute has an effect only on Slurm. It will pass --nodelist={} on the job scripts with an abbreviated string of the nodes.
On the other schedulers it is ignored.

2. The user can pass the argument --distribute and parameterize all the tests from the cli.

By default ReFrame will parameterize all the selected tests to all the nodes of the valid partitions in the requested state, taking into account the cli job options.

  • ReFrame will first filter all the tests based on the filtering options the user passes (-n, --system etc) and then it will dynamically create dynamic parameterized tests on each partition for all the nodes.
  • We can choose the state of nodes that we want to take into account. By passing --distribute=idle we get all nodes in state idle. To get all nodes of the partition we need to pass --distribute=all .

Closes #2334 .

@pep8speaks
Copy link

pep8speaks commented Mar 3, 2022

Hello @ekouts, Thank you for updating!

Cheers! There are no PEP8 issues in this Pull Request!Do see the ReFrame Coding Style Guide

Comment last updated at 2022-04-13 15:57:30 UTC

@codecov-commenter
Copy link

codecov-commenter commented Mar 3, 2022

Codecov Report

Merging #2458 (9f6d4ed) into master (8c178ab) will increase coverage by 0.06%.
The diff coverage is 82.75%.

@@            Coverage Diff             @@
##           master    #2458      +/-   ##
==========================================
+ Coverage   85.81%   85.87%   +0.06%     
==========================================
  Files          57       58       +1     
  Lines       10705    10772      +67     
==========================================
+ Hits         9186     9250      +64     
- Misses       1519     1522       +3     
Impacted Files Coverage Δ
reframe/core/meta.py 99.38% <ø> (+0.30%) ⬆️
reframe/frontend/cli.py 71.33% <52.94%> (+0.28%) ⬆️
reframe/frontend/statistics.py 94.17% <55.55%> (-0.46%) ⬇️
reframe/core/schedulers/slurm.py 52.29% <66.66%> (-0.02%) ⬇️
reframe/frontend/distribute_tests.py 96.29% <96.29%> (ø)
reframe/core/schedulers/__init__.py 98.29% <100.00%> (+<0.01%) ⬆️
reframe/core/schedulers/local.py 92.62% <100.00%> (+0.18%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8c178ab...9f6d4ed. Read the comment docs.

@ekouts ekouts marked this pull request as draft March 3, 2022 12:24
@ekouts
Copy link
Contributor Author

ekouts commented Apr 12, 2022

One thing we still need to understand is where to put the distribute_tests()

@vkarak I moved distribute_tests and getallnodes to a different file distribute_tests.py in the frontend. What do you think?

Copy link
Contributor

@vkarak vkarak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor things still in the code + the following that I discovered by running it.

  1. You need to define a formatting function for the $nid parameter so that it is formatted in a more friendly way. Practically convert the list into an abbreviated host list.
  2. I don't think that we do need the _D_ prefix/suffix. Since we do change the name based on the partition's name.
  3. The --distribute option should assume a default (perhaps idle) if no argument is passed.
  4. When I run this --distribute=idle -J nodelist='nid0000[1-3]' the test is not parametrised on just the three nodes I pass, but on all the idle nodes of the system.

Other than that, this feature is beautiful!

@ekouts
Copy link
Contributor Author

ekouts commented Apr 12, 2022

Just some minor things still in the code + the following that I discovered by running it.

1. You need to define a formatting function for the `$nid` parameter so that it is formatted in a more friendly way. Practically convert the list into an abbreviated host list.

2. I don't think that we do need the `_D_` prefix/suffix. Since we do change the name based on the partition's name.

3. The `--distribute` option should assume a default (perhaps `idle`) if no argument is passed.

4. When I run this `--distribute=idle -J nodelist='nid0000[1-3]'` the test is not parametrised on just the three nodes I pass, but on all the idle nodes of the system.

Other than that, this feature is beautiful!

  1. Done
  2. Should I completely remove it or put it in the end? Because in a comment in the review you suggest to put _D_ in the end.
  3. Done
  4. Strange.. This must be a bug in the filtering function of slurm because I had tried with -J w=nids and it is working.

@vkarak
Copy link
Contributor

vkarak commented Apr 13, 2022

Should I completely remove it or put it in the end? Because in a comment in the review you suggest to put D in the end.

I would probably remove it completely.

Copy link
Contributor

@vkarak vkarak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor suggestion still.

@vkarak
Copy link
Contributor

vkarak commented Apr 13, 2022

Strange.. This must be a bug in the filtering function of slurm because I had tried with -J w=nids and it is working.

Indeed it works with -w but not with nodelist. The same happens with the -J reservation. There's something probably wrong with the long options. I need to check.

Copy link
Contributor

@vkarak vkarak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me now!

@vkarak vkarak changed the title [feat] Support an alternative flexible allocation scheme that would submit unique copies of the same test on different nodes [feat] Add new command line option to distribute single node jobs on multiple cluster nodes Apr 13, 2022
@vkarak vkarak merged commit f23b11e into reframe-hpc:master Apr 13, 2022
@ekouts ekouts deleted the feat/alt_flex_alloc branch April 13, 2022 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants