
Support an alternative flexible allocation scheme that would submit unique copies of the same test on different nodes #2334

Closed
vkarak opened this issue Dec 10, 2021 · 2 comments · Fixed by #2458

Comments

@vkarak
Contributor

vkarak commented Dec 10, 2021

This is different from the current flexible tests, which submit a single test across multiple nodes. The problem with the current setup is that you probably need to change the benchmark you are running in order to include node information, which also makes much tougher the sanity and performance checking that would allow us to identify bad nodes.

@casparvl

casparvl commented Feb 1, 2022

In #2395 you gave some comments, but since we closed it as duplicate, let's continue here.

Even if you get the list of nodes of interest in a partition, you would still need to parametrize your test differently per partition.

I'm not sure I understand you correctly here. Do you mean we cannot simply do e.g.

parametrize_over_nodes = parameter(get_node_names())
valid_systems = ['sysA:part1', 'sysA:part2']

because the get_node_names() function would need to return a different node set for part1 and part2? If so, I understand, and that is indeed clearly a challenge.

But the workaround is to define the parameter to take the union of the parameter values and then filter in a post-init hook. For nodes, I agree, this is not so nice.

I guess that is the solution we essentially discussed on the Slack channel, right? To give an example, I now developed a test that does:

valid_systems = [
    'A:part1',
    'A:part2'
]

nodelist = {
    'A:part1': ['node1', 'node2', ...],
    'A:part2': ['node10', 'node11', ...],
}

node_list = []
for system in valid_systems:
    if system in nodelist:
        tmplist = []
        for node in nodelist[system]:
            tmplist.append(([system], node))  # I guess this is what you meant by the union of parameter values?
        node_list.extend(tmplist)
parametrize_over_nodes = parameter(node_list)

# And this is the post-init hook you refer to, to 'filter' the valid partitions?
@run_after('init')
def scope_systems(self):
    # The first element of the parameter tuple is the list of valid systems
    self.valid_systems = self.parametrize_over_nodes[0]

@run_before('run')
def prepare_singlenode_run(self):
    # Check if a single node name has been set (second element of the tuple)
    nodename = self.parametrize_over_nodes[1]
    if nodename:
        self.job.options = [f'--nodelist={nodename}']
...

(ignore the hardcoded nodelist for now, I'm sure we can do that much more elegantly when something like this is properly integrated in the framework). This works, but as you say, having to put the system into the tuple, so that the parameter essentially pre-expands all system + nodename combinations, is not very elegant. It seems like a lot of boilerplate code for something that is probably a pretty commonly desired execution pattern.

I would think that it might be possible to integrate this expansion in the framework though. I.e. the user would specify something in the class body to signal that this is a test of which one copy should be run on each node with a certain status, e.g.

one_per_node = 'IDLE'

The framework would then generate copies of the test, taking all valid permutations of valid_systems + node name combinations. This case doesn't seem so different from generating all valid permutations of valid_systems + valid_prog_environs. E.g. a test that does:

valid_prog_environs = ['foss', 'intel']
valid_systems = ['A:part1', 'A:part2']

and a ReFrame settings file that defines:

...
'name': 'part1',
'environs': ['foss', 'intel']
...
'name': 'part2',
'environs': ['foss']

Would also generate tests only for the valid combinations of systems + programming environments (i.e. A:part1-foss, A:part1-intel, A:part2-foss). Suppose that part1 contains node1 and node2, and part2 contains node3 and node4: I wouldn't think it would be too difficult to have the framework generate only A:part1-node1, A:part1-node2, A:part2-node3, A:part2-node4 as valid combinations, as this seems like a pretty 'similar' task.
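The expansion described above can be sketched outside of the framework. The helper below is hypothetical (not ReFrame API) and just shows the filtering logic: given a partition-to-nodes mapping, keep only the pairs whose partition appears in the test's valid_systems, mirroring how valid_systems × valid_prog_environs combinations are pruned:

```python
# Hypothetical sketch, not part of ReFrame: expand the valid
# (partition, node) combinations for a test.
def expand_node_combinations(valid_systems, partition_nodes):
    """Return (partition, node) pairs only for partitions the test declares."""
    return [
        (part, node)
        for part in valid_systems
        for node in partition_nodes.get(part, [])
    ]

pairs = expand_node_combinations(
    ['A:part1', 'A:part2'],
    {'A:part1': ['node1', 'node2'], 'A:part2': ['node3', 'node4']},
)
# pairs == [('A:part1', 'node1'), ('A:part1', 'node2'),
#           ('A:part2', 'node3'), ('A:part2', 'node4')]
```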

But then... I obviously don't know the framework as well as you guys, so maybe this is much more difficult for partition+nodename combinations than it is for partition+prog_env's... :)

@vkarak
Contributor Author

vkarak commented Feb 1, 2022

because the get_node_names() function would need to return a different node set for part1 and part2? If so, I understand, and that is indeed clearly a challenge.

Exactly.

I guess that is the solution we essentially discussed on the Slack channel, right?

Right.

It seems like a lot of boiler plate code for something that is probably a pretty commonly desired execution pattern.

Indeed, but actually it is not so much code. You can write this in a much simpler way:

import itertools

# NOTE: I'm using sets here for a quick lookup in `find_system` in case you have very long node lists.
nodelists = {
    'A:part1': {'node1', 'node2'},
    'A:part2': {'node10', 'node11'},
}

def find_system(node):
    for system, nodes in nodelists.items():
        if node in nodes:
            return system

class my_test(...):
    nodeid = parameter(itertools.chain(*nodelists.values()))

    @run_after('init')
    def scope_systems(self):
        self.valid_systems = [find_system(self.nodeid)]

In any case, I agree that trying to define a parameter based on the value of a variable is a recurring pattern that also arises in test libraries, so it should be addressed by the framework.

I think the solution you propose is not going in the right direction. First, I don't think that an additional test variable is needed. My idea is to allow users to run any single-node test flexibly without having to change anything in the test. Users could use tags to mark tests that they will run flexibly. I am leaning towards something like this:

reframe -t <tag> --one-per-node -J reservation=foo ... -r

I don't like very much the --one-per-node name, but you get my point.

Internally, all we need to do is programmatically create, for each system partition, a test with a different name, deriving from the original one, that is parameterized on the node list of the corresponding partition. Those tests would be generated only if the current system is listed in the valid_systems of the original test.
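The generation step could be sketched with plain Python class machinery. Everything below is hypothetical and outside ReFrame (in the framework, `node_param` would be a `parameter(...)` and partitions would come from the configuration); it only illustrates deriving one uniquely named test per matching partition:

```python
# Hypothetical sketch, not ReFrame internals: derive one test class per
# partition of the current system, each restricted to that partition
# and carrying that partition's node list.
class OriginalTest:
    valid_systems = ['A:part1', 'A:part2']

def generate_per_partition_tests(base, current_system_partitions):
    """current_system_partitions: mapping 'sys:part' -> list of node names."""
    generated = {}
    for part, nodes in current_system_partitions.items():
        if part not in base.valid_systems:
            continue  # skip partitions the original test does not target
        name = f'{base.__name__}_{part.replace(":", "_")}'
        generated[name] = type(name, (base,), {
            'valid_systems': [part],
            'node_param': list(nodes),  # would be parameter(nodes) in ReFrame
        })
    return generated

tests = generate_per_partition_tests(
    OriginalTest,
    {'A:part1': ['node1', 'node2'], 'A:part3': ['nodeX']},
)
# Only A:part1 matches valid_systems, so a single derived test is generated.
```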
