
Support an alternative flexible allocation scheme that would submit unique copies of the same test on different nodes #2334

Closed
vkarak opened this issue Dec 10, 2021 · 2 comments · Fixed by #2458

Comments

@vkarak
Contributor

vkarak commented Dec 10, 2021

This is different from the current flexible tests, which submit a single test across multiple nodes. The problem with the current setup is that you probably need to change the benchmark you are running in order to include node information, which also makes much tougher the sanity and performance checking that would allow us to identify bad nodes.

@casparvl

casparvl commented Feb 1, 2022

In #2395 you gave some comments, but since we closed it as duplicate, let's continue here.

Even if you get the list of nodes of interest in a partition, you would still need to parametrize your test differently per partition.

I'm not sure I understand you correctly here. Do you mean we cannot simply do e.g.

parametrize_over_nodes = parameter(get_node_names())
valid_systems = ['sysA:part1', 'sysA:part2']

because the get_node_names() function would need to return a different node set for part1 and part2? If so, I understand, and that is indeed clearly a challenge.

But the workaround is to define the parameter to take the union of the parameter values and then filter in a post-init hook. For nodes, I agree, this is not so nice.

I guess that is the solution we essentially discussed on the Slack channel, right? To give an example, I now developed a test that does:

valid_systems = [
    'A:part1',
    'A:part2'
]

nodelist = {
    'A:part1': ['node1', 'node2', ...],
    'A:part2': ['node10', 'node11', ...],
}

node_list = []
for system in valid_systems:
    if system in nodelist:
        tmplist = []
        for node in nodelist[system]:
            tmplist.append(([system], node))  # I guess this is what you meant by the union of parameter values?
        node_list.extend(tmplist)
parametrize_over_nodes = parameter(node_list)

# And this is the post-init hook you refer to, to 'filter' the valid partitions?
@run_after('init')
def scope_systems(self):
    # The first element of the parameter tuple is the list of valid systems
    self.valid_systems = self.parametrize_over_nodes[0]

@run_before('run')
def prepare_singlenode_run(self):
    # Check if a single node name has been set (second element of the tuple)
    nodename = self.parametrize_over_nodes[1]
    if nodename:
        self.job.options = [f'--nodelist={nodename}']
...

(ignore the hardcoded nodelist for now, I'm sure we can do that much more elegantly when something like this is properly integrated in the framework). This works, but as you say, having to put the system into the tuple, so that the parameter essentially pre-expands all system + nodename combinations, is not very elegant. It seems like a lot of boilerplate code for something that is probably a pretty commonly desired execution pattern.

I would think that it might be possible to integrate this expansion in the framework though. I.e. the user would specify something in the class body to signal that this is a test of which one copy should be run on each node with a certain status, e.g.

one_per_node = 'IDLE'

The framework would then generate copies of the test, taking all valid permutations of valid_systems + node name combinations. This case doesn't seem so different from generating all valid permutations of valid_systems + valid_prog_environs. E.g. a test that does:

valid_prog_environs = ['foss', 'intel']
valid_systems = ['A:part1', 'A:part2']

and a ReFrame settings file that defines:

...
'name': 'part1',
'environs': ['foss', 'intel']
...
'name': 'part2',
'environs': ['foss']

Would also generate tests only for the valid combinations of systems + programming environments (i.e. A:part1-foss, A:part1-intel, A:part2-foss). Suppose that part1 contains node1 and node2, and part2 contains node3 and node4: I wouldn't think it would be too difficult to have the framework generate only A:part1-node1, A:part1-node2, A:part2-node3, A:part2-node4 as valid combinations, as this seems like a pretty 'similar' task.
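The expansion described above can be sketched outside of the framework. The helper below is hypothetical (not ReFrame API) and just shows the filtering logic: given a partition-to-nodes mapping, keep only the pairs whose partition appears in the test's valid_systems, mirroring how valid_systems × valid_prog_environs combinations are pruned:

```python
# Hypothetical sketch, not part of ReFrame: expand the valid
# (partition, node) combinations for a test.
def expand_node_combinations(valid_systems, partition_nodes):
    """Return (partition, node) pairs only for partitions the test declares."""
    return [
        (part, node)
        for part in valid_systems
        for node in partition_nodes.get(part, [])
    ]

pairs = expand_node_combinations(
    ['A:part1', 'A:part2'],
    {'A:part1': ['node1', 'node2'], 'A:part2': ['node3', 'node4']},
)
# pairs == [('A:part1', 'node1'), ('A:part1', 'node2'),
#           ('A:part2', 'node3'), ('A:part2', 'node4')]
```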

But then... I obviously don't know the framework as well as you guys, so maybe this is much more difficult for partition+nodename combinations than it is for partition+prog_env's... :)

@vkarak
Contributor Author

vkarak commented Feb 1, 2022

because the get_node_names() function would need to return a different node set for part1 and part2? If so, I understand, and that is indeed clearly a challenge.

Exactly.

I guess that is the solution we essentially discussed on the Slack channel, right?

Right.

It seems like a lot of boiler plate code for something that is probably a pretty commonly desired execution pattern.

Indeed, but actually it is not so much code. You can write this in a much simpler way:

import itertools

# NOTE: I'm using sets here for a quick lookup in `find_system` in case you have very long node lists.
nodelists = {
    'A:part1': {'node1', 'node2'},
    'A:part2': {'node10', 'node11'},
}

def find_system(node):
    for system, nodes in nodelists.items():
        if node in nodes:
            return system

class my_test(...):
    nodeid = parameter(itertools.chain(*nodelists.values()))

    @run_after('init')
    def scope_systems(self):
        self.valid_systems = [find_system(self.nodeid)]

In any case, I agree that trying to define a parameter based on the value of a variable is a recurring pattern that also arises in test libraries, so it should be addressed by the framework.

I think the solution you propose is not going in the right direction. First, I don't think that an additional test variable is needed. My idea is to allow users to run any single-node test flexibly without having to change anything in the test. Users could use tags to mark tests that they will run flexibly. I am leaning towards something like this:

reframe -t <tag> --one-per-node -J reservation=foo ... -r

I don't like very much the --one-per-node name, but you get my point.

Internally, all we need to do is programmatically create, for each system partition, a test with a different name, deriving from the original one, that is parameterized on the node list of the corresponding partition. Those tests would be generated only if the current system is listed in the valid_systems of the original test.
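The generation step could be sketched with plain Python class machinery. Everything below is hypothetical and outside ReFrame (in the framework, `node_param` would be a `parameter(...)` and partitions would come from the configuration); it only illustrates deriving one uniquely named test per matching partition:

```python
# Hypothetical sketch, not ReFrame internals: derive one test class per
# partition of the current system, each restricted to that partition
# and carrying that partition's node list.
class OriginalTest:
    valid_systems = ['A:part1', 'A:part2']

def generate_per_partition_tests(base, current_system_partitions):
    """current_system_partitions: mapping 'sys:part' -> list of node names."""
    generated = {}
    for part, nodes in current_system_partitions.items():
        if part not in base.valid_systems:
            continue  # skip partitions the original test does not target
        name = f'{base.__name__}_{part.replace(":", "_")}'
        generated[name] = type(name, (base,), {
            'valid_systems': [part],
            'node_param': list(nodes),  # would be parameter(nodes) in ReFrame
        })
    return generated

tests = generate_per_partition_tests(
    OriginalTest,
    {'A:part1': ['node1', 'node2'], 'A:part3': ['nodeX']},
)
# Only A:part1 matches valid_systems, so a single derived test is generated.
```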
