
Workflow: Automatic parsing of outputs #717

Closed · samwaseda opened this issue Jun 16, 2023 · 7 comments · Fixed by #770

Labels: enhancement (New feature or request), .workflow (Pertaining to the workflow submodule)

Comments

samwaseda (Member) commented Jun 16, 2023

I saw this PR and I have to admit that it is not only cumbersome to have to write all the output names, but also potentially quite dangerous if the user has to check the consistency of the outputs by hand. The suggestion below is not immediately applicable in that PR, but I think it would allow for a more systematic recognition of output variables:

import ast
import inspect
import re


def remove_spaces_until_character(string):
    # Collapse any run of consecutive whitespace into a single character
    pattern = r'\s+(?=\s)'
    modified_string = re.sub(pattern, '', string)
    return modified_string


class ParseOutput:
    def __init__(self, function):
        self._func = function
        self._source = None
        
    @property
    def func(self):
        return self._func
    
    @property
    def node_return(self):
        # First `return` statement found in the function's AST
        tree = ast.parse(inspect.getsource(self.func))
        for node in ast.walk(tree):
            if isinstance(node, ast.Return):
                return node

    @property
    def source(self):
        # Source lines of the function with trailing newlines stripped
        if self._source is None:
            self._source = [l.rsplit("\n", 1)[0] for l in inspect.getsourcelines(self.func)[0]]
        return self._source
    
    def get_string(self, node):
        # Reassemble the source text of `node`, which may span multiple lines
        string = ""
        for ll in range(node.lineno - 1, node.end_lineno):
            if ll == node.lineno - 1 == node.end_lineno - 1:
                # Node starts and ends on the same line
                string += remove_spaces_until_character(self.source[ll][node.col_offset:node.end_col_offset])
            elif ll == node.lineno - 1:
                # First line of a multi-line node
                string += remove_spaces_until_character(self.source[ll][node.col_offset:])
            elif ll == node.end_lineno - 1:
                # Last line of a multi-line node
                string += remove_spaces_until_character(self.source[ll][:node.end_col_offset])
            else:
                string += remove_spaces_until_character(self.source[ll])
        return string
    
    @property
    def output(self):
        if self.node_return is None:
            return
        if isinstance(self.node_return.value, ast.Tuple):
            # One label per element of the returned tuple (`elts`, not the deprecated `dims` alias)
            return [self.get_string(s) for s in self.node_return.value.elts]
        return [self.get_string(self.node_return.value)]

What it does:

def identity(x):
    return x

print(ParseOutput(identity).output)

def add(x, y):
    return x + y

print(ParseOutput(add).output)

def add_and_subtract(x, y):
    return x + y, x - y

print(ParseOutput(add_and_subtract).output)

def md(job):
    temperature = job.output.temperature
    energy = job.output.energy
    return temperature, energy

print(ParseOutput(md).output)

import numpy as np

def function_return(i, j):
    return (
        np.arange(
            i, dtype=int
        ),
        np.shape(i, j)
    )

print(ParseOutput(function_return).output)

Output:

['x']
['x + y']
['x + y', 'x - y']
['temperature', 'energy']
['np.arange( i, dtype=int )', 'np.shape(i, j)']

What do you think? @liamhuber

samwaseda added the enhancement (New feature or request) label on Jun 16, 2023
liamhuber (Member) commented:
@samwaseda that is neat! I have two concerns though:

The first is simply technical: if we want to decorate func with @node (or similar), and we want to pass the output labels to the decorator, and we want to get the output labels using ParseOutput on the function, then I guess we run into a cyclic constraint where we can't get started because the function is not yet defined, e.g.

@node(ParseOutput(add))
def add(x, y):
    return x + y

The second is more philosophical: I think the code here is a nice treatment of the symptom, but I think it would be preferable to treat the underlying disease, where the data we pass around is just these big tuples of very granular items. The alternative would be packaging up the output in bundles that can exploit composition. Then we'd use something similar to multiple dispatch (cf. a bit of discussion here) to allow passing that entire collection of data as input to a channel that needs just part of the data, and let the channel define how to unpack and extract what it needs.

E.g. suppose we had an MDOutput data class with positions, velocities, forces, energy_pot, etc. that was (a component of what is) output by MD calculations (regardless of engine). Then we could imagine some sort of HeatCapacityByFluctuation node with temperatures and energies inputs. It would accept (different) numpy arrays for each of these, but we could also imagine sending (the same!) MDOutput object to both input channels, and the node definition internally does a type check, i.e. if isinstance(temperatures, MDOutput): temperatures = temperatures.temperature or similar. This way we can hopefully avoid these heinous, giant lists of output to begin with.
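
Roughly, as an illustrative sketch (nothing here is existing pyiron API; the class, the field names, and the plain function standing in for a node are all made up):

from dataclasses import dataclass
import numpy as np


@dataclass
class MDOutput:
    # Hypothetical composite output bundle produced by any MD engine
    positions: np.ndarray
    velocities: np.ndarray
    forces: np.ndarray
    energy_pot: np.ndarray
    temperature: np.ndarray


def heat_capacity_by_fluctuation(temperatures, energies):
    # Each input accepts either a plain array or a whole MDOutput bundle,
    # and the function unpacks what it needs itself
    if isinstance(temperatures, MDOutput):
        temperatures = temperatures.temperature
    if isinstance(energies, MDOutput):
        energies = energies.energy_pot
    k_B = 8.617333262e-5  # eV/K
    return np.var(energies) / (k_B * np.mean(temperatures) ** 2)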

Of course, the second objection doesn't preclude having the sort of helper tool you suggest! Even if we go the route of composable data packages, such a tool could be useful, it just makes it less urgent of a problem.

samwaseda (Member, Author) commented:
> @samwaseda that is neat! I have two concerns though:
>
> The first is simply technical: if we want to decorate func with @node (or similar), and we want to pass the output labels to the decorator, and we want to get the output labels using ParseOutput on the function, then I guess we run into a cyclic constraint where we can't get started because the function is not yet defined, e.g.
>
> @node(ParseOutput(add))
> def add(x, y):
>     return x + y

Actually I would then leave the output name empty, i.e.

@node
def add(x, y):
    return x + y

> The second is more philosophical: I think the code here is a nice treatment of the symptom, but I think it would be preferable to treat the underlying disease, where the data we pass around is just these big tuples of very granular items. The alternative would be packaging up the output in bundles that can exploit composition. Then we'd use something similar to multiple dispatch (cf. a bit of discussion here) to allow passing that entire collection of data as input to a channel that needs just part of the data, and let the channel define how to unpack and extract what it needs.
>
> E.g. suppose we had an MDOutput data class with positions, velocities, forces, energy_pot, etc. that was (a component of what is) output by MD calculations (regardless of engine). Then we could imagine some sort of HeatCapacityByFluctuation node with temperatures and energies inputs. It would accept (different) numpy arrays for each of these, but we could also imagine sending (the same!) MDOutput object to both input channels, and the node definition internally does a type check, i.e. if isinstance(temperatures, MDOutput): temperatures = temperatures.temperature or similar. This way we can hopefully avoid these heinous, giant lists of output to begin with.

OK, I'm not 100% sure I understand it correctly, but isn't the problem the ontological typing, and not so much what the variable is called? One way or the other, ontological typing will be required at some point, but in a very different way, i.e.:

def add(x: float, y: float) -> float:
    return x + y

Here the type hint says what the type is, and from what I understood this is going to be extended to ontological typing, which will make sure that the output of a node can be correctly attached to the input of a different node. The problem I'm trying to solve here is that in ironflow you need a tag for each output value separately, which essentially forces us to write pretty much the same thing twice (once inside @node and once after return).
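
For illustration, in the same pseudo-notation as the snippets above (@node is just a stand-in decorator here, not a defined object):

# Status quo: the labels in the decorator repeat the names after `return`
@node("temperature", "energy")
def md(job):
    temperature = job.output.temperature
    energy = job.output.energy
    return temperature, energy

# Proposal: omit the labels and scrape them from the return statement instead
@node
def md(job):
    temperature = job.output.temperature
    energy = job.output.energy
    return temperature, energy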

samwaseda (Member, Author) commented:
Note: I'm not trying to make it impossible to name the output values via @node; I'd just like to make it optional.

liamhuber (Member) commented:
> Actually I would then leave the output name empty, i.e.

Ah, I'm daft. You mean to stick it right inside the node class!

> Note: I'm not trying to make it impossible to name the output values via @node; I'd just like to make it optional.

Yeah, OK, I can see what you're getting at now. The outstanding technical problem is that this still allows characters in the labels that are not compliant with dot access. Maybe that's fine: if you choose to return x + y instead of sum = x + y; return sum, then you just don't get dot access. I'd need to dig into the implementation to see whether it has side effects, but it should be possible. I like it.
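
To spell the dot-access point out with plain Python (the dict below is just a stand-in for the real output-channel container):

# Scraped labels like "x + y" are not valid identifiers, so they can only be
# reached by string key, never by attribute access.
scraped = ["x + y", "energy"]
print([label.isidentifier() for label in scraped])  # [False, True]

outputs = {"x + y": 3.0, "energy": -1.5}
print(outputs["x + y"])  # bracket/string access always works
# outputs.energy-style dot access is only possible for identifier-like labels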

liamhuber (Member) commented:
> OK, I'm not 100% sure I understand it correctly, but isn't the problem the ontological typing, and not so much what the variable is called?
>
> Here the type hint says what the type is, and from what I understood this is going to be extended to ontological typing.

I'm not sure if we will directly override Python's typing with ontology information or if it will be provided in a supplementary way. I lean towards the latter.

Making the data containers and ontological typing play together is a separate issue though, and the channel labelling stuff you propose here should work independently of that and without interference (at least as far as I can see right now).

liamhuber added and then removed the format_black (Invoke a black formatting commit) label on Jun 19, 2023
liamhuber added the .workflow (Pertaining to the workflow submodule) label on Jun 28, 2023
liamhuber (Member) commented:
I'm coming back to implement this now, and I need to write down my thoughts anyhow, so let's do it here.

Conditions:

  • Working with a (child of) Function(Node)...
    • Either: instantiating one directly, e.g. Function(my_callable, ...)
    • Or: defining a new node class with a decorator, e.g. @Workflow.wrap_as.function_node(); def my_callable(...)

Behaviour:

  • *args are now completely unrelated to output channel labelling; labels are instead provided by a new kwarg output_labels: Optional[str | list | tuple] = None
  • When left empty, output labels are scraped by inspecting the wrapped function's return value
    • Scraped strings may not be dot-access compliant, that is ok
      • It becomes "best practice" to give good names to everything you return
    • Scraped strings should always be square-bracket+string-access compliant
    • If the function has no return value or always returns None, then the scraped values should be an empty list/tuple
  • When specified
    • If a string or length-1 tuple/list
      • If the length of the scraped values is 0, raise an error
      • Else use it to create a single output channel, regardless of whether there are multiple return values
    • Else if length > 1
      • If the length doesn't match the length of the scraped labels, raise an error
      • Else use the provided labels instead of the scraped labels

Questions:

  • How to handle functions with multiple return branches? I am realizing now that functions that branch and return a different number of values in different branches are already incompatible with our scheme.
    • I'm OK constraining this to functions that have exactly one return statement at the very end, but we'll need to do some inspection to make sure an error gets raised when this is violated (a rough sketch of such a check follows below)
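
A rough sketch of such a check, reusing the ParseOutput idea from the top of this issue (function names are placeholders, not the eventual implementation):

import ast
import inspect


def count_return_statements(func):
    # Count explicit `return` statements in the wrapped function's AST
    tree = ast.parse(inspect.getsource(func))
    return sum(isinstance(node, ast.Return) for node in ast.walk(tree))


def scrape_output_labels_or_raise(func):
    # Refuse to scrape when the function has more than one return branch
    if count_return_statements(func) > 1:
        raise ValueError(
            f"{func.__name__} has multiple return branches; "
            "please provide output_labels explicitly"
        )
    return ParseOutput(func).output or []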

liamhuber pushed a commit that referenced this issue Jul 13, 2023
liamhuber (Member) commented:
#770 makes some changes to the spec outlined above. Namely, if output_labels are provided, they are not compared with the scraped labels; they simply supersede them. One advantage of this is that we don't need to handle automatic scraping from functions with multiple return expressions -- in such an edge case we can just leave it to the user to provide appropriate label(s). (Of course we might eventually want to handle this edge case, but at least it's not a necessity. And, of course, a function which optionally returns very different things under different conditions (e.g. a different number of objects) will always be dangerous and difficult to handle in a graph-based setting, so it is not best practice to use these anyway.)
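
In other words, roughly (names are illustrative, not the exact #770 implementation):

def resolve_output_labels(func, output_labels=None):
    # Explicit labels win outright; scraping the return statement is only a fallback
    if output_labels is None:
        return ParseOutput(func).output or []
    if isinstance(output_labels, str):
        return [output_labels]
    return list(output_labels)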
