
Workflow: Automatic parsing of outputs #717

Closed · samwaseda opened this issue Jun 16, 2023 · 7 comments · Fixed by #770

Labels: enhancement (New feature or request), .workflow (Pertaining to the workflow submodule)

Comments

samwaseda (Member) commented Jun 16, 2023

I saw this PR and I have to admit that it is not only cumbersome to have to write all the output names, but also potentially quite dangerous if the user has to check the consistency of the outputs by hand. The suggestion below is not immediately applicable in that PR, but I think it would allow for a more systematic recognition of output variables:

import ast
import inspect
import re


def remove_spaces_until_character(string):
    # Collapse any run of consecutive whitespace into a single character
    pattern = r'\s+(?=\s)'
    modified_string = re.sub(pattern, '', string)
    return modified_string


class ParseOutput:
    def __init__(self, function):
        self._func = function
        self._source = None
        
    @property
    def func(self):
        return self._func
    
    @property
    def node_return(self):
        # First `return` statement found in the function's AST
        tree = ast.parse(inspect.getsource(self.func))
        for node in ast.walk(tree):
            if isinstance(node, ast.Return):
                return node

    @property
    def source(self):
        # Source lines of the function with trailing newlines stripped
        if self._source is None:
            self._source = [l.rsplit("\n", 1)[0] for l in inspect.getsourcelines(self.func)[0]]
        return self._source
    
    def get_string(self, node):
        # Reassemble the source text of `node`, which may span multiple lines
        string = ""
        for ll in range(node.lineno - 1, node.end_lineno):
            if ll == node.lineno - 1 == node.end_lineno - 1:
                # Node starts and ends on the same line
                string += remove_spaces_until_character(self.source[ll][node.col_offset:node.end_col_offset])
            elif ll == node.lineno - 1:
                # First line of a multi-line node
                string += remove_spaces_until_character(self.source[ll][node.col_offset:])
            elif ll == node.end_lineno - 1:
                # Last line of a multi-line node
                string += remove_spaces_until_character(self.source[ll][:node.end_col_offset])
            else:
                string += remove_spaces_until_character(self.source[ll])
        return string
    
    @property
    def output(self):
        if self.node_return is None:
            return
        if isinstance(self.node_return.value, ast.Tuple):
            # One label per element of the returned tuple (`elts`, not the deprecated `dims` alias)
            return [self.get_string(s) for s in self.node_return.value.elts]
        return [self.get_string(self.node_return.value)]

What it does:

def identity(x):
    return x

print(ParseOutput(identity).output)

def add(x, y):
    return x + y

print(ParseOutput(add).output)

def add_and_subtract(x, y):
    return x + y, x - y

print(ParseOutput(add_and_subtract).output)

def md(job):
    temperature = job.output.temperature
    energy = job.output.energy
    return temperature, energy

print(ParseOutput(md).output)

import numpy as np

def function_return(i, j):
    return (
        np.arange(
            i, dtype=int
        ),
        np.shape(i, j)
    )

print(ParseOutput(function_return).output)

Output:

['x']
['x + y']
['x + y', 'x - y']
['temperature', 'energy']
['np.arange( i, dtype=int )', 'np.shape(i, j)']

What do you think? @liamhuber

samwaseda added the enhancement (New feature or request) label on Jun 16, 2023
liamhuber (Member) commented:
@samwaseda that is neat! I have two concerns though:

The first is simply technical: if we want to decorate func with @node (or similar), and we want to pass the output labels to the decorator, and we want to get the output labels using ParseOutput on the function, then I guess we run into a cyclic constraint where we can't get started because the function is not yet defined, e.g.

@node(ParseOutput(add))
def add(x, y):
    return x + y

The second is more philosophical: I think the code here is a nice treatment of the symptom, but I think it would be preferable to treat the underlying disease, where the data we pass around is just these big tuples of very granular items. The alternative would be packaging up the output in bundles that can exploit composition. Then we'd use something similar to multiple dispatch (cf. a bit of discussion here) to allow passing that entire collection of data as input to a channel that needs just part of the data, and let the channel define how to unpack and extract what it needs.

E.g. suppose we had an MDOutput data class with positions, velocities, forces, energy_pot, etc. that was (a component of what is) output by MD calculations (regardless of engine). Then we could imagine some sort of HeatCapacityByFluctuation node with temperatures and energies inputs. It would accept (different) numpy arrays for each of these, but we could also imagine sending (the same!) MDOutput object to both input channels, and the node definition internally does a type check, i.e. if isinstance(temperatures, MDOutput): temperatures = temperatures.temperature or similar. This way we can hopefully avoid these heinous, giant lists of output to begin with.
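
Roughly, as an illustrative sketch (nothing here is existing pyiron API; the class, the field names, and the plain function standing in for a node are all made up):

from dataclasses import dataclass
import numpy as np


@dataclass
class MDOutput:
    # Hypothetical composite output bundle produced by any MD engine
    positions: np.ndarray
    velocities: np.ndarray
    forces: np.ndarray
    energy_pot: np.ndarray
    temperature: np.ndarray


def heat_capacity_by_fluctuation(temperatures, energies):
    # Each input accepts either a plain array or a whole MDOutput bundle,
    # and the function unpacks what it needs itself
    if isinstance(temperatures, MDOutput):
        temperatures = temperatures.temperature
    if isinstance(energies, MDOutput):
        energies = energies.energy_pot
    k_B = 8.617333262e-5  # eV/K
    return np.var(energies) / (k_B * np.mean(temperatures) ** 2)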

Of course, the second objection doesn't preclude having the sort of helper tool you suggest! Even if we go the route of composable data packages, such a tool could be useful, it just makes it less urgent of a problem.

samwaseda (Member, Author) commented:
> @samwaseda that is neat! I have two concerns though:
>
> The first is simply technical: if we want to decorate func with @node (or similar), and we want to pass the output labels to the decorator, and we want to get the output labels using ParseOutput on the function, then I guess we run into a cyclic constraint where we can't get started because the function is not yet defined, e.g.
>
> @node(ParseOutput(add))
> def add(x, y):
>     return x + y

Actually I would then leave the output name empty, i.e.

@node
def add(x, y):
    return x + y

> The second is more philosophical: I think the code here is a nice treatment of the symptom, but I think it would be preferable to treat the underlying disease, where the data we pass around is just these big tuples of very granular items. The alternative would be packaging up the output in bundles that can exploit composition. Then we'd use something similar to multiple dispatch (cf. a bit of discussion here) to allow passing that entire collection of data as input to a channel that needs just part of the data, and let the channel define how to unpack and extract what it needs.
>
> E.g. suppose we had an MDOutput data class with positions, velocities, forces, energy_pot, etc. that was (a component of what is) output by MD calculations (regardless of engine). Then we could imagine some sort of HeatCapacityByFluctuation node with temperatures and energies inputs. It would accept (different) numpy arrays for each of these, but we could also imagine sending (the same!) MDOutput object to both input channels, and the node definition internally does a type check, i.e. if isinstance(temperatures, MDOutput): temperatures = temperatures.temperature or similar. This way we can hopefully avoid these heinous, giant lists of output to begin with.

OK, I'm not 100% sure I understand it correctly, but isn't the problem the ontological typing, and not so much what the variable is called? One way or the other, ontological typing will be required at some point, but in a very different way, i.e.:

def add(x: float, y: float) -> float:
    return x + y

Here the type hint says what the type is, and from what I understood this is going to be extended to ontological typing, which will make sure that the output of a node can be correctly attached to the input of a different node. The problem I'm trying to solve here is that in ironflow you need a tag for each output value separately, which essentially forces us to write pretty much the same thing twice (once inside @node and once after return).
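
For illustration, in the same pseudo-notation as the snippets above (@node is just a stand-in decorator here, not a defined object):

# Status quo: the labels in the decorator repeat the names after `return`
@node("temperature", "energy")
def md(job):
    temperature = job.output.temperature
    energy = job.output.energy
    return temperature, energy

# Proposal: omit the labels and scrape them from the return statement instead
@node
def md(job):
    temperature = job.output.temperature
    energy = job.output.energy
    return temperature, energy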

samwaseda (Member, Author) commented:
Note: I'm not trying to make it impossible to name the output values via @node; I'd just like to make it optional.

liamhuber (Member) commented:
> Actually I would then leave the output name empty, i.e.

Ah, I'm daft. You mean to stick it right inside the node class!

> Note: I'm not trying to make it impossible to name the output values via @node; I'd just like to make it optional.

Yeah, OK, I can see what you're getting at now. The outstanding technical problem is that this still allows characters in the labels that are not compliant with dot access. Maybe that's fine: if you choose to return x + y instead of sum = x + y; return sum, then you just don't get dot access. I'd need to dig into the implementation to see whether it has side effects, but it should be possible. I like it.
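
To spell the dot-access point out with plain Python (the dict below is just a stand-in for the real output-channel container):

# Scraped labels like "x + y" are not valid identifiers, so they can only be
# reached by string key, never by attribute access.
scraped = ["x + y", "energy"]
print([label.isidentifier() for label in scraped])  # [False, True]

outputs = {"x + y": 3.0, "energy": -1.5}
print(outputs["x + y"])  # bracket/string access always works
# outputs.energy-style dot access is only possible for identifier-like labels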

liamhuber (Member) commented:
> OK, I'm not 100% sure I understand it correctly, but isn't the problem the ontological typing, and not so much what the variable is called?
>
> Here the type hint says what the type is, and from what I understood this is going to be extended to ontological typing.

I'm not sure if we will directly override Python's typing with ontology information or if it will be provided in a supplementary way. I lean towards the latter.

Making the data containers and ontological typing play together is a separate issue though, and the channel labelling stuff you propose here should work independently of that and without interference (at least as far as I can see right now).

liamhuber added and then removed the format_black (Invoke a black formatting commit) label on Jun 19, 2023
liamhuber added the .workflow (Pertaining to the workflow submodule) label on Jun 28, 2023
liamhuber (Member) commented:
I'm coming back to implement this now, and I need to write down my thoughts anyhow, so let's do it here.

Conditions:

  • Working with a (child of) Function(Node)...
    • Either: instantiating one directly, e.g. Function(my_callable, ...)
    • Or: defining a new node class with a decorator, e.g. @Workflow.wrap_as.function_node(); def my_callable(...)

Behaviour:

  • *args are now completely unrelated to output channel labelling; labels are instead provided by a new kwarg output_labels: Optional[str | list | tuple] = None
  • When left empty, output labels are scraped by inspecting the wrapped function's return value
    • Scraped strings may not be dot-access compliant, that is ok
      • It becomes "best practice" to give good names to everything you return
    • Scraped strings should always be square-bracket+string-access compliant
    • If the function has no return value or always returns None, then the scraped values should be an empty list/tuple
  • When specified
    • If a string or length-1 tuple/list
      • If the length of the scraped values is 0, raise an error
      • Else use it to create a single output channel, regardless of whether there are multiple return values
    • Else if length > 1
      • If the length doesn't match the length of the scraped labels, raise an error
      • Else use the provided labels instead of the scraped labels

Questions:

  • How to handle functions with multiple return branches? I am realizing now that functions that branch and return a different number of values in different branches are already incompatible with our scheme.
    • I'm OK constraining this to functions that have exactly one return statement at the very end, but we'll need to do some inspection to make sure an error gets raised when this is violated (a rough sketch of such a check follows below)
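
A rough sketch of such a check, reusing the ParseOutput idea from the top of this issue (function names are placeholders, not the eventual implementation):

import ast
import inspect


def count_return_statements(func):
    # Count explicit `return` statements in the wrapped function's AST
    tree = ast.parse(inspect.getsource(func))
    return sum(isinstance(node, ast.Return) for node in ast.walk(tree))


def scrape_output_labels_or_raise(func):
    # Refuse to scrape when the function has more than one return branch
    if count_return_statements(func) > 1:
        raise ValueError(
            f"{func.__name__} has multiple return branches; "
            "please provide output_labels explicitly"
        )
    return ParseOutput(func).output or []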

liamhuber pushed a commit that referenced this issue Jul 13, 2023
liamhuber (Member) commented:
#770 makes some changes to the spec outlined above. Namely, if output_labels are provided, they are not compared with the scraped labels; they simply supersede them. One advantage of this is that we don't need to handle automatic scraping from functions with multiple return expressions -- in such an edge case we can just leave it to the user to provide appropriate label(s). (Of course we might eventually want to handle this edge case, but at least it's not a necessity. And, of course, a function which optionally returns very different things under different conditions (e.g. a different number of objects) will always be dangerous and difficult to handle in a graph-based setting, so it is not best practice to use these anyway.)
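
In other words, roughly (names are illustrative, not the exact #770 implementation):

def resolve_output_labels(func, output_labels=None):
    # Explicit labels win outright; scraping the return statement is only a fallback
    if output_labels is None:
        return ParseOutput(func).output or []
    if isinstance(output_labels, str):
        return [output_labels]
    return list(output_labels)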
