On steps, stages and records - Launching the composite stage #2162

markusdregi · 2021-02-11T21:41:30Z

The goal of this issue is to discuss configuration (and maybe implementation) of abstract stages and record passing.

Context
Let me start by highlighting some assumptions and frames for this discussion. My hope is that spelling them out will make them easier to challenge, rather than set them in stone!

A record is a data element that can be passed through the (forward model) pipeline of ERT. Currently it allows for data that can be represented as JSON. But one could imagine a richer set of data formats in the future!

A step is an atomic unit in the pipeline for which one can reason about the data flow (records) going in and out. Hence, in addition to the description of the execution of a step (which is step type dependent) a step will both list the records it requires as input and the records it produces as a result.

A unix step is a step for which all data is persisted in files and the logic is executed as a shell script on top of these files. Each record (both input and output) is coupled with a file on a (local) disk, such that from the script's perspective the input is present on disk and its sole purpose is to produce the output data on disk in the correct files. As unix steps are currently supported by ert3, a concrete configuration example can be given:

-
  name: evaluate_polynomial
  type: unix
  input:
    -
      record: coefficients
      location: coefficients.json

  output:
    -
      record: polynomial_output
      location: output.json

  transportable_commands:
    -
      name: poly
      location: poly.py

  script:
    - poly --coefficients coefficients.json --output output.json
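
For illustration, a poly.py along these lines would fit the configuration above. This is only a sketch: the actual script is not shown in this thread, and the evaluation points (0 through 9) as well as the names `evaluate` and `main` are assumptions made for the example.

```python
"""Hypothetical poly.py: evaluate a second degree polynomial from JSON files."""
import argparse
import json


def evaluate(coefficients, xs=range(10)):
    # Assumption: the polynomial is evaluated at x = 0, 1, ..., 9.
    a, b, c = coefficients["a"], coefficients["b"], coefficients["c"]
    return [a * x ** 2 + b * x + c for x in xs]


def main(argv=None):
    parser = argparse.ArgumentParser(description="Evaluate a polynomial")
    parser.add_argument("--coefficients", required=True, help="input JSON file")
    parser.add_argument("--output", required=True, help="output JSON file")
    args = parser.parse_args(argv)

    # Input record: read from the file it is coupled with in the step config.
    with open(args.coefficients) as f:
        coefficients = json.load(f)

    # Output record: written to the file named in the step config.
    with open(args.output, "w") as f:
        json.dump(evaluate(coefficients), f)
```

To use this as the command line tool from the `script` section, the file would end with a call to `main()` under the usual `if __name__ == "__main__":` guard.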

A function step is a step that consists of a single Python function. Here, too, the input and output records are defined up front. Notice that a function step is meant to be free of side effects, and hence touching disk is undefined behaviour. A suggested configuration would be:

-
  name: function_polynomial
  type: function
  input:
    -
      record: coefficients
      location: coeffs

  output:
    -
      record: polynomial_output
      location: output

  function: function_steps.functions:polynomial

where

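import numpy as np

def polynomial(coeffs, x):
    # Evaluate a * x**2 + b * x + c at every point in x.
    x = np.array(x)
    a, b, c = coeffs["a"], coeffs["b"], coeffs["c"]
    # tolist() yields plain Python numbers, so the result stays JSON-representable.
    return {"output": (a * x ** 2 + b * x + c).tolist()}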
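
To make the suggestion a bit more concrete, here is a rough sketch of how a function step could be executed purely in memory. It assumes (this is my reading of the example, not something that is decided) that the `function` entry has the form `package.module:name`, that an input's `location` is the keyword argument name, and that an output's `location` is a key in the returned dict. The names `resolve_function` and `run_function_step` are made up for the sketch.

```python
import importlib


def resolve_function(reference):
    """Resolve a "package.module:name" reference to a Python callable."""
    module_name, _, attribute = reference.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


def run_function_step(step, records):
    """Run a function step on in-memory records, without touching disk."""
    function = resolve_function(step["function"])
    # Assumption: each input's `location` names a keyword argument.
    kwargs = {i["location"]: records[i["record"]] for i in step["input"]}
    result = function(**kwargs)
    # Assumption: each output's `location` is a key in the returned dict.
    return {o["record"]: result[o["location"]] for o in step["output"]}
```

Here `step` is the parsed configuration of a single function step and `records` maps record names to their (JSON-like) values.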

The stage
A stage allows for up-front reasoning about input and output records. As such, all steps are stages. But in addition to the steps, we want to introduce a stage entity that describes the data flow between stages. Such a stage is called a composite stage (earlier drafts of this post called it a flow stage; abstract stage and data stage were also considered). Notice that a composite stage is indeed a stage and will hence again have to be explicit about its input and output records. But in addition we would like it to be able to pass output records of one stage as input records to another stage.

Consider two composite stages f and g, where f takes the input records u and v and produces w, and g takes x as input and produces y and z. That is, f : u, v -> w and g : x -> y, z in some notation. I then want to make a composite stage h that takes m and n as input and produces p and q by first evaluating my_w = f(u=m, v=n), followed by my_y, my_z = g(x=my_w), and returning p=my_y and q=my_z.
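
As a rough sketch of the data flow in h, the wiring can be written down as plain data and evaluated by a small loop. Everything here is illustrative: the bodies of `f` and `g` are placeholders, and the `bind`/`produce` layout and the `run_composite` helper are assumptions, not a proposed configuration format.

```python
def f(u, v):
    # Placeholder for stage f : u, v -> w
    return {"w": u + v}


def g(x):
    # Placeholder for stage g : x -> y, z
    return {"y": x * 2, "z": x - 1}


# h : m, n -> p, q, described as data rather than code.
composite_h = {
    "input": ["m", "n"],
    "calls": [
        {"stage": "f", "bind": {"u": "m", "v": "n"}, "produce": {"w": "my_w"}},
        {"stage": "g", "bind": {"x": "my_w"}, "produce": {"y": "my_y", "z": "my_z"}},
    ],
    "output": {"p": "my_y", "q": "my_z"},
}


def run_composite(composite, stages, **inputs):
    # The environment holds every record known so far, keyed by local name.
    env = {name: inputs[name] for name in composite["input"]}
    for call in composite["calls"]:
        kwargs = {param: env[name] for param, name in call["bind"].items()}
        result = stages[call["stage"]](**kwargs)
        for record, name in call["produce"].items():
            env[name] = result[record]
    return {record: env[name] for record, name in composite["output"].items()}


records = run_composite(composite_h, {"f": f, "g": g}, m=2, n=3)
```

The point of the explicit `input` and `output` entries is that h can be reasoned about up front exactly like any other stage, without looking at its calls.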

Questions

Do we agree on the assumptions?
Do we agree upon the stage definition?
If so, how do we want to configure a composite stage?

Since Python is our default language, I would suggest that configuring the composite stage should feel a bit like writing a list of Python statements. Perhaps we could even more or less copy the Prefect syntax?
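
One possible reading of "a list of Python statements", sketched with the standard library `ast` module: each statement is an assignment of a stage call with keyword arguments, and is parsed into a small description of which records go in and out. The restriction to that statement shape and the returned dict layout are assumptions for this sketch, and it does not mirror any Prefect API.

```python
import ast


def parse_statement(statement):
    """Parse e.g. "my_y, my_z = g(x=my_w)" into a stage call description."""
    node = ast.parse(statement).body[0]
    if not (isinstance(node, ast.Assign) and isinstance(node.value, ast.Call)):
        raise ValueError(f"Expected '<names> = <stage>(...)', got: {statement}")
    target = node.targets[0]
    # A single name or a tuple of names on the left-hand side.
    names = target.elts if isinstance(target, ast.Tuple) else [target]
    return {
        "stage": node.value.func.id,
        "bind": {kw.arg: kw.value.id for kw in node.value.keywords},
        "produce": [name.id for name in names],
    }


calls = [
    parse_statement("my_w = f(u=m, v=n)"),
    parse_statement("my_y, my_z = g(x=my_w)"),
]
```

Parsing instead of executing the statements keeps the configuration declarative, so the data flow can be validated before anything is run.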

Update
It seems like there is agreement on the assumptions. We should discuss the composite stage and how it is to be configured. Notice that the ensemble evaluator already supports a multi-stage setup, but bringing this feature to the users in a natural fashion is no small feat.

sondreso · 2021-02-12T13:07:35Z

sondreso
Feb 12, 2021
Maintainer

> Do we agree on the assumptions?

I agree!

> Do we agree upon the stage definition?

Yes, but I think we should rename the flow stage to composite stage, to make it clear that it's something that consists of multiple parts!

> If so, how do we want to configure a flow stage?

I think this is a difficult question. In particular, we need a language capable of distinguishing between the specific "class" of a stage and instances of that class (for example, when you want to run two instances of the same stage with different input data). In other words, we need to be able to tell apart p and q from the two different instances of g, e.g. g_1 and g_2.
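
A minimal sketch of the class/instance split, assuming output records are disambiguated by prefixing them with an instance label. The `Stage` and `StageInstance` names, the `bindings` field and the `label.record` naming scheme are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """The "class" of a stage: its name and record signature."""
    name: str
    inputs: tuple
    outputs: tuple


@dataclass
class StageInstance:
    """One use of a stage, with its own label and input bindings."""
    label: str
    stage: Stage
    bindings: dict = field(default_factory=dict)

    def qualified_outputs(self):
        # Prefix each output with the instance label so that the same
        # output record from two instances can be told apart.
        return {out: f"{self.label}.{out}" for out in self.stage.outputs}


g = Stage(name="g", inputs=("x",), outputs=("y", "z"))
g_1 = StageInstance(label="g_1", stage=g, bindings={"x": "w_1"})
g_2 = StageInstance(label="g_2", stage=g, bindings={"x": "w_2"})
```

With this, `g_1.qualified_outputs()` and `g_2.qualified_outputs()` name disjoint records even though both instances share the signature of g.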

markusdregi · 2021-02-12T13:57:20Z

markusdregi
Feb 12, 2021
Author

> I think this is a difficult question.

Which is why I suggested disconnecting it from the function step PR and took the time to write the above background ;)

I think composite stage is an excellent suggestion! Will update the issue :)
