On steps, stages and records - Launching the composite stage #2162
Replies: 2 comments
-
I agree!
Yes, but I think we should rename the
I think this is a difficult question. In particular we need a language capable of separating between the specific "class" of a stage, and instances of that class. (For example, in the case that you want to run two instances of the same stage with different input data). In other words we need to be able to separate between |
-
Which is why I suggested to disconnect it from the function step PR and took the time to write the above background ;) I think |
-
The goal of this issue is to discuss configuration (and maybe implementation) of abstract stages and record passing.
Context
Let me start by highlighting some assumptions and frames for this discussion. My hope is that this will make them easier to challenge, not to write them in stone!
A record is a data element that can be passed through the (forward model) pipeline of ERT. Currently it allows for data that can be represented as JSON, but one could imagine a richer set of data formats in the future!
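To make the "can be represented as JSON" criterion concrete, here is a minimal sketch. The helper name `is_record` is hypothetical and not part of ERT; the strict handling of NaN is also an assumption on my side.

```python
import json
from typing import Any


def is_record(value: Any) -> bool:
    """Return True if `value` can be represented as JSON.

    Hypothetical helper for illustration only, not an ERT function.
    """
    try:
        # allow_nan=False rejects NaN/Infinity, which strict JSON
        # cannot express (an assumption about what a record allows).
        json.dumps(value, allow_nan=False)
    except (TypeError, ValueError):
        return False
    return True


coefficients = {"a": 1.0, "b": 2.5, "c": [0, 1, 2]}
not_a_record = {1, 2, 3}  # a Python set has no JSON representation
```

Under this sketch `is_record(coefficients)` holds while `is_record(not_a_record)` does not.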
A step is an atomic unit in the pipeline for which one can reason about the data flow (records) going in and out. Hence, in addition to the description of the execution of a step (which is step type dependent) a step will both list the records it requires as input and the records it produces as a result.
A unix step is a step for which all data is persisted in files and the logic is executed as a shell script on top of these files. Each record (both input and output) is coupled with files on a (local) disk, such that from the script's perspective the input is present on disk and its sole purpose is to produce the output data on disk in the correct files. As unix steps are currently supported by `ert3`, a concrete configuration example can be given:
A function step is a step that consists of a single Python function. Here too, the input and output records are defined up front. Notice that the function step is meant to be without side effects, and hence touching disk is undefined behaviour. A suggestion for configuration would be:
where
The stage
A stage allows for up-front reasoning about input and output records. As such, all steps are stages.
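The relationship between stages and steps described so far could be sketched roughly as follows. All class and field names (`Stage`, `FunctionStep`, `inputs`, `outputs`, `run`) are hypothetical and chosen for illustration; this is not how ert3 is implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

Record = Any  # assumed to be JSON-representable, per the definition above


@dataclass(frozen=True)
class Stage:
    """Hypothetical base: a stage names its input and output records up front."""

    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


@dataclass(frozen=True)
class FunctionStep(Stage):
    """Hypothetical function step: a single, side-effect-free Python function."""

    function: Callable[..., Dict[str, Record]]

    def run(self, **records: Record) -> Dict[str, Record]:
        # Check the declared inputs before executing anything.
        missing = set(self.inputs) - set(records)
        if missing:
            raise ValueError(f"{self.name}: missing input records {sorted(missing)}")
        result = self.function(**{key: records[key] for key in self.inputs})
        # Check that the function produced exactly the declared outputs.
        if set(result) != set(self.outputs):
            raise ValueError(f"{self.name}: expected output records {self.outputs}")
        return result


def polynomial(coefficients, x):
    a, b, c = coefficients
    return {"y": a * x**2 + b * x + c}


evaluate = FunctionStep("evaluate", ("coefficients", "x"), ("y",), polynomial)
result = evaluate.run(coefficients=[1, 2, 3], x=2)
```

Because `FunctionStep` only adds an execution description on top of `Stage`, anything that reasons about data flow can work with the `inputs` and `outputs` alone.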
But in addition to the steps, we want to introduce a stage entity that describes the data flow between stages. I did not have a good naming suggestion at first (`abstract stage`, `flow stage`, `data stage` or something similar were considered, with `flow stage` as the working name). Such a stage is called a composite stage. Notice that a `composite stage` is indeed a `stage` and will hence again have to be explicit about its input and output records. But in addition we would like it to be able to pass output records of one stage as input records to another stage.
Consider two composite stages `f` and `g`, where `f` takes the input records `u` and `v` and produces `w`, and `g` takes `x` as input and produces `y` and `z`. That is, `f : u, v -> w` and `g : x -> y, z` for some notation. I then want to make a composite stage `h` that takes `m` and `n` as input and produces `p` and `q` by first evaluating `my_w = f(u=m, v=n)`, followed by `my_y, my_z = g(x=my_w)`, and returning `p=my_y` and `q=my_z`.
Questions
Since Python is our default language I would suggest that the flow stage feels a bit like a list of Python statements. Perhaps we could even more or less copy the Prefect syntax?
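As a rough illustration of what "a list of Python statements" could mean for the `h` example above, here is a minimal sketch in plain Python. It does not use Prefect, and every name in it (`run_composite`, `Statement`, the bodies of `f` and `g`) is hypothetical; `f` and `g` are stood in for by ordinary functions that return their output records by name.

```python
from typing import Any, Callable, Dict, List, Tuple

Records = Dict[str, Any]

# One "statement": (stage, input bindings, output bindings).
# Input bindings map a stage input name to a name in the composite's scope;
# output bindings map a stage output name to a name in the composite's scope.
Statement = Tuple[Callable[..., Records], Dict[str, str], Dict[str, str]]


def run_composite(
    statements: List[Statement], outputs: Dict[str, str], records: Records
) -> Records:
    """Evaluate the statements in order, passing records between stages."""
    scope = dict(records)
    for stage, bind_in, bind_out in statements:
        result = stage(**{arg: scope[name] for arg, name in bind_in.items()})
        for out, name in bind_out.items():
            scope[name] = result[out]
    # Expose only the declared output records of the composite stage.
    return {public: scope[internal] for public, internal in outputs.items()}


def f(u, v):  # stand-in for f : u, v -> w
    return {"w": u + v}


def g(x):  # stand-in for g : x -> y, z
    return {"y": x * 2, "z": x - 1}


h_statements: List[Statement] = [
    (f, {"u": "m", "v": "n"}, {"w": "my_w"}),  # my_w = f(u=m, v=n)
    (g, {"x": "my_w"}, {"y": "my_y", "z": "my_z"}),  # my_y, my_z = g(x=my_w)
]


def h(m, n):  # h : m, n -> p, q
    return run_composite(h_statements, {"p": "my_y", "q": "my_z"}, {"m": m, "n": n})
```

The statement list is plain data, so the same information could equally come from a configuration file; whether that or a Python-like syntax reads better is exactly the open question here.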
Update
It seems there is agreement on the assumptions. We should discuss the composite stage and how it is to be configured. Notice that the ensemble evaluator already supports a multi-stage setup, but bringing this feature to the users in a natural fashion is no small feat.