You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently the data pipeline DAG is defined fixed on compilation and supports only a small option of dynamics e.g. the task ParallelReadFile supports to read files (the number of files are unknown on compilation time).
I would like to have similar dynamics in other areas as well:
Dynamic nodes
The following dynamic nodes could be implemented:
Dynamic tasks
A option to give the Task a python function which is executed on pipeline runtime and returns a list of commands to execute in order.
Dynamic parallel tasks
A option to give the ParallelTask a python function which is executed on pipeline runtime and returns a list of commands / command chains to be executed in parallel.
Dynamic pipeline
A option to define a DynamicPipeline where the nodes are defined within a python function which is executed on pipeline runtime.
Implement UI awareness
The dynamic node objects (Task/ParallelTask/Pipeline) must be defined so that the python function which defines the actual commands/tasks/nodes is not run when interacting with the UI.
Implement node cost handling
These dynamic nodes should be defined so that they define sub-nodes for the dynamic node object. The pipeline execution should then intelligently retract the node cost from the database when the node had been executed in the past. E.g. a dynamic node could represent a export of a database table. By defining the sub-nodes, the pipeline execution can intelligently run the nodes with the highest node cost first to save up execution time.
Example use cases
performing actions against tables on a database (e.g. export table to datalake). We don't know on time of compilation what tables exist in the database
performing actions against a data lake / lakehouse per table on disk (e.g. connecting the table to our database engine). We don't know on time of compilation what tables exist on the data lake / lakehouse.
The text was updated successfully, but these errors were encountered:
Currently the data pipeline DAG is defined fixed on compilation and supports only a small option of dynamics e.g. the task
ParallelReadFile
supports to read files (the number of files are unknown on compilation time).I would like to have similar dynamics in other areas as well:
Dynamic nodes
The following dynamic nodes could be implemented:
Dynamic tasks
A option to give the
Task
a python function which is executed on pipeline runtime and returns a list of commands to execute in order.Dynamic parallel tasks
A option to give the
ParallelTask
a python function which is executed on pipeline runtime and returns a list of commands / command chains to be executed in parallel.Dynamic pipeline
A option to define a
DynamicPipeline
where the nodes are defined within a python function which is executed on pipeline runtime.Implement UI awareness
The dynamic node objects (Task/ParallelTask/Pipeline) must be defined so that the python function which defines the actual commands/tasks/nodes is not run when interacting with the UI.
Implement node cost handling
These dynamic nodes should be defined so that they define sub-nodes for the dynamic node object. The pipeline execution should then intelligently retract the node cost from the database when the node had been executed in the past. E.g. a dynamic node could represent a export of a database table. By defining the sub-nodes, the pipeline execution can intelligently run the nodes with the highest node cost first to save up execution time.
Example use cases
The text was updated successfully, but these errors were encountered: