Rewrite of Sciline's Pipeline #165

Merged 68 commits on Jun 3, 2024
Commits
593cde5
Prototype: Rewrite from scratch
SimonHeybrock Apr 3, 2024
ab1ec1e
Support other graph in __setitem__
SimonHeybrock Apr 3, 2024
b95e72f
Handle Optional
SimonHeybrock Apr 3, 2024
0082025
Remove implemented plan
SimonHeybrock Apr 3, 2024
92b0bda
Add to do notes
SimonHeybrock Apr 3, 2024
8ecb956
Rename to DataGraph
SimonHeybrock Apr 5, 2024
b882385
Support generic providers with constraints
SimonHeybrock Apr 5, 2024
ce90f72
Support multiple targets in build
SimonHeybrock Apr 5, 2024
1406595
Fix some tests
SimonHeybrock Apr 5, 2024
de6086d
Fix more tests
SimonHeybrock Apr 5, 2024
98e0163
Cleanup
SimonHeybrock Apr 5, 2024
64a7f1d
Cleanup
SimonHeybrock Apr 5, 2024
9c5af48
Minor
SimonHeybrock Apr 5, 2024
d90c93f
Try with Cyclebane
SimonHeybrock Apr 9, 2024
4f80dbd
Add notebook with examples and thoughts about rewrite
SimonHeybrock Apr 9, 2024
06d5cdf
Minor
SimonHeybrock Apr 11, 2024
c7a90ed
Wrap cyclebane.Graph
SimonHeybrock Apr 11, 2024
15f3d36
Cleanup and remove Optional/Union handling
SimonHeybrock Apr 11, 2024
5d632d4
Cleanup
SimonHeybrock Apr 11, 2024
3003b0c
Add functions from untracked file
SimonHeybrock Apr 15, 2024
d541e35
Update open questions
SimonHeybrock Apr 15, 2024
aaaff1b
Cleanup handling of `reduce`
SimonHeybrock Apr 15, 2024
8730a23
Strip old pipeline
SimonHeybrock Apr 15, 2024
aeaaf6c
Reintroduce handler handling
SimonHeybrock Apr 15, 2024
f08c668
Update syntax
SimonHeybrock Apr 15, 2024
697a62d
Revert removal of `visualize`
SimonHeybrock Apr 17, 2024
0dc447e
Fix some smaller issues
SimonHeybrock Apr 17, 2024
0421e47
Simplify
SimonHeybrock Apr 18, 2024
eda0181
Remove Series and ParamTable
SimonHeybrock Apr 24, 2024
797a610
Fix more tests
SimonHeybrock Apr 24, 2024
33c0b98
Move TaskGraph wrapping out of DataGraph
SimonHeybrock Apr 24, 2024
e0f701d
Fix more tests
SimonHeybrock Apr 24, 2024
bb66d72
Update notebook
SimonHeybrock Apr 24, 2024
1f06dac
Update docs
SimonHeybrock Apr 24, 2024
06290ae
Cleanup
SimonHeybrock May 1, 2024
d6c2e2d
Remove tests with ParamTable
SimonHeybrock May 1, 2024
ae2d705
Remove Item and Label, fix tests
SimonHeybrock May 1, 2024
016684b
Fix copy type hints
SimonHeybrock May 1, 2024
353486a
Fix some type hints
SimonHeybrock May 1, 2024
a20b688
Small fixes
SimonHeybrock May 1, 2024
db91a71
Fix or ignore mypy Key issues
SimonHeybrock May 1, 2024
fc57454
Fix mypy and remove deprecated test
SimonHeybrock May 1, 2024
a252690
Remove defunct tests that passed because of insertion order
SimonHeybrock May 2, 2024
d9fc070
Disallow repeated arguments in providers
SimonHeybrock May 2, 2024
4bdff4e
Avoid named intermediates
SimonHeybrock May 2, 2024
da62b26
Mark helpers as private
SimonHeybrock May 2, 2024
40e2ed1
Improve type hints
SimonHeybrock May 2, 2024
cc9d259
Satisfy mypy
SimonHeybrock May 2, 2024
44752a3
Minor docs
SimonHeybrock May 2, 2024
d026062
Fix generic providers docs notebook
SimonHeybrock May 7, 2024
335b9f9
Fix visualize for mapped graphs
SimonHeybrock May 8, 2024
dbf7464
Update param table docs
SimonHeybrock May 8, 2024
e5ebefc
Add cyclebane dep and pass docs
SimonHeybrock May 8, 2024
e745fa6
Mypy
SimonHeybrock May 8, 2024
52c1646
Update design doc
SimonHeybrock May 8, 2024
bcee03a
Merge remote-tracking branch 'origin/main' into v2-prototype
SimonHeybrock May 8, 2024
364a39a
Map also return type
SimonHeybrock May 16, 2024
2faa71b
Docstring
SimonHeybrock May 16, 2024
70fe2f2
Mark method private
SimonHeybrock May 16, 2024
c91dab8
Move __copy__
SimonHeybrock May 16, 2024
cf3f1ca
Remove `build`
SimonHeybrock May 16, 2024
ee9c3f2
Update cyclebane branch
SimonHeybrock May 16, 2024
90ccffd
Fix mypy
SimonHeybrock May 16, 2024
ec0d12b
Update src/sciline/data_graph.py
SimonHeybrock May 21, 2024
bcca62a
Update docs/developer/architecture-and-design/rewrite.ipynb
SimonHeybrock May 21, 2024
afc5c75
Add note on typevars needing constraints
SimonHeybrock May 21, 2024
c412cd1
Give example using dict
SimonHeybrock May 21, 2024
a6872d8
Use released cyclebane
SimonHeybrock May 30, 2024
6 changes: 3 additions & 3 deletions docs/api-reference/index.md
@@ -10,14 +10,15 @@
:template: class-template.rst
:recursive:

ParamTable
Pipeline
Scope
Series
ScopeTwoParams
scheduler.Scheduler
scheduler.DaskScheduler
scheduler.NaiveScheduler
TaskGraph
HandleAsBuildTimeException
HandleAsComputeTimeException
```

## Exceptions
@@ -28,7 +29,6 @@
:template: class-template.rst
:recursive:

AmbiguousProvider
UnboundTypeVar
UnsatisfiedRequirement
```
281 changes: 281 additions & 0 deletions docs/developer/architecture-and-design/rewrite.ipynb
@@ -0,0 +1,281 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Rewrite of Sciline's Pipeline as a Data Graph\n",
"\n",
"## Introduction\n",
"\n",
"There has been a series of issues and discussions about Sciline's `Pipeline` class and its implementation.\n",
"\n",
"- Detect unused parameters [#43](https://github.com/scipp/sciline/issues/43).\n",
"- More helpful error messages when pipeline fails to build or compute? [#74](https://github.com/scipp/sciline/issues/74).\n",
"- Get missing params from a pipeline [#83](https://github.com/scipp/sciline/issues/83).\n",
"- Support for graph operations [#107](https://github.com/scipp/sciline/issues/107).\n",
"- Supporting different file handle types is too difficult [#140](https://github.com/scipp/sciline/issues/140).\n",
"- A new approach for \"parameter tables\" [#141](https://github.com/scipp/sciline/issues/141).\n",
"- Pruning for repeated workflow calls [#148](https://github.com/scipp/sciline/issues/148)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Current implementation\n",
"\n",
"- `sciline.Pipeline` is a box that can be filled with providers (a provider is a callable that can compute a value) as well as values.\n",
"- Providers can provide generic types.\n",
"  The concrete types and values that such providers compute are determined *later*, when the pipeline is built, based on which instances of the generic outputs are requested (by other providers or by the user when building the pipeline).\n",
"- Parameter tables and a special `sciline.Series` type are supported to create task graphs with duplicate branches and \"reduction\" or grouping operations.\n",
"- The pipeline is built by calling `build` on it, which returns a `sciline.TaskGraph`.\n",
" Most of the complexity is handled in this step.\n",
"\n",
"The presence of generic providers as well as parameter tables makes the implementation of the pipeline quite complex.\n",
"This means that internally a pipeline is *not* representable as a graph, for two reasons: (1) generics lead to a task-graph structure that is in principle undefined until the pipeline is built, and (2) parameter tables implicitly duplicate task-graph branches, so if `Pipeline` used a graph representation internally, adding or replacing providers would conflict with the duplicated structure."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Proposal\n",
"\n",
"The key idea of this proposal is to introduce `sciline.DataGraph`, a directed acyclic graph (DAG), which can roughly be thought of as a graph representation of the pipeline.<sup id=\"a1\">[1](#f1)</sup>\n",
"The data graph describes dependencies between data, defined via the type-hints of providers.\n",
"Providers (or values) are stored as node data.\n",
"\n",
"As the support for generic providers was a hindrance in the current implementation, we propose to restrict this to generic return types *with constraints*.\n",
"This means that such a provider defines a *known* set of outputs, and the data graph can thus be updated with multiple nodes, each with the same provider.\n",
"\n",
"The support for parameter tables would be replaced by using `map` and `reduce` operations on the data graph.<sup id=\"a2\">[2](#f2)</sup>\n",
"\n",
"1. <span id=\"f1\">[^](#a1)</span>\n",
" Whether `Pipeline` will be kept as a wrapper around `DataGraph` or whether `DataGraph` will be the main interface is not yet clear.\n",
"2. <span id=\"f2\">[^](#a2)</span>\n",
" This has been prototyped in the `cyclebane` library.\n",
" Whether this would be *integrated into* or *used by* Sciline is not yet clear.\n",
"\n",
"### Note on chosen implementation\n",
"\n",
"The existing `Pipeline` interface has been kept: the new functionality has been added in the `DataGraph` class, which has been made a base class of `Pipeline`.\n",
"`DataGraph` is implemented as a wrapper for `cyclebane.Graph`, a new and generic support library based on NetworkX.\n",
"\n",
"### Example 1: Basic DataGraph"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sciline\n",
"\n",
"\n",
"def f1() -> float:\n",
"    return 1.0\n",
"\n",
"\n",
"def f2(a: float, b: str) -> int:\n",
"    return int(a) + len(b)\n",
"\n",
"\n",
"def f3(a: int) -> list[int]:\n",
"    return list(range(a))\n",
"\n",
"\n",
"data_graph = sciline.Pipeline([f1, f3, f2])\n",
"data_graph.visualize_data_graph(graph_attr={'rankdir': 'LR'})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can add a value for `str` using `__setitem__`, build a `sciline.TaskGraph`, and compute the result:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_graph[str] = 'abcde'\n",
"task_graph = data_graph.get(list[int])\n",
"task_graph.compute()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"task_graph.visualize(graph_attr={'rankdir': 'LR'})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example 2: DataGraph with generic provider"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from typing import TypeVar\n",
"import sciline\n",
"\n",
"T = TypeVar('T', int, float) # The constraints are mandatory now!\n",
"\n",
"\n",
"def make_list(length: T) -> list[T]:\n",
"    return [length, length + length]\n",
"\n",
"\n",
"def make_dict(key: list[int], value: list[float]) -> dict[int, float]:\n",
"    return dict(zip(key, value))\n",
"\n",
"\n",
"data_graph = sciline.Pipeline([make_list, make_dict])\n",
"data_graph.visualize_data_graph(graph_attr={'rankdir': 'LR'})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_graph[int] = 3\n",
"data_graph[float] = 1.2\n",
"data_graph.get(dict[int, float]).compute()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example 3: DataGraph with map and reduce"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sciline\n",
"\n",
"\n",
"def f1(x: float) -> str:\n",
"    return str(x)\n",
"\n",
"\n",
"def f2(x: str) -> int:\n",
"    return len(x)\n",
"\n",
"\n",
"def f3(a: int) -> list[int]:\n",
"    return list(range(a))\n",
"\n",
"\n",
"data_graph = sciline.Pipeline([f1, f2, f3])\n",
"data_graph.visualize_data_graph(graph_attr={'rankdir': 'LR'})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"params = pd.DataFrame({float: [0.1, 1.0, 10.0]})\n",
"params"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def concat_strings(*strings: str) -> str:\n",
"    return '+'.join(strings)\n",
"\n",
"\n",
"data_graph[str] = data_graph[str].map(params).reduce(func=concat_strings)\n",
"data_graph.visualize_data_graph(graph_attr={'rankdir': 'LR'})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tg = data_graph.get(list[int])\n",
"tg.visualize(graph_attr={'rankdir': 'LR'})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tg.compute()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Criticism\n",
"\n",
"The `map` and `reduce` operations depart from the core idea of Sciline.\n",
"They sit somewhere between declarative programming (as in Sciline) and imperative programming (as in Dask).\n",
"The example above may be re-imagined as something along the lines of\n",
"\n",
"```python\n",
"# Assuming with_value returns a copy of the graph with the value set\n",
"branches = map(data_graph[str].with_value, params[float])\n",
"# Not actually `dask.delayed`, but you get the idea\n",
"data_graph[str] = dask.delayed(concat_strings)(branches)\n",
"```\n",
"\n",
"The graph could then be optimized to remove duplicate nodes (part of `data_graph[str]`, but not a descendant of `float`)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "dev310",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
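
The design notebook above proposes restricting generic providers to constrained type variables, so that each provider defines a known set of outputs and the data graph can hold one node per output, all sharing the same provider. The following is a minimal, standard-library-only sketch of that idea, not Sciline's or Cyclebane's actual implementation. The providers are taken from Example 2; the helpers `find_typevars`, `substitute`, `expand`, and `build_data_graph` are hypothetical names introduced here for illustration, and the graph is assumed to be a plain dict mapping each output type to its provider and input types.

```python
from typing import TypeVar, get_args, get_origin, get_type_hints

T = TypeVar('T', int, float)  # constrained, as in Example 2


def make_list(length: T) -> list[T]:
    return [length, length + length]


def make_dict(key: list[int], value: list[float]) -> dict[int, float]:
    return dict(zip(key, value))


def find_typevars(tp):
    """Collect all TypeVars appearing anywhere inside the type `tp`."""
    if isinstance(tp, TypeVar):
        return {tp}
    found = set()
    for arg in get_args(tp):
        found |= find_typevars(arg)
    return found


def substitute(tp, typevar, concrete):
    """Replace `typevar` by `concrete` inside the (possibly generic) type `tp`."""
    if tp is typevar:
        return concrete
    args = get_args(tp)
    if not args:
        return tp
    return get_origin(tp)[tuple(substitute(a, typevar, concrete) for a in args)]


def expand(provider):
    """Yield (output type, input types) for every binding of the constraints."""
    hints = get_type_hints(provider)
    typevars = set()
    for hint in hints.values():
        typevars |= find_typevars(hint)
    bindings = [{}]
    for tv in typevars:
        if not tv.__constraints__:
            # Hypothetical error handling: without constraints the set of
            # outputs is unknown, so no nodes could be created.
            raise ValueError(f'{tv} has no constraints')
        bindings = [{**b, tv: c} for b in bindings for c in tv.__constraints__]
    for binding in bindings:
        concrete = dict(hints)
        for tv, c in binding.items():
            concrete = {k: substitute(v, tv, c) for k, v in concrete.items()}
        output = concrete.pop('return')
        yield output, tuple(concrete.values())


def build_data_graph(providers):
    """Map each concrete output type to (provider, input types)."""
    graph = {}
    for provider in providers:
        for output, inputs in expand(provider):
            graph[output] = (provider, inputs)
    return graph


graph = build_data_graph([make_list, make_dict])
for output, (provider, inputs) in graph.items():
    print(output, '<-', provider.__name__, inputs)
```

In this sketch `make_list` contributes two nodes, one for `list[int]` and one for `list[float]`, which is the property that lets the graph structure be known before any output is requested.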