
JSON serializer for task graphs #124

Merged
merged 45 commits into from
Mar 13, 2024

Conversation

jl-wynen
Member

First step of #92

Please start with the new notebook to read about the reasoning behind the format and see examples.

@SimonHeybrock
Member

visualize is not about provenance.

This doesn't answer my question. Why is it useful to show the types as separate nodes in visualize but not for provenance tracking?

Not sure why you don't see that having the types in the visualization is useful. But computation-wise, the graph of providers provides a full definition.

But if we would distinguish between 'input' nodes (which may get serialized values eventually) and 'data' nodes (values never serialized) that would be equivalent.

What would be equivalent?

Adding an extra attribute in a compute node saying "I am computing type X" is equivalent to adding a data node between this compute node and all dependent compute nodes.

Table providers are handled the same as others. It is the data nodes that have special handling because they encode which table entry they represent. If this wasn't the case, we would have many duplicate nodes, e.g., float(int), in the graph.

Not an issue if my first item is taken into account.

Well, the special handling would be moved from data to provider nodes. But it would still be there.

Why, how? I don't see the need for special handling. You'd just have the same compute node N times (all with a unique id).

@jl-wynen
Member Author

Why, how? I don't see the need for special handling. You'd just have the same compute node N times (all with a unique id).

And identical return types. Or return types that are not easily computer-readable. (Without special handling, the serialized type becomes, e.g., "builtins.float(builtins.int:0)", and that still needs special handling in key_full_qualname.)

@SimonHeybrock
Member

Why, how? I don't see the need for special handling. You'd just have the same compute node N times (all with a unique id).

And identical return types. Or not easily computer-readable return types (Without special handling, the serialized type becomes, e.g., "builtins.float(builtins.int:0)" and that still has special handling in key_full_qualname)

What is the problem with identical return types? You will get N data nodes (all with a unique id).

@jl-wynen
Member Author

Why, how? I don't see the need for special handling. You'd just have the same compute node N times (all with a unique id).

And identical return types. Or not easily computer-readable return types (Without special handling, the serialized type becomes, e.g., "builtins.float(builtins.int:0)" and that still has special handling in key_full_qualname)

What is the problem with identical return types? You will get N data nodes (all with a unique id).

Not when you merge data and provider nodes. And even if not, you get a bunch of nodes that are identical up to the id. So you know they represent different things, but cannot tell what they represent.

@SimonHeybrock
Member

Why, how? I don't see the need for special handling. You'd just have the same compute node N times (all with a unique id).

And identical return types. Or not easily computer-readable return types (Without special handling, the serialized type becomes, e.g., "builtins.float(builtins.int:0)" and that still has special handling in key_full_qualname)

What is the problem with identical return types? You will get N data nodes (all with a unique id).

Not when you merge data and provider nodes.

I think that discussion is orthogonal and does not change anything.

And even if not, you get a bunch of nodes that are identical up to the id. So you know they represent different things, but cannot tell what they represent.

So? I thought this is about provenance. If you look at an arbitrary non-param-table graph you also have no idea what it represents, without knowing which input data was used, do you?

@jl-wynen jl-wynen mentioned this pull request Feb 14, 2024
@jl-wynen
Member Author

So? I thought this is about provenance. If you look at an arbitrary non-param-table graph you also have no idea what it represents, without knowing which input data was used, do you?

But the graph has all graph-related info. Yes, parameters are missing, this was part of the requirements for this task. But with tables, even the graph is incompletely specified:

graph LR
A1("A(Label)") --> B1("B(Label)") --> C
A2("A(Label)") --> B2("B(Label)") --> C
A1("A(Label)") --> D

Which part of the param table is used by D and which by C?

To be honest, I don't fully understand your objection. Which part of the information in the graph should be omitted?

@SimonHeybrock
Member

But the graph has all graph-related info. Yes, parameters are missing, this was part of the requirements for this task. But with tables, even the graph is incompletely specified:
[...]
Which part of the param table is used by D and which by C?

The value "1" from the table is used by D, both "1" and "2" are used by C? "1" and "2" are "input" nodes.

graph LR
1-->A1
2-->A2
A1("A") --> B1("B") --> C
A2("A") --> B2("B") --> C
A1("A") --> D

"It can, however, only capture part of the actual pipeline.\n",
"For example, it only shows the structure of the graph and contains the names of functions and types.\n",
"But it does not encode the implementation of those functions or types.\n",
"Thus, the graph can only be correctly reconstructed in an environment that contains the same software that was used to write the graph.\n",
Member

Would be nice to show an example of loading the json, building a new task graph from it and computing the result?
But I guess you've only implemented the saving of graphs, not loading them?

Member Author

I considered writing a loader. But that needs more info than is in the graph right now (parameters, package versions). And it's non-trivial because it needs some importlib finagling, so I wouldn't necessarily show it in the docs.
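For context, resolving a serialized qualified name back to a live Python object is where importlib comes in. A minimal sketch, assuming a hypothetical helper named `locate` (it is not the PR's implementation and does not handle nested classes or parametrized generics such as `Series[...]`):

```python
import importlib


def locate(qualname: str):
    # Hypothetical helper: resolve a dotted name like 'builtins.float'
    # to the object it names. Nested classes and generic parameters
    # would need extra handling.
    module_name, _, attr = qualname.rpartition('.')
    module = importlib.import_module(module_name)
    return getattr(module, attr)
```

This also illustrates why a loader needs more than the graph itself: the named modules must be importable in the reading environment.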

# Example: Series[RowId, Material[Country]] -> RowId, Material[Country]
return name.partition('[')[-1].rpartition(']')[0]
return sgname
Member

This used to return the first name after splitting with [, but is now returning the whole thing?
Is that intentional?

Member Author

@jl-wynen jl-wynen Feb 14, 2024

No, it's an oversight. It means that, previously, all subgraphs that return sgname[*] were merged. Was that intentional?
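To illustrate the behaviour being discussed, using the type name from the comment in the diff (purely illustrative):

```python
name = 'Series[RowId, Material[Country]]'

# The expression in the diff extracts everything between the
# outermost brackets.
inner = name.partition('[')[-1].rpartition(']')[0]
print(inner)  # RowId, Material[Country]

# The earlier behaviour described in the review: only the first name
# before the bracket.
first = name.split('[')[0]
print(first)  # Series
```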

tg = TaskGraph(graph=graph, keys=str)
res = tg.serialize()
schema = json_schema()
jsonschema.validate(res, schema)
Member

Along the lines of my first comment, I think it would be good if we could have something a bit like a round-trip test where we save the graph, then load it again and compute something with it and check that it's the same as what the original graph gives?

But this implies we have something that can load saved graphs...
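Such a round trip might be sketched as below. This is a hedged sketch: no loader exists in this PR, so only the JSON layer is exercised, and all field names except those visible in the reviewed code ('id', 'kind', 'label', 'function', 'out', 'args', 'kwargs') are assumptions.

```python
import json

# Hedged sketch: serialize -> dump -> parse -> compare. A real round-trip
# test would additionally rebuild a TaskGraph and compute a result.
serialized = {
    'nodes': [
        {'id': '1', 'kind': 'parameter', 'label': 'x',
         'out': 'builtins.int'},
        {'id': '0', 'kind': 'function', 'label': 'f',
         'function': 'mymod.f', 'out': 'builtins.float',
         'args': ['1'], 'kwargs': {}},
    ],
    # 'edges', 'source', 'target' are assumed names, not from the PR.
    'edges': [{'source': '1', 'target': '0'}],
}
roundtripped = json.loads(json.dumps(serialized))
assert roundtripped == serialized
```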

@jl-wynen
Member Author

But the graph has all graph-related info. Yes, parameters are missing, this was part of the requirements for this task. But with tables, even the graph is incompletely specified:
[...]
Which part of the param table is used by D and which by C?

The value "1" from the table is used by D, both "1" and "2" are used by C? "1" and "2" are "input" nodes.

graph LR
1-->A1
2-->A2
A1("A") --> B1("B") --> C
A2("A") --> B2("B") --> C
A1("A") --> D

Those input nodes correspond to the p_table_cell nodes in the PR and are not regular parameter nodes. So would you preserve those but remove the index information from all dependent nodes? This would mean that you have to walk the graph to find that information.

@SimonHeybrock
Member

SimonHeybrock commented Feb 15, 2024

But the graph has all graph-related info. Yes, parameters are missing, this was part of the requirements for this task. But with tables, even the graph is incompletely specified:
[...]
Which part of the param table is used by D and which by C?

The value "1" from the table is used by D, both "1" and "2" are used by C? "1" and "2" are "input" nodes.

graph LR
1-->A1
2-->A2
A1("A") --> B1("B") --> C
A2("A") --> B2("B") --> C
A1("A") --> D

Those input nodes correspond to the p_table_cell nodes in the PR and are not regular parameter nodes. So would you preserve those but remove the index information from all dependent nodes? This would mean that you have to walk the graph to find that information.

Yes, absolutely! Remember that Sciline's parameter tables are just a convenient way of building task graphs. They do however have little to do with the resulting task graph and actual computation. As this task is about provenance and reproducibility (independent of Sciline), we have to avoid exposing Sciline "implementation details" in the JSON representation.

In fact I would be tempted to argue that a graph that was made without parameter tables "by hand" should result in the same JSON (modulo something related to the provider that makes the series).

So would you preserve those but remove the index information from all dependent nodes? This would mean that you have to walk the graph to find that information.

Maybe not just remove from the dependent nodes, but also remove the index from the input nodes. We might get the indices into play only in the Series provider.

@jl-wynen
Member Author

jl-wynen commented Mar 1, 2024

Implemented a new schema with these changes:

  • no support for param tables
  • merged data and provider nodes
  • renamed type -> out
  • function nodes now encode the order of arguments

Comment on lines 102 to 110
node = {
'id': node_id,
'kind': 'function',
'label': provider_name(provider),
'function': provider_full_qualname(provider),
'out': key_full_qualname(key),
'args': args,
'kwargs': kwargs,
}
Member

Without having looked at everything, I think this looks much better now (with the Sciline-specifics removed) 👍

Question: With out as part of the node, all dependent nodes are linked to the id of the provider. Maybe that is not so great, and we should have an id for the data, i.e., re-introduce "data" nodes so they can have their own id? Or do you think it does not matter?

Member Author

I had a long discussion with Max about this. Separating data and functions is in principle cleaner. It would even remove the need for the parameter kind. And if we ever need to attach attributes to data, we could do so without affecting function nodes.

However, it makes the graph harder to read and the whole process a little more complicated. So the thought was to go with what I implemented now and see how that goes and change later if need be. We still have a little time to figure things out.
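For concreteness, the two designs under discussion might be sketched as follows (all values and the 'data'/'type' field names are illustrative; only the function-node fields appear in the reviewed diff):

```python
# Merged design (implemented in this PR): the function node carries its
# output type in 'out'; dependents reference the provider's id directly.
merged = {'id': 'p0', 'kind': 'function', 'function': 'mymod.f',
          'out': 'builtins.float', 'args': [], 'kwargs': {}}

# Separate design (discussed): a distinct data node with its own id,
# connected to the function node by an edge. Dependents would then
# reference 'd0' rather than 'p0'.
compute = {'id': 'p0', 'kind': 'function', 'function': 'mymod.f'}
data = {'id': 'd0', 'kind': 'data', 'type': 'builtins.float'}
edge = {'source': 'p0', 'target': 'd0'}
```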

Member

Separating data and functions is in principle cleaner.

I thought it would be cleaner especially since we also have input nodes already, a special kind of data node.

However, it makes the graph harder to read

By whom?

and the whole process a little more complicated.

Wouldn't it simply mean that providers need to add two nodes, the compute node and the output data node?

Member Author

I thought it would be cleaner especially since we also have input nodes already, a special kind of data node.

This is what I meant by

It would even remove the need for the parameter kind

However, it makes the graph harder to read

By whom?

and the whole process a little more complicated.

Wouldn't it simply mean that providers need to add two nodes, the compute node and the output data node?

By humans because they need to follow two edges to find how data relates to each other instead of one. And by computers because they need to take into account the two fundamentally different node types. E.g., for reconstructing a task graph, the reader must extract pairs of nodes for each task instead of just one.

For writing, this is not that much more complicated and well contained.

Member

Not sure I see your argument here. You said yourself initially that having data nodes is cleaner, and I think it would "decouple" the graph more. You say

By humans because they need to follow two edges to find how data relates to each other instead of one.

We show data nodes in Sciline's task graph visualization. And if you think about reading the JSON, I would argue that seeing "this input depends on this data node" is no worse than seeing "it depends on this compute node", especially considering that someone might, e.g., replace a subgraph with an intermediate result.

And by computers because they need to take into account the two fundamentally different node types. E.g., for reconstructing a task graph, the reader must extract pairs of nodes for each task instead of just one.

But they are connected through a direct edge? Is it actually complicated enough to warrant combining compute and data nodes?

Member Author

But they are connected through a direct edge? Is it actually complicated enough to warrant combining compute and data nodes?

No. This makes a minuscule difference on all points unless we foresee adding more fields to data nodes.
I'm happy to change back. I mainly went this way because you objected to having separate nodes initially.

Member

@SimonHeybrock SimonHeybrock Mar 5, 2024

I mainly went this way because you objected to having separate notes initially.

I did not object, I mainly asked about it! And my main concern there was the discrepancy between inputs (then stored as functions) and data nodes.

It seems AiiDA has data nodes? "In AiiDA, data provenance is tracked automatically and stored in the form of a directed acyclic graph. For example, each calculation is represented by a node, that is linked to its input and output data nodes."

Member Author

Yes, they do. This links into what I hinted at with storing extra metadata in data nodes. If we, e.g., stored intermediate results in a database or in files, we could link to them here, kind of like AiiDA does.

Member

@SimonHeybrock SimonHeybrock left a comment

I think this looks good now.

@jl-wynen jl-wynen merged commit 9b970ad into main Mar 13, 2024
5 checks passed
@jl-wynen jl-wynen deleted the save-graph branch March 13, 2024 12:17