
JSON serializer for task graphs #124

Merged
merged 45 commits into from
Mar 13, 2024

Conversation

jl-wynen
Member

First step of #92

Please start with the new notebook to read about the reasoning behind the format and see examples.

@SimonHeybrock
Member

visualize is not about provenance.

This doesn't answer my question. Why is it useful to show the types as separate nodes in visualize but not for provenance tracking?

Not sure why you don't see that having the types in the visualization is useful. But computation-wise, the graph of providers provides a full definition.

But if we would distinguish between 'input' nodes (which may get serialized values eventually) and 'data' nodes (values never serialized) that would be equivalent.

What would be equivalent?

Adding an extra attribute in a compute node saying "I am computing type X" is equivalent to adding a data node between this compute node and all dependent compute nodes.

Table providers are handled the same as others. It is the data nodes that have special handling because they encode which table entry they represent. If this wasn't the case, we would have many duplicate nodes, e.g., float(int), in the graph.

Not an issue if my first item is taken into account.

Well, the special handling would be moved from data to provider nodes. But it would still be there.

Why, how? I don't see the need for special handling. You'd just have the same compute node N times (all with a unique id).

@jl-wynen
Member Author

Why, how? I don't see the need for special handling. You'd just have the same compute node N times (all with a unique id).

And identical return types. Or return types that are not easily computer-readable. (Without special handling, the serialized type becomes, e.g., "builtins.float(builtins.int:0)", and that still needs special handling in key_full_qualname.)

@SimonHeybrock
Member

Why, how? I don't see the need for special handling. You'd just have the same compute node N times (all with a unique id).

And identical return types. Or not easily computer-readable return types (Without special handling, the serialized type becomes, e.g., "builtins.float(builtins.int:0)" and that still has special handling in key_full_qualname)

What is the problem with identical return types? You will get N data nodes (all with a unique id).

@jl-wynen
Member Author

Why, how? I don't see the need for special handling. You'd just have the same compute node N times (all with a unique id).

And identical return types. Or not easily computer-readable return types (Without special handling, the serialized type becomes, e.g., "builtins.float(builtins.int:0)" and that still has special handling in key_full_qualname)

What is the problem with identical return types? You will get N data nodes (all with a unique id).

Not when you merge data and provider nodes. And even if not, you get a bunch of nodes that are identical up to the id. So you know they represent different things, but cannot tell what they represent.

@SimonHeybrock
Member

Why, how? I don't see the need for special handling. You'd just have the same compute node N times (all with a unique id).

And identical return types. Or not easily computer-readable return types (Without special handling, the serialized type becomes, e.g., "builtins.float(builtins.int:0)" and that still has special handling in key_full_qualname)

What is the problem with identical return types? You will get N data nodes (all with a unique id).

Not when you merge data and provider nodes.

I think that discussion is orthogonal and does not change anything.

And even if not, you get a bunch of nodes that are identical up to the id. So you know they represent different things, but cannot tell what they represent.

So? I thought this is about provenance. If you look at an arbitrary non-param-table graph you also have no idea what it represents, without knowing which input data was used, do you?

@jl-wynen jl-wynen mentioned this pull request Feb 14, 2024
@jl-wynen
Member Author

So? I thought this is about provenance. If you look at an arbitrary non-param-table graph you also have no idea what it represents, without knowing which input data was used, do you?

But the graph has all graph-related info. Yes, parameters are missing, this was part of the requirements for this task. But with tables, even the graph is incompletely specified:

graph LR
A1("A(Label)") --> B1("B(Label)") --> C
A2("A(Label)") --> B2("B(Label)") --> C
A1("A(Label)") --> D

Which part of the param table is used by D and which by C?

To be honest, I don't fully understand your objection. Which part of the information in the graph should be omitted?

@SimonHeybrock
Member

But the graph has all graph-related info. Yes, parameters are missing, this was part of the requirements for this task. But with tables, even the graph is incompletely specified:
[...]
Which part of the param table is used by D and which by C?

The value "1" from the table is used by D, both "1" and "2" are used by C? "1" and "2" are "input" nodes.

graph LR
1-->A1
2-->A2
A1("A") --> B1("B") --> C
A2("A") --> B2("B") --> C
A1("A") --> D

"It can, however, only capture part of the actual pipeline.\n",
"For example, it only shows the structure of the graph and contains the names of functions and types.\n",
"But it does not encode the implementation of those functions or types.\n",
"Thus, the graph can only be correctly reconstructed in an environment that contains the same software that was used to write the graph.\n",
Member

Would be nice to show an example of loading the json, building a new task graph from it and computing the result?
But I guess you've only implemented the saving of graphs, not loading them?

Member Author

I considered writing a loader. But that needs more info than is in the graph right now (parameters, package versions). And it's non-trivial because it needs some importlib finagling, so I wouldn't necessarily show it in the docs.
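For context, resolving a serialized qualified name back to a live Python object is where importlib comes in. A minimal sketch, assuming a hypothetical helper named `locate` (it is not the PR's implementation and does not handle nested classes or parametrized generics such as `Series[...]`):

```python
import importlib


def locate(qualname: str):
    # Hypothetical helper: resolve a dotted name like 'builtins.float'
    # to the object it names. Nested classes and generic parameters
    # would need extra handling.
    module_name, _, attr = qualname.rpartition('.')
    module = importlib.import_module(module_name)
    return getattr(module, attr)
```

This also illustrates why a loader needs more than the graph itself: the named modules must be importable in the reading environment.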

# Example: Series[RowId, Material[Country]] -> RowId, Material[Country]
return name.partition('[')[-1].rpartition(']')[0]
return sgname
Member

This used to return the first name after splitting with [, but is now returning the whole thing?
Is that intentional?

Member Author

@jl-wynen jl-wynen Feb 14, 2024

No, it's an oversight. It means that, previously, all subgraphs that return sgname[*] were merged. Was that intentional?
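To illustrate the behaviour being discussed, using the type name from the comment in the diff (purely illustrative):

```python
name = 'Series[RowId, Material[Country]]'

# The expression in the diff extracts everything between the
# outermost brackets.
inner = name.partition('[')[-1].rpartition(']')[0]
print(inner)  # RowId, Material[Country]

# The earlier behaviour described in the review: only the first name
# before the bracket.
first = name.split('[')[0]
print(first)  # Series
```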

tg = TaskGraph(graph=graph, keys=str)
res = tg.serialize()
schema = json_schema()
jsonschema.validate(res, schema)
Member

Along the lines of my first comment, I think it would be good if we could have something a bit like a round-trip test where we save the graph, then load it again and compute something with it and check that it's the same as what the original graph gives?

But this implies we have something that can load saved graphs...
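Such a round trip might be sketched as below. This is a hedged sketch: no loader exists in this PR, so only the JSON layer is exercised, and all field names except those visible in the reviewed code ('id', 'kind', 'label', 'function', 'out', 'args', 'kwargs') are assumptions.

```python
import json

# Hedged sketch: serialize -> dump -> parse -> compare. A real round-trip
# test would additionally rebuild a TaskGraph and compute a result.
serialized = {
    'nodes': [
        {'id': '1', 'kind': 'parameter', 'label': 'x',
         'out': 'builtins.int'},
        {'id': '0', 'kind': 'function', 'label': 'f',
         'function': 'mymod.f', 'out': 'builtins.float',
         'args': ['1'], 'kwargs': {}},
    ],
    # 'edges', 'source', 'target' are assumed names, not from the PR.
    'edges': [{'source': '1', 'target': '0'}],
}
roundtripped = json.loads(json.dumps(serialized))
assert roundtripped == serialized
```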

@jl-wynen
Member Author

But the graph has all graph-related info. Yes, parameters are missing, this was part of the requirements for this task. But with tables, even the graph is incompletely specified:
[...]
Which part of the param table is used by D and which by C?

The value "1" from the table is used by D, both "1" and "2" are used by C? "1" and "2" are "input" nodes.

graph LR
1-->A1
2-->A2
A1("A") --> B1("B") --> C
A2("A") --> B2("B") --> C
A1("A") --> D

Those input nodes correspond to the p_table_cell nodes in the PR and are not regular parameter nodes. So would you preserve those but remove the index information from all dependent nodes? This would mean that you have to walk the graph to find that information.

@SimonHeybrock
Member

SimonHeybrock commented Feb 15, 2024

But the graph has all graph-related info. Yes, parameters are missing, this was part of the requirements for this task. But with tables, even the graph is incompletely specified:
[...]
Which part of the param table is used by D and which by C?

The value "1" from the table is used by D, both "1" and "2" are used by C? "1" and "2" are "input" nodes.

graph LR
1-->A1
2-->A2
A1("A") --> B1("B") --> C
A2("A") --> B2("B") --> C
A1("A") --> D

Those input nodes correspond to the p_table_cell nodes in the PR and are not regular parameter nodes. So would you preserve those but remove the index information from all dependent nodes? This would mean that you have to walk the graph to find that information.

Yes, absolutely! Remember that Sciline's parameter tables are just a convenient way of building task graphs. They do however have little to do with the resulting task graph and actual computation. As this task is about provenance and reproducibility (independent of Sciline), we have to avoid exposing Sciline "implementation details" in the JSON representation.

In fact I would be tempted to argue that a graph that was made without parameter tables "by hand" should result in the same JSON (modulo something related to the provider that makes the series).

So would you preserve those but remove the index information from all dependent nodes? This would mean that you have to walk the graph to find that information.

Maybe not just remove from the dependent nodes, but also remove the index from the input nodes. We might get the indices into play only in the Series provider.

@jl-wynen
Member Author

jl-wynen commented Mar 1, 2024

Implemented a new schema with these changes:

  • no support for param tables
  • merged data and provider nodes
  • renamed type -> out
  • function nodes now encode the order of arguments

Comment on lines 102 to 110
node = {
'id': node_id,
'kind': 'function',
'label': provider_name(provider),
'function': provider_full_qualname(provider),
'out': key_full_qualname(key),
'args': args,
'kwargs': kwargs,
}
Member

Without having looked at everything, I think this looks much better now (with the Sciline-specifics removed) 👍

Question: With out as part of the node, all dependent nodes are linked to the id of the provider. Maybe that is not so great, and we should have an id for the data, i.e., re-introduce "data" nodes so they can have their own id? Or do you think it does not matter?

Member Author

I had a long discussion with Max about this. Separating data and functions is in principle cleaner. It would even remove the need for the parameter kind. And if we ever need to attach attributes to data, we could do so without affecting function nodes.

However, it makes the graph harder to read and the whole process a little more complicated. So the thought was to go with what I implemented now and see how that goes and change later if need be. We still have a little time to figure things out.
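For concreteness, the two designs under discussion might be sketched as follows (all values and the 'data'/'type' field names are illustrative; only the function-node fields appear in the reviewed diff):

```python
# Merged design (implemented in this PR): the function node carries its
# output type in 'out'; dependents reference the provider's id directly.
merged = {'id': 'p0', 'kind': 'function', 'function': 'mymod.f',
          'out': 'builtins.float', 'args': [], 'kwargs': {}}

# Separate design (discussed): a distinct data node with its own id,
# connected to the function node by an edge. Dependents would then
# reference 'd0' rather than 'p0'.
compute = {'id': 'p0', 'kind': 'function', 'function': 'mymod.f'}
data = {'id': 'd0', 'kind': 'data', 'type': 'builtins.float'}
edge = {'source': 'p0', 'target': 'd0'}
```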

Member

Separating data and functions is in principle cleaner.

I thought it would be cleaner especially since we also have input nodes already, a special kind of data node.

However, it makes the graph harder to read

By whom?

and the whole process a little more complicated.

Wouldn't it simply mean that providers need to add two nodes, the compute node and the output data node?

Member Author

I thought it would be cleaner especially since we also have input nodes already, a special kind of data node.

This is what I meant by

It would even remove the need for the parameter kind

However, it makes the graph harder to read

By whom?

and the whole process a little more complicated.

Wouldn't it simply mean that providers need to add two nodes, the compute node and the output data node?

By humans because they need to follow two edges to find how data relates to each other instead of one. And by computers because they need to take into account the two fundamentally different node types. E.g., for reconstructing a task graph, the reader must extract pairs of nodes for each task instead of just one.

For writing, this is not that much more complicated and well contained.

Member

Not sure I see your argument here. You said yourself initially that having data nodes is cleaner, and I think it would "decouple" the graph more. You say

By humans because they need to follow two edges to find how data relates to each other instead of one.

We show data nodes in Sciline's task graph visualization. And if you think about reading the JSON, I would argue that seeing "this input depends on this data node" is no worse than seeing "it depends on this compute node", especially considering that someone might, e.g., replace a subgraph with an intermediate result.

And by computers because they need to take into account the two fundamentally different node types. E.g., for reconstructing a task graph, the reader must extract pairs of nodes for each task instead of just one.

But they are connected through a direct edge? Is it actually complicated enough to warrant combining compute and data nodes?

Member Author

But they are connected through a direct edge? Is it actually complicated enough to warrant combining compute and data nodes?

No. This makes a minuscule difference on all points unless we foresee adding more fields to data nodes.
I'm happy to change back. I mainly went this way because you objected to having separate nodes initially.

Member

@SimonHeybrock SimonHeybrock Mar 5, 2024

I mainly went this way because you objected to having separate notes initially.

I did not object, I mainly asked about it! And my main concern there was the discrepancy between inputs (then stored as functions) and data nodes.

It seems AiiDA has data nodes? "In AiiDA, data provenance is tracked automatically and stored in the form of a directed acyclic graph. For example, each calculation is represented by a node, that is linked to its input and output data nodes."

Member Author

Yes, they do. This links into what I hinted at with storing extra metadata in data nodes. If we, e.g., stored intermediate results in a database or in files, we could link to them here, kind of like AiiDA does.

Member

@SimonHeybrock SimonHeybrock left a comment

I think this looks good now.

@jl-wynen jl-wynen merged commit 9b970ad into main Mar 13, 2024
5 checks passed
@jl-wynen jl-wynen deleted the save-graph branch March 13, 2024 12:17