JSON serializer for task graphs #124
Conversation
I'm not sure why you don't see having the types in the visualization as useful. But computation-wise, the graph of providers provides a full definition.
Adding an extra attribute in a compute node saying "I am computing type X" is equivalent to adding a data node between this compute node and all dependent compute nodes.
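The claimed equivalence can be sketched with two hypothetical JSON-like fragments (node ids, field names, and the `nodes`/`edges` layout are illustrative, not the actual schema):

```python
# Variant 1: the compute node records its output type in an attribute.
with_attribute = {
    "nodes": [
        {"id": "f1", "kind": "function", "out": "X"},
        {"id": "f2", "kind": "function"},
    ],
    "edges": [{"source": "f1", "target": "f2"}],
}

# Variant 2: the same information expressed as an explicit data node
# inserted between the compute node and its dependent.
with_data_node = {
    "nodes": [
        {"id": "f1", "kind": "function"},
        {"id": "d1", "kind": "data", "type": "X"},
        {"id": "f2", "kind": "function"},
    ],
    "edges": [
        {"source": "f1", "target": "d1"},
        {"source": "d1", "target": "f2"},
    ],
}

# Both encode the same fact: f1 produces a value of type X consumed by f2.
```

Either form carries the type information; the debate below is about which one is easier to read and to process.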
Why, how? I don't see the need for special handling. You'd just have the same compute node N times (all with a unique `id`).
And identical return types. Or return types that are not easily computer-readable. (Without special handling, the serialized type becomes, e.g., …)
What is the problem with identical return types? You will get N data nodes (all with a unique `id`).
Not when you merge data and provider nodes. And even if not, you get a bunch of nodes that are identical up to the id. So you know they represent different things, but cannot tell what they represent.
I think that discussion is orthogonal and does not change anything.
So? I thought this was about provenance. If you look at an arbitrary non-param-table graph, you also have no idea what it represents without knowing which input data was used, do you?
But the graph has all graph-related info. Yes, parameters are missing; this was part of the requirements for this task. But with tables, even the graph is incompletely specified:

```mermaid
graph LR
A1("A(Label)") --> B1("B(Label)") --> C
A2("A(Label)") --> B2("B(Label)") --> C
A1("A(Label)") --> D
```
Which part of the param table is used by D and which by C? To be honest, I don't fully understand your objection. Which part of the information in the graph should be omitted?
The value "1" from the table is used by D, both "1" and "2" are used by C? "1" and "2" are "input" nodes:

```mermaid
graph LR
1-->A1
2-->A2
A1("A") --> B1("B") --> C
A2("A") --> B2("B") --> C
A1("A") --> D
```
```
"It can, however, only capture part of the actual pipeline.\n",
"For example, it only shows the structure of the graph and contains the names of functions and types.\n",
"But it does not encode the implementation of those functions or types.\n",
"Thus, the graph can only be correctly reconstructed in an environment that contains the same software that was used to write the graph.\n",
```
It would be nice to show an example of loading the JSON, building a new task graph from it, and computing the result?
But I guess you've only implemented the saving of graphs, not loading them?
I considered writing a loader. But that needs more info than is in the graph right now (parameters, package versions). And it's non-trivial because it needs some importlib finagling, so I wouldn't necessarily show it in the docs.
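The importlib part of such a loader might look roughly like this (a sketch; `resolve_qualname` is a hypothetical helper, and a real loader would additionally need the parameters and package-version checks mentioned above):

```python
import importlib


def resolve_qualname(qualname: str):
    """Resolve a dotted qualified name like 'json.dumps' back to the object.

    Hypothetical helper: tries progressively shorter prefixes as module
    paths, then walks the remaining parts as attributes.
    """
    parts = qualname.split('.')
    for i in range(len(parts), 0, -1):
        try:
            obj = importlib.import_module('.'.join(parts[:i]))
        except ModuleNotFoundError:
            continue
        for attr in parts[i:]:
            obj = getattr(obj, attr)
        return obj
    raise ImportError(f'Cannot resolve {qualname!r}')
```

This only works when the named modules are importable in the reading environment, which is exactly the constraint stated in the docs excerpt above.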
```python
# Example: Series[RowId, Material[Country]] -> RowId, Material[Country]
return name.partition('[')[-1].rpartition(']')[0]
return sgname
```
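For reference, the expression on the commented line strips only the outermost brackets and keeps nested ones, as the comment's example shows (a small illustrative sketch; the function name is hypothetical):

```python
def strip_outer_brackets(name: str) -> str:
    # Keep everything between the first '[' and the last ']',
    # so nested brackets survive intact. Returns '' if there are
    # no brackets at all.
    return name.partition('[')[-1].rpartition(']')[0]


print(strip_outer_brackets('Series[RowId, Material[Country]]'))
# -> RowId, Material[Country]
```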
This used to return the first name after splitting with `[`, but is now returning the whole thing? Is that intentional?
No, it's an oversight. It means that before, all subgraphs that return `sgname[*]` were merged. Is this intentional?
```python
tg = TaskGraph(graph=graph, keys=str)
res = tg.serialize()
schema = json_schema()
jsonschema.validate(res, schema)
```
Along the lines of my first comment, I think it would be good to have something like a round-trip test: save the graph, load it again, compute something with it, and check that the result matches what the original graph gives.
But this implies we have something that can load saved graphs...
Those input nodes correspond to the …
Yes, absolutely! Remember that Sciline's parameter tables are just a convenient way of building task graphs. They do however have little to do with the resulting task graph and actual computation. As this task is about provenance and reproducibility (independent of Sciline), we have to avoid exposing Sciline "implementation details" in the JSON representation. In fact I would be tempted to argue that a graph that was made without parameter tables "by hand" should result in the same JSON (modulo something related to the provider that makes the series).
Maybe not just remove from the dependent nodes, but also remove the index from the input nodes. We might get the indices into play only in the …
```
# Conflicts:
#   src/sciline/_provider.py
#   src/sciline/task_graph.py
```
Implemented a new schema with these changes: …
src/sciline/serialize/_json.py (outdated)
```python
node = {
    'id': node_id,
    'kind': 'function',
    'label': provider_name(provider),
    'function': provider_full_qualname(provider),
    'out': key_full_qualname(key),
    'args': args,
    'kwargs': kwargs,
}
```
Without having looked at everything, I think this looks much better now (with the Sciline-specifics removed) 👍
Question: With the `out` as part of the node, we make all dependent nodes linked to the `id` of the provider. Maybe that is not so great, and we should have an id for the data, i.e., re-introduce "data" nodes so they can have their own `id`? Or do you think it does not matter?
I had a long discussion with Max about this. Separating data and functions is in principle cleaner. It would even remove the need for the parameter kind. And if we ever need to attach attributes to data, we could do so without affecting function nodes.
However, it makes the graph harder to read and the whole process a little more complicated. So the thought was to go with what I implemented now and see how that goes and change later if need be. We still have a little time to figure things out.
> Separating data and functions is in principle cleaner.

I thought it would be cleaner, especially since we also have input nodes already, a special kind of data node.

> However, it makes the graph harder to read

By whom?

> and the whole process a little more complicated.

Wouldn't it simply mean that providers need to add two nodes, the compute node and the output data node?
> I thought it would be cleaner especially since we also have input nodes already, a special kind of data node.

This is what I meant by

> It would even remove the need for the parameter kind

> > However, it makes the graph harder to read
>
> By whom?

> > and the whole process a little more complicated.
>
> Wouldn't it simply mean that providers need to add two nodes, the compute node and the output data node?
By humans, because they need to follow two edges to find out how data relates to each other instead of one. And by computers, because they need to take into account the two fundamentally different node types. E.g., for reconstructing a task graph, the reader must extract pairs of nodes for each task instead of just one.
For writing, this is not that much more complicated and is well contained.
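That pairing step could look roughly like this (a sketch assuming a hypothetical split layout in which every function node has exactly one outgoing edge, to its output data node):

```python
def pair_tasks(nodes, edges):
    """For each function node, find the data node it outputs.

    Assumes a hypothetical split representation: nodes are dicts with
    'id' and 'kind', edges are dicts with 'source' and 'target', and
    every function node has exactly one outgoing edge to a data node.
    """
    by_id = {n["id"]: n for n in nodes}
    pairs = []
    for n in nodes:
        if n["kind"] != "function":
            continue
        # Collect targets of this function node's outgoing edges.
        out = [e["target"] for e in edges if e["source"] == n["id"]]
        # out[0] is the data node produced by this function node.
        pairs.append((n, by_id[out[0]]))
    return pairs
```

With the merged representation, the reader skips this step entirely; whether that saving matters is exactly the question debated below.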
Not sure I see your argument here. You said yourself initially that having data nodes is cleaner, and I think it would "decouple" the graph more. You say

> By humans because they need to follow two edges to find how data relates to each other instead of one.

We show data nodes in Sciline's task-graph visualization, and if you think about reading the JSON, then I would argue that seeing "this input depends on this data node" is not worse than seeing "it depends on this compute node", especially if we consider that some might, e.g., replace a subgraph with an intermediate result.

> And by computers because they need to take into account the two fundamentally different node types. E.g., for reconstructing a task graph, the reader must extract pairs of nodes for each task instead of just one.

But they are connected through a direct edge? Is it actually complicated enough to warrant combining compute and data nodes?
> But they are connected through a direct edge? Is it actually complicated enough to warrant combining compute and data nodes?

No. This makes a minuscule difference on all points unless we foresee adding more fields to data nodes.
I'm happy to change back. I mainly went this way because you objected to having separate nodes initially.
> I mainly went this way because you objected to having separate nodes initially.

I did not object, I mainly asked about it! And my main concern there was the discrepancy between inputs (then stored as functions) and data nodes.
It seems AiiDA has data nodes?

> In AiiDA, data provenance is tracked automatically and stored in the form of a directed acyclic graph. For example, each calculation is represented by a node, that is linked to its input and output data nodes.
Yes, they do. This links into what I hinted at with storing extra metadata in data nodes. If we, e.g., stored intermediate results in a database or in files, we could link to them here, kind of like AiiDA does.
I think this looks good now.
First step of #92
Please start with the new notebook to read about the reasoning behind the format and see examples.