-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor to a cleaner implementation #3
Conversation
Not happy yet, but happier
tests/graph_test.py
Outdated
if getattr(node, 'name', None) == 'c' | ||
] | ||
c_values = [data['value'] for data in c_data] | ||
assert c_values == [1, 2] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does map
guarantee an ordering of the mapped nodes? This seems tricky to pull off given that we rely on NetworkX to store the graph.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cyclebane does not have edge order, but the node names contain the index. The real question is: does Sciline make use of the indices, to guarantee an order in the reduce
operations? Currently it does not, but I think it probably could.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do the node names factor into this? The code here simply iterates through result.nodes
but doesn't sort in any way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify if you meant "why does this test pass", or if you are asking about the general behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify if you meant "why does this test pass",
Essentially, yes. I would like to know if the implementation of the graph can change in the future and break the test. E.g., by switching to a different storage layout.
Your comment
Cyclebane does not have edge order, but the node names contain the index.
led me to think that because the name includes the dim, the nodes are ordered according to the dim. But I don't see how that factors into this test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is probably because NetworkX is using dict
internally, i.e., insertion order is preserved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated this and other tests to avoid this, see last commit.
|
||
graph = cb.Graph(g) | ||
mapped = graph.map({'a': [1, 2, 3]}) | ||
b = cb.Graph(nx.DiGraph()).map({'b': [11, 12]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you make b
a completely new graph? What happens when you map graph
over both 'a'
and 'b'
independently and use __setitem__
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, the indices will not be in conflict, so that would be a different test?
In the current implementation I am not trying to detect compatible indices, so if naming clashes one would still get an error. I am adding a test showing/documenting this behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this test is only about the index name not merging subgraphs that were mapped differently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On closer look, I think there is a loophole someone in the __setitem__
logic around handling of mapped nodes. It is currently hidden by the rejection of compatible mappings, but might surface unnoticed if we implement that. Need to have a think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rewrote the test, and preventing setting mapped nodes (for now), and added new tests also for that.
src/cyclebane/node_values.py
Outdated
|
||
@staticmethod | ||
def from_array_like(values: Any, *, axis_zero: int = 0) -> ValueArray: | ||
if hasattr(values, 'dims'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems risky to rely on some arbitrary attribute name. Can't you use isinstance
checks instead? (The same applies to if hasattr(self._data_array, 'isel'):
and if (columns := getattr(values, 'columns', None)) is not None:
below.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot use isinstance
, since the libraries might not be installed. I could use a string comparison on the class name instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about this?
def _is_xarray_data_array(x: object) -> bool:
try:
import xarray as xr
except ModuleNotFoundError:
return False
return isinstance(x, xr.DataArray)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See update.
src/cyclebane/node_values.py
Outdated
values = self._data_array | ||
for label, i in key: | ||
# This is Scipp notation, Xarray uses the 'isel' method. | ||
values = values[(label, i)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values = values[(label, i)] | |
values = values[label, i] |
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used an explicit (rather than implicit) tuple since a reader of Cyclebane might not be familiar with Scipp. I though the notation without only a comma might be more confusing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this code is comprehensible to people who are unfamiliar with Scipp in any case. But ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am surprised that the classes here use shape
instead of sizes
(as in Scipp). Don't the same design considerations apply here that led you to not allow positional indexing in Scipp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, when considering the inputs, then in several cases sizes
or dims
is not available. Named axes are only added by the adapter classes. I have not named it dims
, but stuck closer to Pandas and used index_names
. sizes
would be dict(zip(shape, index_names))
, and I have not needed it in the implementation yet. Did you have a particular code location in mind where it should be used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Named axes are only added by the adapter classes.
Exactly. I am asking about the interface of the adapters, not their implementation. I would have expected that the adapters require indexing by dim/index name.
I don't have a particular code location in mind. This is more about the general API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix this now, as it is related to some problems (unsupported cases) in by_position
.
IndexValue = Hashable | ||
|
||
|
||
class ValueArray(ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. But not easily extensible for users because the concrete type is selected within NodeValues._to_value_arrays
. If this is a concern, I think you should add a mechanism to allow users to register their own ValueArrays
. E.g.
_VALUE_ARRAY_IMPLEMENTATIONS: dict[type, type] = {}
try:
import pandas as pd
_VALUE_ARRAY_IMPLEMENTATIONS[pd.Series] = PandasSeriesAdapter
except ModuleNotFoundError:
pass
# ...
def register_value_array_adapter(key: type, adapter: type) -> None:
_VALUE_ARRAY_IMPLEMENTATIONS[key] = adapter
And then select an implementation based on the type. (This may need a fallback to SequenceAdapter
if the type is not in the map because there is no unique key for sequences.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a concern for now, I'd think? Supporting Pandas, Numpy, list, Xarray, and Scipp should get us pretty far?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a registry implementation (different from suggestion) after all, while addressing the other comment in instance checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add tests for all adapter types!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would prefer adding this indirectly as tests of Graph
. The adapters are kind of an implementation detail, as I see it. Would you agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
tests/graph_test.py
Outdated
@@ -358,7 +373,7 @@ def test_reduce_raises_if_new_node_name_exists() -> None: | |||
|
|||
graph = cb.Graph(g) | |||
mapped = graph.map({'a': [1, 2, 3]}) | |||
with pytest.raises(ValueError): | |||
with pytest.raises(ValueError, match="Node other already exists in the graph."): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with pytest.raises(ValueError, match="Node other already exists in the graph."): | |
with pytest.raises(ValueError, match="Node 'other' already exists in the graph."): |
(And in the implementation)
This threw me off on first reading.
src/cyclebane/graph.py
Outdated
root_node_graph = nx.DiGraph() | ||
root_node_graph.add_nodes_from(root_nodes) | ||
graph = nx.compose(self.graph, root_node_graph) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
root_node_graph = nx.DiGraph() | |
root_node_graph.add_nodes_from(root_nodes) | |
graph = nx.compose(self.graph, root_node_graph) | |
graph = self.graph.copy() | |
graph.add_nodes_from(root_nodes) |
This should be equivalent?
src/cyclebane/graph.py
Outdated
if graph.in_degree(root) > 0: | ||
raise ValueError("Node is not a root node") | ||
raise ValueError(f"Mapped node '{root}' is not a source node") | ||
nodes = nx.dfs_successors(graph, root) | ||
successors.update( | ||
set(node for node_list in nodes.values() for node in node_list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be something like:
successors.update(nx.descendants(G, source=root) | {root})
I think I have addressed the comments, please have another look! |
src/cyclebane/node_values.py
Outdated
try: | ||
import pandas | ||
except ModuleNotFoundError: | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return False | |
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Fix here and in other adapter.
This factors the handling of node value slicing out of
Graph
, making it simpler to reason about and work with. An adapter is added to unify handling of various array/series-like objects that can be used when mapping node values.I highly recommend to not review the diff, but the entire files/repo. There has not been a full/detailed review of this so far, and the diff on its own is relatively meaningless.
@jl-wynen As you have reviewed/used the new Sciline I am requesting your review as you are familiar with this.
@MridulS Requesting your review as well, since (a) you are familiar with NetworkX and should be able to spot all my bugs and anti-patterns, and (b) this is a core component underlying a lot of other bits of our software stack, so a third set of eyes is justified.
This should work with scipp/sciline#165.