Refactor to a cleaner implementation #3

SimonHeybrock · 2024-05-21T09:00:47Z

This factors the handling of node value slicing out of Graph, making it simpler to reason about and work with. An adapter is added to unify handling of various array/series-like objects that can be used when mapping node values.

I highly recommend reviewing the entire files/repo rather than the diff. There has not been a full/detailed review of this so far, and the diff on its own is relatively meaningless.

@jl-wynen Since you have reviewed and used the new Sciline and are familiar with it, I am requesting your review.
@MridulS Requesting your review as well, since (a) you are familiar with NetworkX and should be able to spot all my bugs and anti-patterns, and (b) this is a core component underlying a lot of other bits of our software stack, so a third set of eyes is justified.

This should work with scipp/sciline#165.

Not happy yet, but happier

src/cyclebane/graph.py

jl-wynen · 2024-05-21T18:16:39Z

tests/graph_test.py

+        if getattr(node, 'name', None) == 'c'
+    ]
+    c_values = [data['value'] for data in c_data]
+    assert c_values == [1, 2]


Does map guarantee an ordering of the mapped nodes? This seems tricky to pull off given that we rely on NetworkX to store the graph.

Cyclebane does not have edge order, but the node names contain the index. The real question is: does Sciline make use of the indices, to guarantee an order in the reduce operations? Currently it does not, but I think it probably could.

How do the node names factor into this? The code here simply iterates through result.nodes but doesn't sort in any way.

Can you clarify if you meant "why does this test pass", or if you are asking about the general behavior?

> Can you clarify if you meant "why does this test pass",

Essentially, yes. I would like to know if the implementation of the graph can change in the future and break the test. E.g., by switching to a different storage layout.

Your comment

> Cyclebane does not have edge order, but the node names contain the index.

led me to think that because the name includes the dim, the nodes are ordered according to the dim. But I don't see how that factors into this test.

I think it is probably because NetworkX is using dict internally, i.e., insertion order is preserved?

I updated this and other tests to avoid this, see last commit.
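One way to make a test like the one discussed in this thread independent of node iteration order is to compare sorted values (or sets) rather than lists. The following is only a sketch: the plain dict stands in for `result.nodes`, and `IndexedNode` is a hypothetical name for a mapped node, not necessarily what Cyclebane uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedNode:
    """Hypothetical stand-in for a mapped node: a name plus an index."""

    name: str
    index: int


# Stand-in for iterating `result.nodes(data=True)`. The insertion order is
# deliberately scrambled to mimic a storage layout that does not preserve
# the order in which nodes were mapped.
nodes = {
    IndexedNode('c', 1): {'value': 2},
    'other': {},
    IndexedNode('c', 0): {'value': 1},
}

c_data = [
    data for node, data in nodes.items() if getattr(node, 'name', None) == 'c'
]
# Sorting removes the dependence on iteration order.
c_values = sorted(data['value'] for data in c_data)
assert c_values == [1, 2]
```

With this shape, the assertion keeps passing even if the underlying graph storage stops preserving insertion order.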

jl-wynen · 2024-05-21T18:19:19Z

tests/graph_test.py

+
+    graph = cb.Graph(g)
+    mapped = graph.map({'a': [1, 2, 3]})
+    b = cb.Graph(nx.DiGraph()).map({'b': [11, 12]})


Why did you make b a completely new graph? What happens when you map graph over both 'a' and 'b' independently and use __setitem__?

In that case, the indices will not be in conflict, so that would be a different test?

In the current implementation I am not trying to detect compatible indices, so if names clash one would still get an error. I am adding a test showing/documenting this behavior.

So this test is only about the index name not merging subgraphs that were mapped differently?

On closer look, I think there is a loophole somewhere in the __setitem__ logic around handling of mapped nodes. It is currently hidden by the rejection of compatible mappings, but might surface unnoticed if we implement that. Need to have a think.

Rewrote the test, prevented setting mapped nodes (for now), and added new tests for that as well.

jl-wynen · 2024-05-21T18:28:00Z

src/cyclebane/node_values.py

+
+    @staticmethod
+    def from_array_like(values: Any, *, axis_zero: int = 0) -> ValueArray:
+        if hasattr(values, 'dims'):


It seems risky to rely on some arbitrary attribute name. Can't you use isinstance checks instead? (The same applies to if hasattr(self._data_array, 'isel'): and if (columns := getattr(values, 'columns', None)) is not None: below.)

We cannot use isinstance, since the libraries might not be installed. I could use a string comparison on the class name instead?

What about this?

    def _is_xarray_data_array(x: object) -> bool:
        try:
            import xarray as xr
        except ModuleNotFoundError:
            return False
        return isinstance(x, xr.DataArray)

See update.
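The pattern suggested above can be generalized to any optional dependency. The sketch below is an assumption-laden illustration, not the code that was merged: `is_instance_of_optional` is a hypothetical helper name, and the standard-library `fractions` module is used in place of Xarray so the example runs without extra packages.

```python
import importlib
from fractions import Fraction
from typing import Any


def is_instance_of_optional(obj: Any, module_name: str, class_name: str) -> bool:
    """Check `isinstance(obj, module_name.class_name)` for an optional module.

    Returns False instead of raising if the module is not installed.
    """
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError:
        return False
    cls = getattr(module, class_name, None)
    # Guard against the attribute being missing or not a class.
    return isinstance(cls, type) and isinstance(obj, cls)


# The module exists and the object has the right type.
found = is_instance_of_optional(Fraction(1, 2), 'fractions', 'Fraction')
# The module does not exist, so the check is simply False.
missing = is_instance_of_optional(1.5, 'module_that_is_not_installed_xyz', 'Foo')
```

Compared with `hasattr` checks, this only accepts objects of the actual class, at the cost of importing the module when it is available.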

jl-wynen · 2024-05-21T18:29:02Z

src/cyclebane/node_values.py

+        values = self._data_array
+        for label, i in key:
+            # This is Scipp notation, Xarray uses the 'isel' method.
+            values = values[(label, i)]


Suggested change

-            values = values[(label, i)]
+            values = values[label, i]

?

I used an explicit (rather than implicit) tuple since a reader of Cyclebane might not be familiar with Scipp. I thought the notation with only a comma might be more confusing?

I don't think this code is comprehensible to people who are unfamiliar with Scipp in any case. But ok.

Adding a comment.
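For readers unfamiliar with the notation: in Python, both spellings discussed here pass the same tuple to `__getitem__`, so the choice is purely about readability. A minimal illustration, using a toy class (`ShowKey` is a hypothetical name) instead of a Scipp object:

```python
class ShowKey:
    """Toy container that returns whatever key it is indexed with."""

    def __getitem__(self, key):
        return key


obj = ShowKey()
explicit = obj[('x', 2)]  # tuple written out explicitly
implicit = obj['x', 2]  # the comma alone already builds the tuple
assert explicit == implicit == ('x', 2)
```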

src/cyclebane/node_values.py

jl-wynen · 2024-05-21T18:37:36Z

src/cyclebane/node_values.py

I am surprised that the classes here use shape instead of sizes (as in Scipp). Don't the same design considerations apply here that led you to not allow positional indexing in Scipp?

Well, in several cases the inputs do not provide sizes or dims. Named axes are only added by the adapter classes. I have not named it dims, but stuck closer to Pandas and used index_names. sizes would be dict(zip(index_names, shape)), and I have not needed it in the implementation yet. Did you have a particular code location in mind where it should be used?

> Named axes are only added by the adapter classes.

Exactly. I am asking about the interface of the adapters, not their implementation. I would have expected that the adapters require indexing by dim/index name.
I don't have a particular code location in mind. This is more about the general API.

Will fix this now, as it is related to some problems (unsupported cases) in by_position.
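To make the `shape` versus `sizes` distinction concrete: a `sizes` mapping in the Scipp sense maps each axis name to its length, so the names are the keys. The helper below is a hypothetical sketch (`sizes_of` is not a Cyclebane function), assuming `index_names` and `shape` are given in the same axis order.

```python
from typing import Sequence


def sizes_of(index_names: Sequence[str], shape: Sequence[int]) -> dict[str, int]:
    """Combine axis names and lengths into a name -> length mapping."""
    if len(index_names) != len(shape):
        raise ValueError("index_names and shape must have the same length")
    return dict(zip(index_names, shape))


sizes = sizes_of(('x', 'y'), (3, 2))
assert sizes == {'x': 3, 'y': 2}
# Lookup by name does not depend on the position of the axis.
assert sizes['y'] == 2
```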

jl-wynen · 2024-05-21T18:42:31Z

src/cyclebane/node_values.py

+IndexValue = Hashable
+
+
+class ValueArray(ABC):


Good idea. But not easily extensible for users because the concrete type is selected within NodeValues._to_value_arrays. If this is a concern, I think you should add a mechanism to allow users to register their own ValueArrays. E.g.

    _VALUE_ARRAY_IMPLEMENTATIONS: dict[type, type] = {}
    try:
        import pandas as pd
        _VALUE_ARRAY_IMPLEMENTATIONS[pd.Series] = PandasSeriesAdapter
    except ModuleNotFoundError:
        pass

    # ...

    def register_value_array_adapter(key: type, adapter: type) -> None:
        _VALUE_ARRAY_IMPLEMENTATIONS[key] = adapter

And then select an implementation based on the type. (This may need a fallback to SequenceAdapter if the type is not in the map because there is no unique key for sequences.)

Not a concern for now, I'd think? Supporting Pandas, Numpy, list, Xarray, and Scipp should get us pretty far?

Added a registry implementation (different from the suggestion) after all, while addressing the other comment on instance checks.
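As a rough illustration of the registry-with-fallback idea discussed in this thread, here is a self-contained sketch. It is not the implementation that was merged; all names (`_ADAPTERS`, `to_value_array`, `RangeAdapter`, and this simplified `SequenceAdapter`) are assumptions made for the example.

```python
from typing import Any, Callable

# Hypothetical registry: maps an input type to a callable creating an adapter.
_ADAPTERS: dict[type, Callable[[Any], Any]] = {}


class SequenceAdapter:
    """Hypothetical fallback adapter for plain sequences."""

    def __init__(self, values: Any) -> None:
        self.values = list(values)


class RangeAdapter:
    """Hypothetical user-defined adapter for `range` objects."""

    def __init__(self, values: range) -> None:
        self.start, self.stop = values.start, values.stop


def register_value_array_adapter(key: type, adapter: Callable[[Any], Any]) -> None:
    _ADAPTERS[key] = adapter


def to_value_array(values: Any) -> Any:
    # Try registered types in registration order; the first isinstance match wins.
    for key, adapter in _ADAPTERS.items():
        if isinstance(values, key):
            return adapter(values)
    # No registered type matched: fall back to the generic sequence adapter.
    return SequenceAdapter(values)


register_value_array_adapter(range, RangeAdapter)
from_range = to_value_array(range(3))
from_list = to_value_array([1, 2, 3])
```

The fallback is needed because, as noted above, there is no unique key type for sequences.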

jl-wynen · 2024-05-21T18:43:52Z

src/cyclebane/node_values.py

Please add tests for all adapter types!

I think I would prefer adding this indirectly as tests of Graph. The adapters are kind of an implementation detail, as I see it. Would you agree?

src/cyclebane/graph.py

jl-wynen · 2024-05-22T07:22:12Z

tests/graph_test.py

@@ -358,7 +373,7 @@ def test_reduce_raises_if_new_node_name_exists() -> None:

    graph = cb.Graph(g)
    mapped = graph.map({'a': [1, 2, 3]})
-    with pytest.raises(ValueError):
+    with pytest.raises(ValueError, match="Node other already exists in the graph."):


Suggested change

-    with pytest.raises(ValueError, match="Node other already exists in the graph."):
+    with pytest.raises(ValueError, match="Node 'other' already exists in the graph."):

(And in the implementation)
This threw me off on first reading.
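Note that `match` in `pytest.raises` is applied as a regular expression using `re.search`, so the test pattern and the message in the implementation must change together, and the trailing `.` acts as a wildcard. The sketch below uses only `re` to show the matching behavior; the message string is taken from the suggestion above.

```python
import re

message = "Node 'other' already exists in the graph."

# The old pattern no longer matches once the name is quoted in the message.
old_matches = re.search("Node other already exists in the graph.", message)
# The new pattern matches the quoted message.
new_matches = re.search("Node 'other' already exists in the graph.", message)
# The trailing '.' is a regex wildcard, so a different final character also matches.
wildcard_matches = re.search(
    "Node 'other' already exists in the graph.", message[:-1] + "!"
)
# re.escape turns the message into a pattern that matches it literally.
strict = re.escape(message)
strict_matches = re.search(strict, message)
strict_rejects = re.search(strict, message[:-1] + "!")
```

If a literal match is wanted in the test, wrapping the expected message in `re.escape` is one option.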

MridulS · 2024-05-22T15:00:11Z

src/cyclebane/graph.py

+        root_node_graph = nx.DiGraph()
+        root_node_graph.add_nodes_from(root_nodes)
+        graph = nx.compose(self.graph, root_node_graph)


Suggested change

-        root_node_graph = nx.DiGraph()
-        root_node_graph.add_nodes_from(root_nodes)
-        graph = nx.compose(self.graph, root_node_graph)
+        graph = self.graph.copy()
+        graph.add_nodes_from(root_nodes)

This should be equivalent?

MridulS · 2024-05-22T15:06:16Z

src/cyclebane/graph.py

        if graph.in_degree(root) > 0:
-            raise ValueError("Node is not a root node")
+            raise ValueError(f"Mapped node '{root}' is not a source node")
        nodes = nx.dfs_successors(graph, root)
        successors.update(
            set(node for node_list in nodes.values() for node in node_list)


This can be something like:

successors.update(nx.descendants(G, source=root) | {root})
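For readers without NetworkX at hand, the following plain-Python sketch shows what an expression of the form `nx.descendants(graph, root) | {root}` computes: all nodes reachable from `root`, plus `root` itself. The adjacency dict and the `descendants` helper are hypothetical stand-ins written for this example, not NetworkX code.

```python
from collections import deque


def descendants(successors: dict[str, list[str]], source: str) -> set[str]:
    """All nodes reachable from `source`, excluding `source` itself."""
    seen: set[str] = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in successors.get(node, []):
            if child not in seen and child != source:
                seen.add(child)
                queue.append(child)
    return seen


# Edges: a -> b, a -> c, b -> c, d -> c
adjacency = {'a': ['b', 'c'], 'b': ['c'], 'd': ['c']}
reachable = descendants(adjacency, 'a')
with_root = reachable | {'a'}
```

Node `d` is not included because it is not reachable from `a`, even though it shares a successor with it.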

src/cyclebane/graph.py

SimonHeybrock · 2024-05-23T10:19:36Z

I think I have addressed the comments, please have another look!

jl-wynen · 2024-05-23T13:28:39Z

src/cyclebane/node_values.py

+        try:
+            import pandas
+        except ModuleNotFoundError:
+            return False


Suggested change

-            return False
+            return None

Thanks! Fixed here and in the other adapter.

SimonHeybrock added 19 commits May 21, 2024 09:40

Allow node insertion via map

c044373

Improve exceptions

431158e

Docstrings

d39c743

Begin improving __getitem__

7736e6e

Begin adding helper objects

caf6014

Begin using NodeValues for validation

d772388

Move check to helper

0946547

Continue refactor

bc1baef

Continue refactor

7695473

Handle node value updated in most places

ddc3977

Use abc.Mapping

faf8b45

Small fixes and cleanup

9d9ec55

Move to new file

27eb425

Cleanup and fix positional slicing

e93e58f

Cleanup and fixes

a33631c

Cleanup

2f927d8

Raise instead of TODO

af8214d

Fix condition

70792f2

Turn value_attr into method arg

8c36e59

SimonHeybrock requested review from MridulS and jl-wynen May 21, 2024 09:00

SimonHeybrock added 3 commits May 21, 2024 12:20

Avoid O(node*value_nodes) scaling

ebfb3d0

Remove unused

f1f335e

Make mypy happier

ecd65e2

Not happy yet, but happier

SimonHeybrock commented May 21, 2024

jl-wynen reviewed May 21, 2024

SimonHeybrock added 4 commits May 22, 2024 04:58

Use dataclass with slots=True

bb53178

Address cleanup comments from review

10e3425

Add tests demonstrating index mismatch error in compatible case

cbd97a6

Minor cleanup

4c7190b

Add and improve some tests

dfc2a70

jl-wynen reviewed May 22, 2024

MridulS reviewed May 22, 2024

SimonHeybrock added 9 commits May 23, 2024 09:10

Simplify networkx usage

3c39b66

Quote node name in error message

03ab32c

Comment on non-standard Scipp indexing notation

c4d34cd

Prevent setitem with mapped nodes, which would require special handling

08c24f5

Add note

58b60ba

Simplify logic

7653df6

Test that node attrs are preserved

fbb2dd8

Do not reuse variable name

77b629e

Cleanup adapter creation and add registry

7c90ff3

Use coords as indices and fix DataArray slicing shortcomings

15fdc27

jl-wynen requested changes May 23, 2024

SimonHeybrock added 2 commits May 29, 2024 05:03

Fix return values when modules not found

236ad8d

Avoid relying on node order in tests

ee342ab

jl-wynen approved these changes May 30, 2024

Merge branch 'main' into cleanup

f0bc34d

SimonHeybrock enabled auto-merge May 30, 2024 09:09

SimonHeybrock merged commit 8028735 into main May 30, 2024
3 checks passed

SimonHeybrock deleted the cleanup branch May 30, 2024 09:11
