Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor to a cleaner implementation #3

Merged
merged 40 commits into from
May 30, 2024
Merged

Refactor to a cleaner implementation #3

merged 40 commits into from
May 30, 2024

Conversation

SimonHeybrock
Copy link
Member

@SimonHeybrock SimonHeybrock commented May 21, 2024

This factors the handling of node value slicing out of Graph, making it simpler to reason about and work with. An adapter is added to unify handling of various array/series-like objects that can be used when mapping node values.

I highly recommend to not review the diff, but the entire files/repo. There has not been a full/detailed review of this so far, and the diff on its own is relatively meaningless.

@jl-wynen As you have reviewed/used the new Sciline I am requesting your review as you are familiar with this.
@MridulS Requesting your review as well, since (a) you are familiar with NetworkX and should be able to spot all my bugs and anti-patterns, and (b) this is a core component underlying a lot of other bits of our software stack, so a third set of eyes is justified.

This should work with scipp/sciline#165.

@SimonHeybrock SimonHeybrock requested review from MridulS and jl-wynen May 21, 2024 09:00
src/cyclebane/graph.py Outdated Show resolved Hide resolved
if getattr(node, 'name', None) == 'c'
]
c_values = [data['value'] for data in c_data]
assert c_values == [1, 2]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does map guarantee an ordering of the mapped nodes? This seems tricky to pull off given that we rely on NetworkX to store the graph.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cyclebane does not have edge order, but the node names contain the index. The real question is: does Sciline make use of the indices, to guarantee an order in the reduce operations? Currently it does not, but I think it probably could.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do the node names factor into this? The code here simply iterates through result.nodes but doesn't sort in any way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify if you meant "why does this test pass", or if you are asking about the general behavior?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify if you meant "why does this test pass",

Essentially, yes. I would like to know if the implementation of the graph can change in the future and break the test. E.g., by switching to a different storage layout.

Your comment

Cyclebane does not have edge order, but the node names contain the index.

led me to think that because the name includes the dim, the nodes are ordered according to the dim. But I don't see how that factors into this test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is probably because NetworkX is using dict internally, i.e., insertion order is preserved?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this and other tests to avoid this, see last commit.


graph = cb.Graph(g)
mapped = graph.map({'a': [1, 2, 3]})
b = cb.Graph(nx.DiGraph()).map({'b': [11, 12]})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you make b a completely new graph? What happens when you map graph over both 'a' and 'b' independently and use __setitem__?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, the indices will not be in conflict, so that would be a different test?

In the current implementation I am not trying to detect compatible indices, so if naming clashes one would still get an error. I am adding a test showing/documenting this behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this test is only about the index name not merging subgraphs that were mapped differently?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On closer look, I think there is a loophole someone in the __setitem__ logic around handling of mapped nodes. It is currently hidden by the rejection of compatible mappings, but might surface unnoticed if we implement that. Need to have a think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewrote the test, and preventing setting mapped nodes (for now), and added new tests also for that.


@staticmethod
def from_array_like(values: Any, *, axis_zero: int = 0) -> ValueArray:
if hasattr(values, 'dims'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems risky to rely on some arbitrary attribute name. Can't you use isinstance checks instead? (The same applies to if hasattr(self._data_array, 'isel'): and if (columns := getattr(values, 'columns', None)) is not None: below.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot use isinstance, since the libraries might not be installed. I could use a string comparison on the class name instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about this?

def _is_xarray_data_array(x: object) -> bool:
    try:
        import xarray as xr
    except ModuleNotFoundError:
        return False
    return isinstance(x, xr.DataArray)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See update.

values = self._data_array
for label, i in key:
# This is Scipp notation, Xarray uses the 'isel' method.
values = values[(label, i)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
values = values[(label, i)]
values = values[label, i]

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used an explicit (rather than implicit) tuple since a reader of Cyclebane might not be familiar with Scipp. I though the notation without only a comma might be more confusing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this code is comprehensible to people who are unfamiliar with Scipp in any case. But ok.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a comment.

src/cyclebane/node_values.py Outdated Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am surprised that the classes here use shape instead of sizes (as in Scipp). Don't the same design considerations apply here that led you to not allow positional indexing in Scipp?

Copy link
Member Author

@SimonHeybrock SimonHeybrock May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, when considering the inputs, then in several cases sizes or dims is not available. Named axes are only added by the adapter classes. I have not named it dims, but stuck closer to Pandas and used index_names. sizes would be dict(zip(shape, index_names)), and I have not needed it in the implementation yet. Did you have a particular code location in mind where it should be used?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Named axes are only added by the adapter classes.

Exactly. I am asking about the interface of the adapters, not their implementation. I would have expected that the adapters require indexing by dim/index name.
I don't have a particular code location in mind. This is more about the general API.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix this now, as it is related to some problems (unsupported cases) in by_position.

IndexValue = Hashable


class ValueArray(ABC):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. But not easily extensible for users because the concrete type is selected within NodeValues._to_value_arrays. If this is a concern, I think you should add a mechanism to allow users to register their own ValueArrays. E.g.

_VALUE_ARRAY_IMPLEMENTATIONS: dict[type, type] = {}
try:
    import pandas as pd
    _VALUE_ARRAY_IMPLEMENTATIONS[pd.Series] = PandasSeriesAdapter
except ModuleNotFoundError:
    pass

# ...

def register_value_array_adapter(key: type, adapter: type) -> None:
    _VALUE_ARRAY_IMPLEMENTATIONS[key] = adapter

And then select an implementation based on the type. (This may need a fallback to SequenceAdapter if the type is not in the map because there is no unique key for sequences.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a concern for now, I'd think? Supporting Pandas, Numpy, list, Xarray, and Scipp should get us pretty far?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a registry implementation (different from suggestion) after all, while addressing the other comment in instance checks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add tests for all adapter types!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would prefer adding this indirectly as tests of Graph. The adapters are kind of an implementation detail, as I see it. Would you agree?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

src/cyclebane/graph.py Outdated Show resolved Hide resolved
src/cyclebane/graph.py Outdated Show resolved Hide resolved
@@ -358,7 +373,7 @@ def test_reduce_raises_if_new_node_name_exists() -> None:

graph = cb.Graph(g)
mapped = graph.map({'a': [1, 2, 3]})
with pytest.raises(ValueError):
with pytest.raises(ValueError, match="Node other already exists in the graph."):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
with pytest.raises(ValueError, match="Node other already exists in the graph."):
with pytest.raises(ValueError, match="Node 'other' already exists in the graph."):

(And in the implementation)
This threw me off on first reading.

Comment on lines 262 to 264
root_node_graph = nx.DiGraph()
root_node_graph.add_nodes_from(root_nodes)
graph = nx.compose(self.graph, root_node_graph)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
root_node_graph = nx.DiGraph()
root_node_graph.add_nodes_from(root_nodes)
graph = nx.compose(self.graph, root_node_graph)
graph = self.graph.copy()
graph.add_nodes_from(root_nodes)

This should be equivalent?

if graph.in_degree(root) > 0:
raise ValueError("Node is not a root node")
raise ValueError(f"Mapped node '{root}' is not a source node")
nodes = nx.dfs_successors(graph, root)
successors.update(
set(node for node_list in nodes.values() for node in node_list)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be something like:

successors.update(nx.descendants(G, source=root) | {root})

@SimonHeybrock
Copy link
Member Author

I think I have addressed the comments, please have another look!

try:
import pandas
except ModuleNotFoundError:
return False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return False
return None

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Fix here and in other adapter.

@SimonHeybrock SimonHeybrock enabled auto-merge May 30, 2024 09:09
@SimonHeybrock SimonHeybrock merged commit 8028735 into main May 30, 2024
3 checks passed
@SimonHeybrock SimonHeybrock deleted the cleanup branch May 30, 2024 09:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants