Data refactor updates #547

Closed

Conversation

momchil-flex (Collaborator) commented Oct 11, 2022

Fixed:

  • Using xr.load_dataarray instead of xr.open_dataarray so that the file handle is not kept open, which could block subsequent writes (also modified the test to check for that); see the sketch below.
  • Automatically add a leading / to group_path for convenience.
  • The model json is now written under group_path rather than at the base level of the file. This is in view of writing multiple models to the same file (if we won't be doing that, specifying a group_path is a bit pointless to begin with...).
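For illustration, roughly what the first two points look like in practice (just a sketch, not the actual code; the file name and paths are made up):

import numpy as np
import xarray as xr

# xr.open_dataarray opens the file lazily and keeps a handle on it, which can block
# overwriting the file later; xr.load_dataarray reads the data into memory and closes
# the file right away.
da = xr.DataArray(np.zeros((2, 3)), dims=("x", "y"), name="Ex")
da.to_netcdf("data.nc")
loaded = xr.load_dataarray("data.nc")
da.to_netcdf("data.nc")  # overwriting works because no handle is left open

# adding the leading "/" to group_path automatically
group_path = "data/data_0"
if not group_path.startswith("/"):
    group_path = "/" + group_path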

Remaining issues I didn't have time for / would like to discuss:

  • The hdf5 file is currently always opened in write mode. This prevents storing multiple models in different groups in the same file (see the last point above). We don't necessarily have to support this; it's just convenient. It could also provide a way to fully load an individual monitor data re #545 (Allow loading/downloading part of the data), as opposed to individual data arrays only (what I'm thinking about is modifying the SimulationData.to_hdf5 method a bit to call each of the MonitorData-s' to_hdf5 with the corresponding group name). But basically I see three options:
    • Add a "mode" kwarg to to_hdf5 similar to xarray to_netcdf.
    • Always open in write mode if no custom group path provided, and in append mode if custom group path provided.
    • Just scrap this altogether, in which case I think we can also remove the group_path kwarg since, as I say above, if we can't store more than one model in the same file, group_path is not very useful.
  • Some issues with inheritance and the return types of methods. Namely, I can for example call FieldDataArray.interp, but this returns a regular xr.DataArray. Similarly, I can call ModeSolverData.sel_mode_index, but this returns a FieldDataset instead of FieldData. So:
    • Do we leave this as is and just handle manually if different behavior is needed?
    • Do we make wrappers for some of the methods that we use, e.g. FieldDataArray.interp that calls the xarray interp but returns a FieldDataArray?
    • Do we wrap sel_mode_index in ModeSolverData, leave it as is (only in ModeSolverDataset), or remove altogether? On some level I feel like we should either have a full-fledged sel wrapper or nothing at all. But it is convenient sometimes.

momchil-flex changed the title from "Some updates" to "Data refactor updates" on Oct 11, 2022
tylerflex (Collaborator) commented Oct 11, 2022

I just pushed a commit to this branch to refactor things a tiny bit.

Changes look good. Thanks for working on them. A comment:

  • Handling the leading slash automatically in group_path is probably ok, although we should mention in the docstring that the leading forward slash is optional.

To respond to your questions:

hdf5 file is currently always opened in write mode. This prevents storing multiple models in different groups in the same file (see last point above). We don't necessarily have to support this; it's just convenient. It could also provide a way to fully load an individual monitor data re #545 as opposed to individual data arrays only (what I'm thinking about is modifying the SimulationData.to_hdf5 method a bit to call each of the MonitorData-s' to_hdf5 with the corresponding group name). But basically I see three options:

  • Add a "mode" kwarg to to_hdf5 similar to xarray to_netcdf.
  • Always open in write mode if no custom group path provided, and in append mode if custom group path provided.
  • Just scrap this altogether, in which case I think we can also remove the group_path kwarg since like I say above if we can't store more than one model in the same file the group_path is not very useful.

Would it be possible to just change the h5py.File() mode from "w" to "a" and have it work the way we need? Would the only issue be that if the user actually wants to overwrite the file after writing to it, they would need to do something special?

I agree we should support writing separately to the same file (re #545), at least eventually. If there is not a big need for it now, we could try to remove group_path altogether and keep "w" mode, and then later work on features related to #545. Otherwise, maybe adding the mode kwarg (default "a") and passing it to h5py.File() in to_hdf5() is the most generalizable way to handle it? It does require a little more thought from the user. I'm fine with this change though, as we can just write good documentation to explain when to use each mode and leave the most common case as the default.
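Something roughly like this is what I mean by the mode kwarg option (just a sketch, not the actual implementation; model.json() is assumed to produce the model's json string as the base model does today):

import h5py

def to_hdf5(model, fname: str, group_path: str = "", mode: str = "a") -> None:
    # "a" appends, so several models can live under different groups of the same file;
    # mode="w" would overwrite the whole file, which is the current behavior
    with h5py.File(fname, mode) as f:
        group = f.require_group(group_path) if group_path else f
        group.attrs["json"] = model.json()  # model json stored inside the group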

Some issues with inheritance and the return types of methods. Namely, I can for example call FieldDataArray.interp, but this returns a regular xr.DataArray. Similarly, I can call ModeSolverData.sel_mode_index, but this returns a FieldDataset instead of FieldData. So:

Yea, more generally the inheritance regarding Dataset and MonitorData did not go as smoothly as I expected.

  • A lot of the functionality is split across the two files. Even though the MonitorData should work as before and the Datasets still have some limited functionality, it's quite a bit harder to reason about all of it as a developer. It also means that adding a new data structure will require writing two classes instead of one.
  • The MonitorData objects inevitably had to inherit from both their Dataset counterparts and their MonitorData superclasses, which makes it more complicated to reason about what methods they have.
  • Some of the methods, like sel_mode_index(), had to return more "pure" objects, like Datasets, because it wasn't immediately clear how to have them return MonitorData objects. More discussion below.

Regarding DataArray, I didn't realize until now, but it does look like DataArray.sel returns xr.DataArray. This is probably expected and actually makes sense if we define specific DataArray objects by their dims. I don't think it should affect anything practically, at least at this stage.

  • Do we make wrappers for some of the methods that we use, e.g. FieldDataArray.interp that calls the xarray interp but returns a FieldDataArray?

Personally, I don't see a need to change it. The DataArray subclasses are mostly templates for making sure these objects have well-defined dims. Once we modify them, they lose their meaning in that regard. Basically I kind of think DataArray.sel -> xr.DataArray is expected behavior on some level, so I'm ok with it. It also significantly cleans up the DataArray internals to have it set up how it is now (not hacking the init method). Adding wrappers to DataArray is possible but complicates things, so I'm leaning towards not doing that until we have a clear reason to.

  • Do we wrap sel_mode_index in ModeSolverData, leave it as is (only in ModeSolverDataset), or remove altogether? On some level I feel like we should either have a full-fledged sel wrapper or nothing at all. But it is convenient sometimes.

I think we could implement a wrapper in MonitorData if we want to, but note that we'll then also need to define a fictitious FieldMonitor, which doesn't seem very natural. We did it before, but only out of necessity. Maybe a FieldDataset makes more sense here, although we would lose, say, the ability to unroll symmetry on that object, so I'm not sure if that's a deal breaker. We could unroll the symmetry before selecting, but then there's the data storage issue. If you think it would come in handy to have this wrapper, we can add it without issue; we just need to make some arbitrary choices on the FieldMonitor, such as its name (although we could just copy the name from the original monitor). Another option would be to simply return a ModeSolverMonitor with all of the mode_index dims reduced to a single coordinate.

Maybe writing our own wrappers for sel etc. is the way to go eventually. We could add these at both the MonitorData and DataArray levels. We could control things a lot more generally, for example:

  • How to more gracefully handle interpolating with a single coordinate in a dimension, or interpolating out of bounds.
  • MonitorData.sel could potentially generalize sel_mode_index() to selecting across all Field objects in the object. (What would the return class be, though?)

My opinion is that we should maybe limit the scope of this data refactor and try to merge in the improvements and simplifications we have before trying to add more to it; otherwise it might never get merged. So perhaps in that spirit, I could see working on the following things at a later time (even soon after merging the refactor PR), unless there's a strong need for them now:

  • saving and accessing from subgroups in the hdf5 file
  • custom wrappers for data selection.
The other thing I'm not 100% happy with right now is that the Dataset / MonitorData inheritance is a bit confusing. Not sure if you have thoughts on that.

Feel free to merge this into feature/data_refactor_1_8 btw.

tylerflex force-pushed the feature/data_refactor_1_8 branch from 333405a to ee1ae28 on October 11, 2022 09:46
momchil-flex force-pushed the momchil/data_refactor_1_8 branch from c1149ee to 0179d12 on October 11, 2022 22:39
momchil-flex force-pushed the momchil/data_refactor_1_8 branch from 0179d12 to 26f7441 on October 11, 2022 22:53
momchil-flex (Collaborator, Author) commented:

I pushed another update. With this and my fixes to the backend, most backend tests now pass (hopefully fixing the last few that don't will only require backend fixes). I kept this PR open for you to have a look one last time though.

I just pushed a commit to this branch to refactor things a tiny bit.

Ugh... I just now realized this, and I have lost those changes. Earlier, I pulled your branch, which had conflicts with this branch because you also made some commits there yesterday. Locally, I rebased my branch against your branch and force-pushed it here. So I don't know what was in that commit; it's lost here now, but you should still have it locally. See if there's something you still want to change.

  • Handling the leading slash automatically in group_path is probably ok, although in the docstring probably we should mention that the leading forward slash is optional.

I added to the docstring.

I agree we should support writing separately to the same file (re #545), at least eventually. If there is not a big need for it now, we could try to remove group_path altogether and keep "w" mode, and then later work on features related to #545. Otherwise, maybe adding the mode kwarg (default "a") and passing it to h5py.File() in to_hdf5() is the most generalizable way to handle it? It does require a little more thought from the user. I'm fine with this change though, as we can just write good documentation to explain when to use each mode and leave the most common case as the default.

So you had added the commit referenced in #545, which allows separate MonitorData to be loaded from a simulation data file. However, this conflicted with my updates, because it still assumed that the json is written at the base level, while in my (previous) branch the json was written at the group_path level. I decided to refactor around your approach though, because it's clean-ish and I think it provides enough functionality. In summary:

  • to_hdf5 will always overwrite the file and write the json of the model at the base level. So we cannot write multiple models to the same file. Because of this, I also removed the group_path kwarg in to_hdf5 as it's not really useful.
  • from_hdf5 can selectively load a sub-element of the model given a group_path as per your update (I moved the test to the sim_data test file).
  • To write multiple models to a file (something I do on the backend), one could just create a simple container class and use to_hdf5 on that one (a sketch below).
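Something like this is what I have in mind for the container (a hypothetical sketch; the class name, field name, import paths, and group names are made up, and field_data_0 / field_data_1 stand for existing data objects):

from typing import Tuple

from tidy3d.components.base import Tidy3dBaseModel  # import path assumed
from tidy3d.components.data.monitor_data import FieldData  # import path assumed

class DataBatch(Tidy3dBaseModel):
    """Hypothetical container bundling several models into one file."""
    data: Tuple[FieldData, ...]

batch = DataBatch(data=(field_data_0, field_data_1))
batch.to_hdf5("batch.hdf5")  # one file, each model written under its own group
# a single model can then be loaded back through the from_hdf5 group_path machinery
first = FieldData.from_hdf5("batch.hdf5", group_path="data/data_0")  # group name assumed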

Some issues with inheritance and the return types of methods. Namely, I can for example call FieldDataArray.interp, but this returns a regular xr.DataArray. Similarly, I can call ModeSolverData.sel_mode_index, but this returns a FieldDataset instead of FieldData. So:

Yea, more generally the inheritance regarding Dataset and MonitorData did not go as smoothly as I expected.

Summary of my opinions and changes regarding this whole issue:

  • We don't wrap any xr.DataArray methods. DataArray.sel, DataArray.interp, etc. return xr.DataArray and it's just something we have to live with.
  • Ideally, Dataset methods shouldn't return a new type of Dataset/DataArray. I think this was only happening in sel_mode_index. I have updated that method, and now instead of constructing a FieldDataset for a ModeSolverDataset, it's just a thin wrapper that applies .sel(mode_index=[mode_index]) to all field components (sketched below). So it simply returns a ModeSolverDataset if called from a ModeSolverDataset, and a ModeFieldData if called from a ModeSolverData (note that the square brackets in mode_index=[mode_index] ensure that the dimension is preserved).
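The gist of the new method, as a sketch (the attribute name field_components and the copy(update=...) call are illustrative only; the real method also takes care of returning ModeFieldData when called from ModeSolverData):

def sel_mode_index(self, mode_index: int):
    """Select a single mode index from every field component."""
    # the list [mode_index] keeps the mode_index dimension instead of dropping it
    update = {
        name: comp.sel(mode_index=[mode_index])
        for name, comp in self.field_components.items()
    }
    return self.copy(update=update)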

tylerflex (Collaborator) commented Oct 12, 2022

  • No worries on losing the commit, it was pretty simple. I was able to recover it locally and then git cherry-picked it into feature/data_refactor_1_8. Here are the changes.

Handling the leading slash automatically in group_path is probably ok, although in the docstring probably we should mention that the leading forward slash is optional.
I added to the docstring.

I wonder if it makes sense to do the group_path a little differently, as group_path: List[str] = None.
So if one wanted to grab, say, the field data from a sim_data file, it would be something like

FieldData.from_hdf5('sim_data.hdf5', group_path=['data', 'data_0'])

instead of

FieldData.from_hdf5('sim_data.hdf5', group_path='data/data_0')

This would make things cleaner on the backend and possibly the frontend: no ambiguity or string parsing. Just a thought.
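i.e. on our side it would just be a join, something like (sketch):

group_path = ["data", "data_0"]
h5_path = "/" + "/".join(group_path)  # -> "/data/data_0", no string parsing needed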

to_hdf5 will always overwrite the file and write the json of the model at the base level. So we cannot write multiple models to the same file. Because of this, I also removed the group_path kwarg in to_hdf5 as it's not really useful.
from_hdf5 can selectively load a sub-element of the model given a group_path as per your update (I moved the test to the sim_data test file).
To write multiple models to file (something I do on the backend), one could just create a simple container class and use to_hdf5 on that one.

This sounds fine to me. We can always change things later to support more complex behavior.

Regarding your opinions on the inheritance:

We don't wrap any xr.DataArray methods. DataArray.sel, DataArray.interp, etc. return xr.DataArray and it's just something we have to live with.

Agree. Until we need something more sophisticated, let's not mess with it.

Ideally, Dataset methods shouldn't return a new type of Dataset/DataArray. I think this was only happening in sel_mode_index. I have updated that method and now instead of constructing a FieldDataset for a ModeSolverDataset,
it's just a thin wrapper on applying .sel(mode_index=[mode_index]) on all field components. So this simply returns a ModeSolverDataset if called from a ModeSolverDataset, and a ModeFieldData if called from a ModeSolverData (note that the square brackets in mode_index=[mode_index] ensure that the dimension is preserved).

This sounds good to me.

Well, sounds like things are shaping up. Should we go ahead and merge your PR into feature/data_refactor_1_8, or do you want to continue working on it separately?

Pending decisions / questions seem to be:

  • Dataset -> MonitorData inheritance is a bit clunky... maybe we should just live with it at this point.
  • Keep SimulationData.data as a tuple, or make it a dict?
  • Change how the user specifies group_path?

tylerflex force-pushed the feature/data_refactor_1_8 branch 2 times, most recently from a68e7d5 to c592a11, on October 14, 2022 09:32
tylerflex deleted the momchil/data_refactor_1_8 branch on May 16, 2023 14:21