
Allow loading/download part of the data #545

Closed
xin-flex opened this issue Oct 8, 2022 · 15 comments

Comments

@xin-flex
Contributor

xin-flex commented Oct 8, 2022

Is your feature request related to a problem? Please describe.
The hdf5 data size is sometimes too large to fit into memory.

Describe the solution you'd like

  • Be able to load only part of the hdf5 data into Python environment (needed)
  • Be able to download only part of the hdf5 data from server (nice to have)
@tylerflex
Collaborator

Note: in #535 you will be able to load an object stored within an hdf5 file (without loading everything) by supplying a path to from_hdf5(), for example:

flux_data = FluxData.from_hdf5('sim_data.hdf5', group_path='/data/3/')

source

@momchil-flex
Collaborator

Actually, this is not exactly how it works; it is not equivalent to the load_from_group / save_from_group in current develop. That is because current develop preserves the entire model structure in the hdf5, so if I save, for example, a SimulationData object that contains a MonitorData, I can load that MonitorData from its corresponding group, which has exactly the same structure/data as if I had called MonitorData.to_hdf5.

In the reorg, only the json string of the model at the top level is stored. So in the SimulationData hdf5, only the SimulationData json is available, and you cannot load the MonitorData individually. The reason I introduced the group_path kwarg is so that you can store multiple models in the same file (something we do on the backend), e.g. something like:

# store each monitor's data under its own group in the same file
for monitor_data in monitor_data_list:
    monitor_data.to_hdf5("my.hdf5", group_path=monitor_data.monitor.name)

In this case, you can selectively load a single one of those from the "my.hdf5" file like you say, e.g. flux_data = FluxData.from_hdf5('my.hdf5', group_path='/flux_monitor/').

We should think about whether and how to handle this from a SimulationData file though.

@momchil-flex
Collaborator

momchil-flex commented Oct 11, 2022

Note: in #535 you will be able to load an object stored within an hdf5 file (without loading everything) by supplying a path to from_hdf5() for example

flux_data = FluxData.from_hdf5('sim_data.hdf5', group_path='/data/3/')

Note that your test works because you're loading a FluxDataArray which has a simple from_hdf5 method that directly loads the data only (it doesn't use Tidy3dBaseModel.from_hdf5). Still, it means that without us having to do anything, the user can fairly easily load DataArrays if not whole datasets.
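For reference, reading a single array out of a large hdf5 file is cheap with plain h5py, something along these lines (a rough sketch only; the group and dataset names here are made up, and this is not the actual from_hdf5 internals):

import h5py

# open the file lazily; nothing is read into memory yet
with h5py.File("sim_data.hdf5", "r") as f_handle:
    # hypothetical path to one monitor's flux values inside the file
    dataset = f_handle["data/flux_monitor/flux"]
    # only this dataset is pulled from disk, not the rest of the file
    values = dataset[...]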

@tylerflex
Collaborator

tylerflex commented Oct 11, 2022

So I just added a test that I think illustrates what you're saying, in which we try to load a MonitorData directly out of a file containing a SimulationData. It indeed failed, because it tries to use the SimulationData json to load the monitor data and ends up getting the group path all wrong.

Is this illustrative of the problem you are explaining above?

I fixed the test by adding something to dict_from_hdf5 in which we select the correct model_dict from the top-level json string using the group_path.

To illustrate the steps:

pulse = GaussianPulse.from_file('source.hdf5', group_path='/source_time')
# first, grab the `json_dict` for the source at the top level
# then access `json_dict["source_time"]`
# also, access the hdf5 group `f_handle["source_time"]`
# proceed as normal
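Roughly, picking the right sub-dict out of the top-level json string by group_path could look like this (a sketch of the idea only, not the actual dict_from_hdf5 code; the function name is made up):

import json

def select_model_dict(json_string: str, group_path: str) -> dict:
    """Walk the top-level model dict down to the sub-model named by group_path."""
    model_dict = json.loads(json_string)
    for key in group_path.strip("/").split("/"):
        # numeric path components (e.g. '/data/3/') index into lists, others into dicts
        model_dict = model_dict[int(key)] if key.isdigit() else model_dict[key]
    return model_dict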

See the changes in this commit

Hopefully this resolves at least part of the concern?

@momchil-flex
Collaborator

I think this works now, yeah.

The second (optional) request is to be able to download only part of the data. I think this may eventually be coupled with the denormalizer.

@tylerflex
Collaborator

Yeah, for download we would probably need changes to the web API, for example?

@tylerflex
Collaborator

@dbochkov-flexcompute any thoughts on this as part of the denormalizer efforts?

@dbochkov-flexcompute
Contributor

I guess I see these options so far:

  1. In addition to all the smaller bits of data used in the web UI, save the data for each monitor into a separate file, which could be downloaded if needed. However, this would increase storage usage by another ~100% on top of monitor_data.hdf5 (100%) + denormalized pieces (~100%), and it would probably be used only in special situations.
  2. Don't add anything, and just provide an option to download the data stored in the denormalized pieces. However, those pieces are highly specific, say, the real part of the Ex component of a field at a specific frequency.
  3. Similar to 2., but download all denormalized pieces related to a specific monitor, unpack them, and merge them into a full monitor data class (see the sketch after this list). Potential issues I can see here: it could be a very large number of small files to download, and the unpacking/merging of pieces would have to happen on the user side.
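To make option 3 a bit more concrete, the client-side merging step might look roughly like this (a sketch only; the piece layout, the 'f' dimension name, and the function name are assumptions, not the actual denormalizer format):

import xarray as xr

def merge_field_component(real_pieces, imag_pieces):
    """Recombine downloaded per-frequency real/imag pieces into one complex DataArray.

    Assumes each piece is an xarray.DataArray with a length-1 'f' coordinate.
    """
    per_freq = [re + 1j * im for re, im in zip(real_pieces, imag_pieces)]
    return xr.concat(per_freq, dim="f")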

@tylerflex
Collaborator

@xin-flex any thoughts?

@xin-flex
Contributor Author

xin-flex commented Mar 17, 2023

I guess I see these options so far:

  1. In addition to all the smaller bits of data used in the web UI, save the data for each monitor into a separate file, which could be downloaded if needed. However, this would increase storage usage by another ~100% on top of monitor_data.hdf5 (100%) + denormalized pieces (~100%), and it would probably be used only in special situations.
  2. Don't add anything, and just provide an option to download the data stored in the denormalized pieces. However, those pieces are highly specific, say, the real part of the Ex component of a field at a specific frequency.
  3. Similar to 2., but download all denormalized pieces related to a specific monitor, unpack them, and merge them into a full monitor data class. Potential issues I can see here: it could be a very large number of small files to download, and the unpacking/merging of pieces would have to happen on the user side.

For 1., the method of dividing the data needs to be predefined, so it lacks flexibility, and it takes more storage.
For 2., it is probably not very useful for the user.
For 3., it seems most ideal, but is it possible to do the unpacking/merging on the server side? (Downloading a large number of small files probably has performance issues, I think?)

@dbochkov-flexcompute
Contributor

For 3., it seems most ideal, but is it possible to do the unpacking/merging on the server side? (Downloading a large number of small files probably has performance issues, I think?)

If we can do some processing on the server side, then a simpler approach is probably to just open the non-denormalized simulation data and save the data for the requested monitor separately, something like:

mnt_data = td.SimulationData.from_file('simulation.hdf5', group_path='/data/3/')
mnt_data.to_file('mnt_data.hdf5')
# then download mnt_data.hdf5 to user and delete afterwards

@xin-flex
Contributor Author

agree

@tylerflex
Collaborator

What's the status of this issue? Are we still going to work on this, or save it for later?

@xin-flex
Contributor Author

I think this is still worthwhile to have, though maybe not very urgent.

@tylerflex
Collaborator

This is partially solved by #1249. It's still not possible to download part of the hdf5 file, but I don't know if we want to allow that. Should we close?
