
Allow loading/download part of the data #545

Closed
xin-flex opened this issue Oct 8, 2022 · 15 comments

Comments

@xin-flex
Contributor

xin-flex commented Oct 8, 2022

Is your feature request related to a problem? Please describe.
The hdf5 data size is sometimes too large to fit into memory.

Describe the solution you'd like

  • Be able to load only part of the hdf5 data into Python environment (needed)
  • Be able to download only part of the hdf5 data from server (nice to have)
@tylerflex
Collaborator

Note: in #535 you will be able to load an object stored within an hdf5 file (without loading everything) by supplying a path to from_hdf5(), for example:

flux_data = FluxData.from_hdf5('sim_data.hdf5', group_path='/data/3/')

source

@momchil-flex
Collaborator

Actually, this is not exactly how it works; it is not equivalent to the load_from_group / save_from_group in current develop. That is because current develop preserves the entire model structure in the hdf5, so if I save, for example, a SimulationData object that contains a MonitorData, I can load that MonitorData from its corresponding group, which has exactly the same structure/data as if I had called MonitorData.to_hdf5.

In the reorg, only the json string of the model at the top level is stored. So in the SimulationData hdf5, only the SimulationData json is available, and you cannot load the MonitorData individually. The reason I introduced the group_path kwarg is so that you can store multiple models in the same file (something we do on the backend), e.g. something like:

# store each monitor's data under its own group in the same file
for monitor_data in monitor_data_list:
    monitor_data.to_hdf5("my.hdf5", group_path=monitor_data.monitor.name)

In this case, you can selectively load a single one of those from the "my.hdf5" file like you say, e.g. flux_data = FluxData.from_hdf5('my.hdf5', group_path='/flux_monitor/').

We should think about whether and how to handle this from a SimulationData file though.

@momchil-flex
Collaborator

momchil-flex commented Oct 11, 2022

Note: in #535 you will be able to load an object stored within an hdf5 file (without loading everything) by supplying a path to from_hdf5() for example

flux_data = FluxData.from_hdf5('sim_data.hdf5', group_path='/data/3/')

Note that your test works because you're loading a FluxDataArray which has a simple from_hdf5 method that directly loads the data only (it doesn't use Tidy3dBaseModel.from_hdf5). Still, it means that without us having to do anything, the user can fairly easily load DataArrays if not whole datasets.
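For reference, reading a single array out of a large hdf5 file is cheap with plain h5py, something along these lines (a rough sketch only; the group and dataset names here are made up, and this is not the actual from_hdf5 internals):

import h5py

# open the file lazily; nothing is read into memory yet
with h5py.File("sim_data.hdf5", "r") as f_handle:
    # hypothetical path to one monitor's flux values inside the file
    dataset = f_handle["data/flux_monitor/flux"]
    # only this dataset is pulled from disk, not the rest of the file
    values = dataset[...]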

@tylerflex
Collaborator

tylerflex commented Oct 11, 2022

So I just added a test that I think illustrates what you're saying, in which we try to load a MonitorData directly out of a file containing a SimulationData. It indeed failed, because it tries to use the SimulationData json to load the monitor data and ends up getting the group path all wrong.

Is this illustrative of the problem you are explaining above?

I fixed the test by adding something to dict_from_hdf5 in which we select the correct model_dict from the top-level json string using the group_path.

To illustrate the steps:

pulse = GaussianPulse.from_file('source.hdf5', group_path='/source_time')
# first, grab the `json_dict` for the source at the top level
# then access `json_dict["source_time"]`
# also, access the hdf5 group `f_handle["source_time"]`
# proceed as normal
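Roughly, picking the right sub-dict out of the top-level json string by group_path could look like this (a sketch of the idea only, not the actual dict_from_hdf5 code; the function name is made up):

import json

def select_model_dict(json_string: str, group_path: str) -> dict:
    """Walk the top-level model dict down to the sub-model named by group_path."""
    model_dict = json.loads(json_string)
    for key in group_path.strip("/").split("/"):
        # numeric path components (e.g. '/data/3/') index into lists, others into dicts
        model_dict = model_dict[int(key)] if key.isdigit() else model_dict[key]
    return model_dict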

See the changes in this commit

Hopefully this resolves at least part of the concern?

@momchil-flex
Collaborator

I think this works now, yeah.

The second (optional) request is to be able to download only part of the data. I think this may eventually be coupled with the denormalizer.

@tylerflex
Collaborator

Yeah, for download we would probably need changes to the web API, for example?

@tylerflex
Collaborator

@dbochkov-flexcompute any thoughts on this as part of the denormalizer efforts?

@dbochkov-flexcompute
Contributor

I guess I see these options so far:

  1. In addition to all the smaller bits of data used in the web UI, save the data for each monitor into a separate file, which could be downloaded if needed. However, this would increase storage usage by another ~100% on top of monitor_data.hdf5 (100%) + denormalized pieces (~100%), and it would probably be used only in special situations.
  2. Don't add anything, and just provide an option to download the data stored in the denormalized pieces. However, those pieces are highly specific, say, the real part of the Ex component of a field at a specific frequency.
  3. Similar to 2., but download all denormalized pieces related to a specific monitor, unpack them, and merge them into a full monitor data class (see the sketch after this list). Potential issues I can see here: it could be a very large number of small files to download, and the unpacking/merging of pieces would have to happen on the user side.
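To make option 3 a bit more concrete, the client-side merging step might look roughly like this (a sketch only; the piece layout, the 'f' dimension name, and the function name are assumptions, not the actual denormalizer format):

import xarray as xr

def merge_field_component(real_pieces, imag_pieces):
    """Recombine downloaded per-frequency real/imag pieces into one complex DataArray.

    Assumes each piece is an xarray.DataArray with a length-1 'f' coordinate.
    """
    per_freq = [re + 1j * im for re, im in zip(real_pieces, imag_pieces)]
    return xr.concat(per_freq, dim="f")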

@tylerflex
Collaborator

@xin-flex any thoughts?

@xin-flex
Contributor Author

xin-flex commented Mar 17, 2023

I guess I see these options so far:

  1. In addition to all the smaller bits of data used in the web UI, save the data for each monitor into a separate file, which could be downloaded if needed. However, this would increase storage usage by another ~100% on top of monitor_data.hdf5 (100%) + denormalized pieces (~100%), and it would probably be used only in special situations.
  2. Don't add anything, and just provide an option to download the data stored in the denormalized pieces. However, those pieces are highly specific, say, the real part of the Ex component of a field at a specific frequency.
  3. Similar to 2., but download all denormalized pieces related to a specific monitor, unpack them, and merge them into a full monitor data class. Potential issues I can see here: it could be a very large number of small files to download, and the unpacking/merging of pieces would have to happen on the user side.

For 1., the method of dividing the data needs to be predefined, so it lacks flexibility, and it takes more storage.
For 2., it is probably not very useful for the user.
For 3., it seems most ideal, but is it possible to do the unpacking/merging on the server side? (Downloading a large number of small files probably has performance issues, I think?)

@dbochkov-flexcompute
Contributor

For 3., it seems most ideal, but is it possible to do the unpacking/merging on the server side? (Downloading a large number of small files probably has performance issues, I think?)

If we can do some processing on the server side, then a simpler approach is probably to just open the non-denormalized simulation data and save the data for the requested monitor separately, something like:

mnt_data = td.SimulationData.from_file('simulation.hdf5', group_path='/data/3/')
mnt_data.to_file('mnt_data.hdf5')
# then download mnt_data.hdf5 to user and delete afterwards

@xin-flex
Contributor Author

agree

@tylerflex
Collaborator

What's the status of this issue? Are we still going to work on this, or save it for later?

@xin-flex
Contributor Author

I think this is still worthwhile to have, though maybe not very urgent.

@tylerflex
Collaborator

This is partially solved by #1249. It's still not possible to download part of the hdf5 file, but I don't know if we want to allow that. Should we close?
