Feature functionality
In HPC data analytics we often encounter the problem that the data does not come as one large .h5 file to be processed, but as many single files (e.g., CSV files or images). It is therefore necessary to implement a load routine for this situation, i.e., a routine for loading multiple files into a single DNDarray (of course, some balancing has to be done at the end).
Note: Using the existing single-file load routines is not an option, because extensive stacking of DNDarrays along the split axis is very expensive!
Example of a scientific data set split into many files: http://cdn.gea.esac.esa.int/Gaia/
Some idea for the function signature:
load( foldername --> path of a folder containing multiple .csv, .npy, .h5, etc. files,
dtype,
balance --> (re)balance after loading,
split --> axis along which data is split (and thus concatenated),
device,
comm)
Pseudocode would be something like this:
file_list = list of all files contained in the directory foldername, as a list of strings
file_list = sort(file_list) # depending on argument "order"
local_file_list = part of file_list that belongs to the current MPI process
local_array_list = [load(file).to(device) for file in local_file_list]
local_array = stack(local_array_list)
array = DNDarray(local_array,...)
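A minimal sketch of the pseudocode above for .npy files, assuming mpi4py and NumPy and a simple contiguous split of the sorted file list across processes. The function name, arguments, and the final wrapping of the local chunk into a DNDarray are illustrative only; the actual DNDarray construction and the subsequent (re)balancing depend on Heat internals not spelled out here.

```python
import os
import numpy as np
from mpi4py import MPI


def load_npy_from_folder(foldername, split=0, comm=MPI.COMM_WORLD):
    rank, nprocs = comm.Get_rank(), comm.Get_size()

    # list and sort all .npy files in the folder; sorting keeps the global
    # order deterministic and identical on every process
    file_list = sorted(
        os.path.join(foldername, f)
        for f in os.listdir(foldername)
        if f.endswith(".npy")
    )

    # naive distribution: every process takes a contiguous slice of the file list
    counts = [len(file_list) // nprocs + (r < len(file_list) % nprocs) for r in range(nprocs)]
    offset = sum(counts[:rank])
    local_file_list = file_list[offset : offset + counts[rank]]

    # load the local files and concatenate them along the split axis
    # (assumes every process receives at least one file)
    local_arrays = [np.load(f) for f in local_file_list]
    local_chunk = np.concatenate(local_arrays, axis=split)

    # here the local chunk would be wrapped into a DNDarray with the given
    # split axis, followed by (re)balancing if requested
    return local_chunk
```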
Todos:
basic load functionality as described above for .npy files
unit tests for this
scaling tests (cluster?) and performance optimization
think about extension to .csv (?)
think about extension to images (?)
think about distributing the file list to processes depending on file sizes (?) (see the sketch below)
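Regarding the last point, a rough sketch of one possible size-aware distribution, assuming a greedy assignment of each file to the currently least-loaded process (by accumulated byte count). Names are illustrative; note that this scheme reorders files, so preserving the global concatenation order along the split axis would need extra bookkeeping.

```python
import os


def distribute_by_size(file_list, nprocs):
    # sort files by size, largest first, so the greedy assignment stays balanced
    sized = sorted(file_list, key=os.path.getsize, reverse=True)
    buckets = [[] for _ in range(nprocs)]
    loads = [0] * nprocs
    for f in sized:
        target = loads.index(min(loads))  # least-loaded process so far
        buckets[target].append(f)
        loads[target] += os.path.getsize(f)
    return buckets  # buckets[rank] would be the local_file_list of that rank
```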