Improve load-functionality: load multiple files into one DNDarray #900

coquelin77 · 2022-01-17T08:52:03Z · edited by Reisii

Feature functionality
In HPC data analytics we often encounter the problem that the data to be processed is not stored in one large .h5 file but in many individual files (e.g., CSV files or images). It is therefore necessary to implement a load routine for this situation, i.e., a routine that loads multiple files into a single DNDarray (of course, some balancing has to be done at the end).

Note: Using the existing single-file load routines is not an option, because extensive stacking of DNDarrays along the split axis is very expensive!

Example of a scientific data set split across a large number of files: http://cdn.gea.esac.esa.int/Gaia/

An idea for the function signature:

load(foldername,  # path of a folder containing multiple .csv, .npy, .h5, etc. files
     dtype,
     balance,     # (re)balance after loading
     split,       # axis along which data is split (and thus concatenated)
     device,
     comm)

The pseudocode would be something like this:

file_list = list of all files contained in the directory foldername, as strings
file_list = sort(file_list)  # depending on argument "order"
local_file_list = part of file_list that belongs to the current MPI process
local_array_list = [load(file).to(device) for file in local_file_list]
local_array = stack(local_array_list)  # join the local pieces along the split axis
array = DNDarray(local_array, ...)
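
A minimal sketch of the first three pseudocode steps (listing, sorting, and selecting the files of the current process), in plain Python without MPI. The helper names `list_sorted_files` and `local_slice` and the `extension` parameter are hypothetical and only serve as illustration; they are not part of Heat. In a real run, `rank` and `size` would come from the communicator, and this is only one possible way to split the list.

```python
import os


def list_sorted_files(foldername, extension=".npy"):
    """Return the sorted paths of all files in `foldername` ending in `extension`."""
    names = sorted(n for n in os.listdir(foldername) if n.endswith(extension))
    return [os.path.join(foldername, n) for n in names]


def local_slice(n_files, rank, size):
    """Return (start, stop) of the contiguous block of files owned by `rank`.

    The first `n_files % size` ranks get one extra file, so the number of
    files per process differs by at most one.
    """
    base, extra = divmod(n_files, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


if __name__ == "__main__":
    # Simulate 4 processes sharing 10 files (hypothetical file names).
    files = [f"part_{i:03d}.npy" for i in range(10)]
    for rank in range(4):
        start, stop = local_slice(len(files), rank, 4)
        print(rank, files[start:stop])
```

Keeping each process's files contiguous in the sorted list means that joining the local arrays along the split axis preserves the global file order.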

Todos:

basic load functionality as described above for .npy-files
unit tests for this
scaling tests (cluster?) and performance optimization
think about extension to .csv (?)
think about extension to images (?)
think about distribution of file list to processes depending on file sizes (?)
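
For the last point, one possible approach (a sketch only, not an agreed design) is to keep the files in order but choose the process boundaries by cumulative file size instead of file count. The function name `assign_by_size` is hypothetical; the sizes could be obtained with `os.path.getsize`, and file size on disk is assumed to be a reasonable proxy for the amount of data loaded.

```python
def assign_by_size(file_sizes, size):
    """Map each file (given by its size in bytes) to a rank, keeping file order.

    A file is assigned to the rank whose equal share of the total bytes
    contains the midpoint of that file's byte range. The resulting ranks
    are non-decreasing, so every process gets a contiguous block of files.
    """
    total = sum(file_sizes)
    if total <= 0:
        raise ValueError("total file size must be positive")
    ranks = []
    offset = 0
    for nbytes in file_sizes:
        midpoint = offset + nbytes / 2
        ranks.append(min(size - 1, int(midpoint * size / total)))
        offset += nbytes
    return ranks


if __name__ == "__main__":
    # One large file followed by two small ones, shared by 2 processes.
    print(assign_by_size([300, 50, 50], 2))
```

With these example sizes, a split by file count would give the first process 350 of the 400 bytes, whereas the size-aware split gives it 300. A final rebalancing step may still be needed, since the balance is limited by the granularity of the files.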


mrfh92 · 2023-08-14T08:31:28Z

@krajsek (tagged you because I closed old #740 where you were assigned)

mrfh92 · 2023-08-14T08:31:56Z

reviewed and updated within #1109

github-actions · 2023-11-03T16:16:53Z

Branch features/900-Improve_load-functionality_load_multiple_files_into_one_DNDarray created!

mrfh92 · 2024-01-11T15:21:36Z

(assigned to me as reservation for @Reisii)

ClaudiaComito added the enhancement and I/O labels Apr 4, 2022

mrfh92 self-assigned this Jun 23, 2023

ClaudiaComito unassigned mrfh92 Aug 14, 2023

ClaudiaComito added the good first issue and student project labels and removed the good first issue label Aug 14, 2023

mrfh92 mentioned this issue Aug 14, 2023

Improved dataloaders to work more in line with torch's #740

Closed

mrfh92 changed the title ~~Open multiple files into a single DNDarray~~ Improve load-functionality: load multiple files into one DNDarray Aug 14, 2023

mrfh92 mentioned this issue Aug 21, 2023

numpy.loadtxt vs heat.load_csv #814

Closed

LScheib self-assigned this Nov 3, 2023

LScheib removed their assignment Dec 21, 2023

mrfh92 self-assigned this Jan 11, 2024

mrfh92 assigned Reisii Feb 21, 2024

Reisii mentioned this issue Feb 28, 2024

Load functionality for multiple .npy files #1388

Merged

4 tasks

mrfh92 closed this as completed in #1388 Jul 4, 2024
