Improve load-functionality: load multiple files into one DNDarray #900

Closed
2 of 6 tasks
coquelin77 opened this issue Jan 17, 2022 · 4 comments · Fixed by #1388

@coquelin77
Member

coquelin77 commented Jan 17, 2022

Feature functionality
In HPC data analytics we often encounter the problem that the data to be processed is not stored in one large .h5 file, but in a large number of individual files (e.g., CSV files or images). We therefore need a load routine for this situation, i.e., a routine that loads multiple files into a single DNDarray (some balancing will, of course, be needed at the end).

Note: Using the existing single-file load routines is not an option, because extensive stacking of DNDarrays along the split axis is very expensive!

Example of a scientific data set split into a large number of files: http://cdn.gea.esac.esa.int/Gaia/

An idea for the function signature:

load(foldername,  # path of a folder containing multiple .csv, .npy, .h5 etc. files
     dtype,
     balance,     # (re)balance after loading
     split,       # axis along which data is split (and thus concatenated)
     device,
     comm)

Pseudocode would be something like this:

file_list = list of all files contained in the directory foldername, as a list of strings
file_list = sort(file_list)  # depending on argument "order"
local_file_list = part of file_list that belongs to the current MPI process
local_array_list = [load(file).to(device) for file in local_file_list]
local_array = stack(local_array_list)
array = DNDarray(local_array, ...)
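
The first three steps of the pseudocode (list, sort, pick the local part) can be sketched in plain Python. This is only an illustration, not the implementation that closed this issue: the helper names `list_sorted_files`, `local_slice` and `local_file_list` are hypothetical, and the even, contiguous split by file count is an assumption. Contiguous blocks are used so that concatenating the per-process data along the split axis preserves the sorted file order.

```python
import os


def list_sorted_files(foldername, extension=".npy"):
    """Return the sorted paths of all files in foldername with the given extension."""
    return sorted(
        os.path.join(foldername, name)
        for name in os.listdir(foldername)
        if name.endswith(extension)
    )


def local_slice(n_files, rank, size):
    """Return (start, stop) of the contiguous block of files owned by `rank`.

    The first `n_files % size` ranks get one extra file each, so block
    lengths differ by at most one.
    """
    base, extra = divmod(n_files, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def local_file_list(file_list, rank, size):
    """Return the part of file_list that belongs to process `rank` out of `size`."""
    start, stop = local_slice(len(file_list), rank, size)
    return file_list[start:stop]
```

In an MPI program, `rank` and `size` would come from the communicator (for example `comm.Get_rank()` and `comm.Get_size()` in mpi4py). Each process would then load only its own files and combine them locally before the distributed array is assembled.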

Todos:

  • basic load functionality as described above for .npy-files
  • unittests for this
  • scaling tests (cluster?) and performance optimization
  • think about extension to .csv (?)
  • think about extension to images (?)
  • think about distribution of file list to processes depending on file sizes (?)
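
For the last item, one possible heuristic (an assumption for discussion, not something decided in this issue) is to keep the blocks contiguous but place the boundaries according to cumulative file size instead of file count. The function name `partition_by_size` is hypothetical. The sizes could be obtained with `os.path.getsize`.

```python
def partition_by_size(file_sizes, nprocs):
    """Assign files to processes in contiguous, order-preserving blocks.

    Each file goes to the process whose share of the total size contains
    the midpoint of that file, so per-process loads are roughly equal.
    Returns one list of file indices per process.
    """
    parts = [[] for _ in range(nprocs)]
    total = sum(file_sizes)
    seen = 0  # cumulative size of all files before the current one
    for index, size in enumerate(file_sizes):
        if total > 0:
            midpoint = seen + size / 2
            rank = min(int(midpoint * nprocs / total), nprocs - 1)
        else:
            # All files are empty: no size information to balance on.
            rank = 0
        parts[rank].append(index)
        seen += size
    return parts
```

With sizes `[100, 10, 10, 10, 10, 60]` and two processes, the first process gets only the first file and the second process gets the remaining five, so both handle 100 units. A split by file count would instead give them 120 and 80 units.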
ClaudiaComito added the "enhancement" and "I/O" labels Apr 4, 2022
mrfh92 self-assigned this Jun 23, 2023
ClaudiaComito added the "good first issue" and "student project" labels and removed the "good first issue" label Aug 14, 2023
mrfh92 changed the title from "Open multiple files into a single DNDarray" to "Improve load-functionality: load multiple files into one DNDarray" Aug 14, 2023
@mrfh92
Collaborator

mrfh92 commented Aug 14, 2023

@krajsek (tagged you because I closed old #740 where you were assigned)

@mrfh92
Collaborator

mrfh92 commented Aug 14, 2023

reviewed and updated within #1109

Contributor

github-actions bot commented Nov 3, 2023

@LScheib LScheib removed their assignment Dec 21, 2023
@mrfh92 mrfh92 self-assigned this Jan 11, 2024
@mrfh92
Collaborator

mrfh92 commented Jan 11, 2024

(assigned to me as reservation for @Reisii)


5 participants