Creating a new subject subdataset in a large superdataset takes a long time #5283
Comments
The upcoming 0.14.0 (I do not think those changes were propagated to the maint 0.13.x releases, but you could try the most recent release) will have substantial performance improvements for such use cases.
Thanks, I will try the new release and let you know the results.
Sorry for being inactive for a while. I tried our code with datalad version 0.14.1 and datalad-ukbiobank version 0.3.1 and am still encountering a long time when adding a new subdataset. The whole process of adding one subdataset, initializing it, and updating it takes around 48 min, and 47 min of that is just for datalad create.
Ping @mih
Thx for the ping, but ATM I have nothing more to say than what is in my initial response at https://neurostars.org/t/creating-a-new-subject-subdataset-in-a-large-superdataset-takes-a-long-time/17869/2 -- I think it needs a demonstrator and some profiling to figure it out. The workflow seems to be sufficiently different from my own attempts (incremental rather than bulk addition of subdatasets) to require a solid investigation.
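One way to produce such a demonstrator might be to profile a single create call against the existing superdataset. This is only a sketch: it assumes the datalad executable resolves to a plain Python entry-point script, and the superdataset path and subject ID are placeholders.

    cd superdataset
    # Profile one create call; cProfile runs the datalad script with its arguments:
    python -m cProfile -o create.pstats "$(which datalad)" create -d . sub-XXXXXXX
    # Print the 20 most expensive calls by cumulative time:
    python -c "import pstats; pstats.Stats('create.pstats').sort_stats('cumulative').print_stats(20)"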
@Hoda1394, just to make sure -- it is datalad create that accounts for those 47 minutes?
@yarikoptic, yes it is. @mih, would you explain your bulk-addition workflow a little? It seems we could cut the number of individuals down to just those whose data we need in order to proceed, but that will still be more than the number we have right now.
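For context, one bulk-addition pattern (a sketch, not necessarily the workflow @mih used) is to create the subject datasets standalone and register them in the superdataset with a single save at the end, so the superdataset's state is only updated once. The sub-* paths are placeholders.

    cd superdataset
    # Create each subject dataset without -d, so nothing is registered in
    # (or repeatedly locked in) the superdataset during the loop:
    for sub in sub-0000001 sub-0000002 sub-0000003; do
        datalad create "$sub"
    done
    # A single save registers all new datasets as subdatasets at once:
    datalad save -m "add subject subdatasets in bulk" sub-*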
I recall us talking about this issue a few weeks ago in a video call. Have there been any updates, or further problems, @Hoda1394? |
We are using datalad and the datalad-ukbiobank extension to organize and arrange our raw data and projects from the UK Biobank. We are creating a superdataset with each subject as a subdataset that includes the relevant raw data, so the final superdataset is going to include around 500K subdatasets. The issue I am encountering now is that creating a single new subdataset takes a long time (from 1 hr to 3 hrs), which makes the whole process very inefficient. Right now we have around 33K subdatasets, and it seems that as the number of subjects grows, the time to create a new subdataset grows as well.

This is the first stage of creating the superdataset, so we are populating the subdatasets with already-downloaded data and just moving the relevant data into the subdataset folders. Here are the commands I am using in the root of the superdataset:

    datalad create -d . subdataset_path
    datalad ukb-init -d subdataset_path ID fields
    datalad ukb-update -k key -d subdataset_path
    datalad save

or datalad run to run all of the above commands. ukb-update uses a surrogate ukbfetch, so there is no communication with the UKB server at this point. The whole process runs on the university's HPC system, whose details I list below.

I checked the time consumed for each step, and the creation time is the problematic part: between 1 and 3 hours for a single subdataset right now (with 33K subdatasets). This time has increased gradually as the whole dataset has grown. My guess is that datalad checks some lock files in the large superdataset.
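For reference, the per-step timing described above can be captured with GNU time along these lines (a sketch; ID, fields, key, and subdataset_path are placeholders as above, and the commented strace line is only one way to test the lock-file guess):

    cd superdataset
    # GNU time reports wall-clock time (and peak memory with -v) per step:
    /usr/bin/time -v datalad create -d . subdataset_path
    /usr/bin/time -v datalad ukb-init -d subdataset_path ID fields
    /usr/bin/time -v datalad ukb-update -k key -d subdataset_path
    /usr/bin/time -v datalad save
    # To probe the lock-file guess, count syscalls during the slow step:
    # strace -c -f datalad create -d . another_subdataset_path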
Generally, I want to know: is there a way to make the process more efficient, and what would the datalad team suggest for such a project?
I first raised the problem on NeuroStars, which I mention here for reference.
Some system and software info:
datalad version: 0.13.3
operating system: linux x86_64
distribution: CentOS Linux/7.7.1908/Core
filesystem encoding: utf-8