Creating a new subject subdataset in a large superdataset takes a long time #5283

Open
Hoda1394 opened this issue Jan 6, 2021 · 9 comments
Labels: performance, UX

Comments

Hoda1394 commented Jan 6, 2021

We are using datalad and the datalad-ukbiobank extension to organize our raw data and projects from the UK Biobank. We are creating a superdataset with one subdataset per subject, each containing that subject's raw data, so the final superdataset will include around 500K subdatasets. The issue I am encountering is that creating a single new subdataset takes a long time (1 to 3 hours), which makes the whole process very inefficient. Right now we have around 33K subdatasets, and the creation time for a new subdataset appears to grow as the number of subjects increases.
This is the first stage of building the superdataset, so we are populating the subdatasets with already-downloaded data, simply moving the relevant files into the subdataset folders. Here are the commands I am using in the root of the superdataset:

```
datalad create -d subdataset_path
datalad ukb-init -d subdataset_path ID fields
datalad ukb-update -k key -d subdataset_path
```

followed by datalad save, or datalad run wrapping all of the above.
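(For reference, a minimal sketch of how these steps might be looped over subjects from the superdataset root; subjects.txt, the field IDs, and ukb.key are hypothetical placeholders, not the actual values used here:)

```sh
#!/bin/bash
# Hypothetical batch loop over subjects; file names and field IDs are placeholders.
set -e
FIELDS="20227_2_0 25755_2_0"                 # example UKB field IDs (assumed)
while read -r subj; do
    ds="sub-${subj}"
    datalad create -d "$ds"                  # the step reported to take 1-3 hours
    datalad ukb-init -d "$ds" "$subj" $FIELDS
    datalad ukb-update -k ukb.key -d "$ds"
    datalad save -d . -m "add subject ${subj}" "$ds"
done < subjects.txt
```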

ukb-update uses a surrogate ukbfetch, so there is no communication with the UKB server at this point. The whole process runs on the university's HPC system, whose details I list below.
I timed each step, and the creation step is the problematic part: it currently takes between 1 and 3 hours for a single subdataset (with 33K subdatasets present), and this time has grown gradually as the whole dataset gets larger. My guess is that datalad checks some lock files across the large superdataset.
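(A minimal sketch of how the per-step timing could be reproduced; sub-XXXX and the field ID are hypothetical placeholders:)

```sh
# Hypothetical timing of each step for one subject; paths are placeholders.
time datalad create -d sub-XXXX
time datalad ukb-init -d sub-XXXX XXXX 20227_2_0
time datalad ukb-update -k ukb.key -d sub-XXXX
time datalad save -d . sub-XXXX
```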
Generally, I want to know: is there a way to make the process more efficient, and what does the datalad team suggest for such a project?
I first raised this problem on NeuroStars (https://neurostars.org/t/creating-a-new-subject-subdataset-in-a-large-superdataset-takes-a-long-time/17869), which I mention here for reference.

Some system and software info:
datalad version: 0.13.3
operating system: linux x86_64
distribution: CentOS Linux/7.7.1908/Core
filesystem: utf-8

yarikoptic (Member) commented:

The upcoming 0.14.0 release will have substantial performance improvements for such use cases (I do not think those changes were propagated to the 0.13.x maintenance releases, but you could try the most recent release).

yarikoptic added the performance and UX labels on Jan 6, 2021
Hoda1394 (Author) commented Jan 6, 2021

Thanks, I will try the new release and let you know the results.

Hoda1394 (Author) commented:

Sorry for my inactivity for a while. I tried our code with datalad 0.14.1 and datalad-ukbiobank 0.3.1 and am still encountering long times when adding a new subdataset. In fact, the whole process of adding one subdataset, initializing it, and updating it takes around 48 minutes, of which 47 minutes is just datalad create.
From what I have found, this may be because git-annex tries to check the unlocked files.
I saw https://git-annex.branchable.com/todo/git_smudge_clean_interface_suboptiomal/, but I am not sure whether such a change would solve the issue. Do you have any suggestions?
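(One way to check whether the smudge/clean filter setup is involved, as a sketch under the assumption that the filter process is the bottleneck; sub-XXXX is a placeholder:)

```sh
# Inspect which git-annex filter is configured in the new subdataset
git config --get filter.annex.process   # long-running filter (newer git-annex)
git config --get filter.annex.clean     # per-file clean filter (older setups)

# Rerun the slow step with debug logging to see where the time goes
datalad -l debug create -d sub-XXXX 2> create-debug.log
```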

bpoldrack (Member) commented:

Ping @mih

mih (Member) commented Apr 16, 2021

Thx for the ping, but ATM I have nothing more to say than what is in my initial response: https://neurostars.org/t/creating-a-new-subject-subdataset-in-a-large-superdataset-takes-a-long-time/17869/2

I think it needs a demonstrator and some profiling to figure it out. The workflow seems to be sufficiently different from my own attempts (incremental rather than bulk addition of subdatasets) to require a solid investigation.

yarikoptic (Member) commented:

@Hoda1394, just to make sure -- is it datalad create -d subdataset_path (which should not add to the current superdataset and which I expect to be fast -- did you miss a . in your example?) or datalad create -d . subdataset_path (which would add it to the current superdataset) that takes a long time?
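(To spell out the two invocations being contrasted; subdataset_path is a placeholder:)

```sh
# Form 1: -d points at the new location itself; the dataset is created
# standalone and is NOT registered in the current superdataset
datalad create -d subdataset_path

# Form 2: -d . names the current superdataset; the new dataset is created
# AND registered as a subdataset of it
datalad create -d . subdataset_path
```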

Hoda1394 (Author) commented Apr 17, 2021

@yarikoptic, yes, it is datalad create -d subdataset_path, which does not add the subdataset to the superdataset root; I then add it with datalad save, or use datalad run for the whole process. But it seems that even without being added to the superdataset explicitly, it still gets registered as a git submodule: I could see it added to .gitmodules, while datalad subdatasets subdataset_path returns nothing before saving. So I think that is why git-annex checks a long list of lock files at creation time.
I also checked the time with datalad create -d . subdataset_path; it increases to more than 2 hours.
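(A quick sketch of how this registration state could be inspected; subdataset_path is a placeholder:)

```sh
# Is the new subdataset already mentioned in the superdataset's .gitmodules?
git config -f .gitmodules --get-regexp 'submodule\..*\.path'

# What does git itself report for that path?
git submodule status -- subdataset_path

# datalad's view, which reportedly returns nothing before `datalad save`
datalad subdatasets subdataset_path
```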

@mih, would you explain your bulk-addition workflow a little? It seems we need to cut the number of individuals down to just those whose data we need in order to proceed, but that will still be more than the number we have right now.

adswa (Member) commented Jul 16, 2021

I recall us talking about this issue a few weeks ago in a video call. Have there been any updates, or further problems, @Hoda1394?

yarikoptic (Member) commented:

I believe this issue is likely related to #5521, for which I have not finished my attempted fix (#5576), so while working out a fix it might be worth keeping both use cases in mind.
