Creating a new subject subdataset in a large superdataset takes a long time #5283

Open
Hoda1394 opened this issue Jan 6, 2021 · 9 comments
Labels: performance, UX

Comments

Hoda1394 commented Jan 6, 2021

We are using datalad and the datalad-ukbiobank extension to organize our raw data and projects from the UK Biobank. We are creating a superdataset with one subdataset per subject, each containing that subject's raw data, so the final superdataset will include around 500K subdatasets. The issue I am encountering is that creating a single new subdataset takes a long time (1 to 3 hours), which makes the whole process very inefficient. Right now we have around 33K subdatasets, and the creation time for a new subdataset appears to grow as the number of subjects increases.
This is the first stage of building the superdataset, so we are populating the subdatasets with already-downloaded data, simply moving the relevant files into the subdataset folders. Here are the commands I am using in the root of the superdataset:

```
datalad create -d subdataset_path
datalad ukb-init -d subdataset_path ID fields
datalad ukb-update -k key -d subdataset_path
```

followed by datalad save, or datalad run wrapping all of the above.
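(For reference, a minimal sketch of how these steps might be looped over subjects from the superdataset root; subjects.txt, the field IDs, and ukb.key are hypothetical placeholders, not the actual values used here:)

```sh
#!/bin/bash
# Hypothetical batch loop over subjects; file names and field IDs are placeholders.
set -e
FIELDS="20227_2_0 25755_2_0"                 # example UKB field IDs (assumed)
while read -r subj; do
    ds="sub-${subj}"
    datalad create -d "$ds"                  # the step reported to take 1-3 hours
    datalad ukb-init -d "$ds" "$subj" $FIELDS
    datalad ukb-update -k ukb.key -d "$ds"
    datalad save -d . -m "add subject ${subj}" "$ds"
done < subjects.txt
```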

ukb-update uses a surrogate ukbfetch, so there is no communication with the UKB server at this point. The whole process runs on the university's HPC system, whose details I list below.
I timed each step, and the creation step is the problematic part: it currently takes between 1 and 3 hours for a single subdataset (with 33K subdatasets present), and this time has grown gradually as the whole dataset gets larger. My guess is that datalad checks some lock files across the large superdataset.
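(A minimal sketch of how the per-step timing could be reproduced; sub-XXXX and the field ID are hypothetical placeholders:)

```sh
# Hypothetical timing of each step for one subject; paths are placeholders.
time datalad create -d sub-XXXX
time datalad ukb-init -d sub-XXXX XXXX 20227_2_0
time datalad ukb-update -k ukb.key -d sub-XXXX
time datalad save -d . sub-XXXX
```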
Generally, I want to know: is there a way to make the process more efficient, and what does the datalad team suggest for such a project?
I first raised this problem on NeuroStars (https://neurostars.org/t/creating-a-new-subject-subdataset-in-a-large-superdataset-takes-a-long-time/17869), which I mention here for reference.

Some system and software info:
datalad version: 0.13.3
operating system: linux x86_64
distribution: CentOS Linux/7.7.1908/Core
filesystem: utf-8

yarikoptic (Member) commented:

The upcoming 0.14.0 release will have substantial performance improvements for such use cases (I do not think those changes were propagated to the 0.13.x maintenance releases, but you could try the most recent release).

yarikoptic added the performance and UX labels on Jan 6, 2021
Hoda1394 (Author) commented Jan 6, 2021

Thanks, I will try the new release and let you know the results.

Hoda1394 (Author) commented:

Sorry for my inactivity for a while. I tried our code with datalad 0.14.1 and datalad-ukbiobank 0.3.1 and am still encountering long times when adding a new subdataset. In fact, the whole process of adding one subdataset, initializing it, and updating it takes around 48 minutes, of which 47 minutes is just datalad create.
From what I have found, this may be because git-annex tries to check the unlocked files.
I saw https://git-annex.branchable.com/todo/git_smudge_clean_interface_suboptiomal/, but I am not sure whether such a change would solve the issue. Do you have any suggestions?
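(One way to check whether the smudge/clean filter setup is involved, as a sketch under the assumption that the filter process is the bottleneck; sub-XXXX is a placeholder:)

```sh
# Inspect which git-annex filter is configured in the new subdataset
git config --get filter.annex.process   # long-running filter (newer git-annex)
git config --get filter.annex.clean     # per-file clean filter (older setups)

# Rerun the slow step with debug logging to see where the time goes
datalad -l debug create -d sub-XXXX 2> create-debug.log
```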

bpoldrack (Member) commented:

Ping @mih

mih (Member) commented Apr 16, 2021

Thx for the ping, but ATM I have nothing more to say than what is in my initial response: https://neurostars.org/t/creating-a-new-subject-subdataset-in-a-large-superdataset-takes-a-long-time/17869/2

I think it needs a demonstrator and some profiling to figure it out. The workflow seems to be sufficiently different from my own attempts (incremental rather than bulk addition of subdatasets) to require a solid investigation.

yarikoptic (Member) commented:

@Hoda1394, just to make sure -- is it datalad create -d subdataset_path (which should not add to the current superdataset and which I expect to be fast -- did you miss a . in your example?) or datalad create -d . subdataset_path (which would add it to the current superdataset) that takes a long time?
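(To spell out the two invocations being contrasted; subdataset_path is a placeholder:)

```sh
# Form 1: -d points at the new location itself; the dataset is created
# standalone and is NOT registered in the current superdataset
datalad create -d subdataset_path

# Form 2: -d . names the current superdataset; the new dataset is created
# AND registered as a subdataset of it
datalad create -d . subdataset_path
```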

Hoda1394 (Author) commented Apr 17, 2021

@yarikoptic, yes, it is datalad create -d subdataset_path, which does not add the subdataset to the superdataset root; I then add it with datalad save, or use datalad run for the whole process. But it seems that even without being added to the superdataset explicitly, it still gets registered as a git submodule: I could see it added to .gitmodules, while datalad subdatasets subdataset_path returns nothing before saving. So I think that is why git-annex checks a long list of lock files at creation time.
I also checked the time with datalad create -d . subdataset_path; it increases to more than 2 hours.
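(A quick sketch of how this registration state could be inspected; subdataset_path is a placeholder:)

```sh
# Is the new subdataset already mentioned in the superdataset's .gitmodules?
git config -f .gitmodules --get-regexp 'submodule\..*\.path'

# What does git itself report for that path?
git submodule status -- subdataset_path

# datalad's view, which reportedly returns nothing before `datalad save`
datalad subdatasets subdataset_path
```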

@mih, would you explain your bulk-addition workflow a little? It seems we need to cut the number of individuals down to just those whose data we need in order to proceed, but that will still be more than the number we have right now.

adswa (Member) commented Jul 16, 2021

I recall us talking about this issue a few weeks ago in a video call. Have there been any updates, or further problems, @Hoda1394?

yarikoptic (Member) commented:

I believe this issue is likely related to #5521, for which I have not finished my attempted fix (#5576), so while working out a fix it might be worth keeping both use cases in mind.
