Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File cache tree support #2678

Merged
merged 34 commits into from
Oct 5, 2022
Merged

File cache tree support #2678

merged 34 commits into from
Oct 5, 2022

Conversation

nellh
Copy link
Contributor

@nellh nellh commented Sep 29, 2022

This replaces the commit files cache that many operations depend on with a per git tree object cache that can be loaded in incrementally.

  • Implement a size resolver that doesn't depend on reading all trees - currently returns dummy values in some cases (128 bytes).
  • Add tests for file tree rendering step.
  • Restore previous sorting behavior to put files first and directories last.
  • Update CLI downloads to use the new tree API
  • Update browser downloads to use the new tree API
  • Show any failed files when downloading in the browser

This breaking API change for the worker moves to per git tree file requests instead of prefix slicing the entire file tree. This allows for more efficient top level requests and recursion by fetching smaller slices of the dataset as needed.
Adds better progress for file downloads as well.
@nellh nellh marked this pull request as ready for review October 3, 2022 22:43
@nellh nellh requested a review from rwblair October 3, 2022 22:43
@nellh
Copy link
Contributor Author

nellh commented Oct 3, 2022

One limitation here is this is returning the size reported by the validator again. I think we should address this on the validator side, it doesn't make sense for OpenNeuro to crawl the whole dataset just to report an accurate size when the validator has to do that anyways. I did try using git-annex info to get an estimate but this is still too slow for our largest datasets given storage performance available.

@nellh nellh force-pushed the datalad-service/tree-based-files branch from f8c1774 to 5f11372 Compare October 4, 2022 15:41
await body.pipeThrough(progress).pipeTo(writable)
} else {
apmTransaction.captureError(statusText)
return requestFailureToast()
Copy link
Collaborator

@rwblair rwblair Oct 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the following accurate?: In earlier versions if a file failed the entire download would return then and there, where as this version should only result in the current directory download being abandoned and all other paths will still be downloaded. If so I could imagine a max failed download number to have downloadNative bail out on could be a thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. You can cancel the entire download by closing the toast but otherwise I think we want to try to get as much as possible, so that if something is missed you can resume with a second attempt and only need a small subset to get a complete dataset.

That brings up a fix here though, I think this should show you if any files failed due to network or other interruptions at the end and suggest a retry. Otherwise it's easy to assume you have the whole dataset when one file is missing.

@@ -16,52 +15,42 @@ def __init__(self, store):
self.store = store
self.logger = logging.getLogger('datalad_service.' + __name__)

def on_get(self, req, resp, dataset, filename=None, snapshot='HEAD'):
def on_get(self, req, resp, dataset, filename, snapshot='HEAD'):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was overloaded so that the files request with an argument would return a file but without the file argument it would return the listing. The change here is to make filename required and move all the listing functionality to the TreeResource class.

@codecov
Copy link

codecov bot commented Oct 4, 2022

Codecov Report

Merging #2678 (ecf0804) into master (e449cfe) will decrease coverage by 0.10%.
The diff coverage is 59.45%.

@@            Coverage Diff             @@
##           master    #2678      +/-   ##
==========================================
- Coverage   39.66%   39.56%   -0.11%     
==========================================
  Files         561      562       +1     
  Lines       34273    34282       +9     
  Branches      910      903       -7     
==========================================
- Hits        13594    13563      -31     
- Misses      20554    20593      +39     
- Partials      125      126       +1     
Impacted Files Coverage Δ
packages/openneuro-app/src/scripts/apm.js 66.66% <ø> (-7.25%) ⬇️
...app/src/scripts/dataset/download/download-query.js 15.38% <0.00%> (-1.29%) ⬇️
...es/openneuro-app/src/scripts/types/dataset-file.ts 0.00% <0.00%> (ø)
packages/openneuro-cli/src/actions.js 0.00% <0.00%> (ø)
packages/openneuro-cli/src/datasets.js 61.42% <0.00%> (-1.81%) ⬇️
...ckages/openneuro-server/src/datalad/description.js 79.47% <ø> (+0.52%) ⬆️
packages/openneuro-server/src/datalad/draft.js 50.00% <ø> (+3.06%) ⬆️
...es/openneuro-server/src/graphql/resolvers/draft.js 0.00% <0.00%> (ø)
packages/openneuro-server/src/graphql/schema.js 0.00% <0.00%> (ø)
.../datalad/datalad_service/handlers/annex_objects.py 53.33% <0.00%> (+7.50%) ⬆️
... and 30 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@rwblair
Copy link
Collaborator

rwblair commented Oct 5, 2022

Does this call to get_repo_files need to be amended, wasn't sure if snapshot there was treeish:

files = get_repo_files(dataset_path, snapshot)

* output of above used by pygit2 repo.index.remove_all(...) which is a recursive delete.

@nellh
Copy link
Contributor Author

nellh commented Oct 5, 2022

Does this call to get_repo_files need to be amended, wasn't sure if snapshot there was treeish:

files = get_repo_files(dataset_path, snapshot)

The snapshot value here is the tag (or the commit hash, either way) and will return the commit tree.

@nellh
Copy link
Contributor Author

nellh commented Oct 5, 2022

Got it, fixed in b9e1243

Comment on lines 57 to 58
if filename == '.gitattributes':
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will now be covered by L53. Are we happy to also be skipping .gitmodules, .gitignore, etc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just affects if OpenNeuro displays these, so do we want to display .gitignore? We don't allow upload of it but it could exist via git push.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to exclude from display. We should probably reject git pushes with gitignores until we drop the working tree.

@nellh nellh merged commit 52fd570 into master Oct 5, 2022
@nellh nellh deleted the datalad-service/tree-based-files branch October 5, 2022 21:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants