File cache tree support #2678

nellh · 2022-09-29T23:39:08Z

This replaces the commit files cache that many operations depend on with a per-git-tree-object cache that can be loaded incrementally.

Implement a size resolver that doesn't depend on reading all trees - currently returns dummy values in some cases (128 bytes).
Add tests for file tree rendering step.
Restore previous sorting behavior to put files first and directories last.
Update CLI downloads to use the new tree API.
Update browser downloads to use the new tree API.
Show any failed files when downloading in the browser.
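
For reference, the files-first ordering mentioned in the list above can be sketched roughly as below. This is only an illustration: the `FileEntry` type, the `sort_entries` helper, and the sample listing are hypothetical, and the sketch covers the files-before-directories rule only, not the BIDS-aware ordering done in the worker.

```python
from typing import TypedDict


class FileEntry(TypedDict):
    # Hypothetical shape for one entry in a listing.
    filename: str
    directory: bool


def sort_entries(entries: list[FileEntry]) -> list[FileEntry]:
    """Return entries with files first and directories last, each group by name."""
    # False sorts before True, so non-directories come first.
    return sorted(entries, key=lambda e: (e["directory"], e["filename"]))


listing: list[FileEntry] = [
    {"filename": "sub-01", "directory": True},
    {"filename": "README", "directory": False},
    {"filename": "derivatives", "directory": True},
    {"filename": "CHANGES", "directory": False},
]
sorted_names = [e["filename"] for e in sort_entries(listing)]
```

Sorting on a `(directory, filename)` tuple keeps the rule in one place, so the client does not need its own sort.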

This is a breaking API change for the worker: it moves to per-git-tree file requests instead of prefix-slicing the entire file tree. Top-level requests and recursion become more efficient because smaller slices of the dataset are fetched only as needed.
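
A minimal sketch of the per-tree idea, assuming an in-memory dict in place of the real git object store. The `TREES` data, the `get_tree` and `walk` helpers, and the entry fields are all hypothetical and are not the worker's actual API.

```python
# Hypothetical stand-in for git tree objects, keyed by tree id.
TREES = {
    "root": [
        {"filename": "dataset_description.json", "directory": False, "id": "f1"},
        {"filename": "sub-01", "directory": True, "id": "t1"},
    ],
    "t1": [
        {"filename": "anat", "directory": True, "id": "t2"},
    ],
    "t2": [
        {"filename": "sub-01_T1w.nii.gz", "directory": False, "id": "f2"},
    ],
}

# Records which trees were requested, to show how little a top-level view needs.
requests_made = []


def get_tree(tree_id):
    """Return only the direct children of one tree (one request per tree)."""
    requests_made.append(tree_id)
    return TREES[tree_id]


def walk(tree_id, prefix=""):
    """Lazily yield full file paths, fetching subtrees only when reached."""
    for entry in get_tree(tree_id):
        path = prefix + entry["filename"]
        if entry["directory"]:
            yield from walk(entry["id"], path + "/")
        else:
            yield path


# A top-level listing touches a single tree.
top_level = [e["filename"] for e in get_tree("root")]
requests_after_top = list(requests_made)

# Full recursion (for example, a download) fetches the remaining trees on demand.
all_files = list(walk("root"))
```

The top-level listing costs one request regardless of dataset size, while a full walk makes one request per directory.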

nellh · 2022-10-03T22:45:23Z

One limitation here is that this is returning the size reported by the validator again. I think we should address this on the validator side; it doesn't make sense for OpenNeuro to crawl the whole dataset just to report an accurate size when the validator has to do that anyway. I did try using git-annex info to get an estimate, but this is still too slow for our largest datasets given the storage performance available.
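
As a rough illustration of relying on the validator-reported size, a resolver could return no size at all when the summary is missing instead of a made-up number. The `resolve_size` name and the `size` key on the summary are assumptions made for this sketch, not the actual resolver.

```python
from typing import Optional


def resolve_size(validator_summary: Optional[dict]) -> Optional[int]:
    """Return the validator-reported size, or None when it is unavailable."""
    # Return early when there is no summary yet (hypothetical input shape).
    if not validator_summary:
        return None
    size = validator_summary.get("size")
    # Guard against missing or malformed values rather than report a false size.
    if not isinstance(size, int) or size < 0:
        return None
    return size


known = resolve_size({"size": 2048})
missing = resolve_size(None)
malformed = resolve_size({"size": "unknown"})
```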

rwblair · 2022-10-04T15:10:18Z

packages/openneuro-app/src/scripts/dataset/download/download-native.js

+        await body.pipeThrough(progress).pipeTo(writable)
+      } else {
+        apmTransaction.captureError(statusText)
+        return requestFailureToast()


Is the following accurate? In earlier versions, if a file failed, the entire download would return then and there, whereas in this version only the current directory download is abandoned and all other paths are still downloaded. If so, I could imagine having a maximum number of failed downloads after which downloadNative bails out.

That's right. You can cancel the entire download by closing the toast but otherwise I think we want to try to get as much as possible, so that if something is missed you can resume with a second attempt and only need a small subset to get a complete dataset.

That brings up a fix here, though: I think this should show you at the end if any files failed due to network or other interruptions, and suggest a retry. Otherwise it's easy to assume you have the whole dataset when one file is missing.
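
The behavior discussed in this thread could look roughly like the sketch below: keep going after a failure, collect the failed paths for a final report, and optionally bail out after a maximum number of failures. It is simplified to per-file granularity, and `download_all`, `fake_fetch`, `TooManyFailures`, and `max_failures` are hypothetical names, not the code in download-native.js.

```python
class TooManyFailures(Exception):
    """Raised when the (hypothetical) failure limit is reached."""


def download_all(paths, fetch, max_failures=None):
    """Try every path, collecting failures instead of stopping at the first."""
    failed = []
    for path in paths:
        try:
            fetch(path)
        except OSError:
            failed.append(path)
            # Optional limit, along the lines suggested in the review.
            if max_failures is not None and len(failed) >= max_failures:
                raise TooManyFailures(failed)
    return failed


def fake_fetch(path):
    """Stand-in for a network request; fails for one kind of file."""
    if path.endswith("bold.nii.gz"):
        raise OSError("network interrupted")


paths = ["README", "sub-01/func/sub-01_bold.nii.gz", "CHANGES"]
failed = download_all(paths, fake_fetch)
# A retry only needs the failed subset, not the whole dataset.
retry_paths = failed
```

Returning the failed list is what lets the caller show the missing filenames at the end and suggest a second attempt.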

nellh · 2022-10-04T15:48:25Z

services/datalad/datalad_service/handlers/files.py

@@ -16,52 +15,42 @@ def __init__(self, store):
        self.store = store
        self.logger = logging.getLogger('datalad_service.' + __name__)

-    def on_get(self, req, resp, dataset, filename=None, snapshot='HEAD'):
+    def on_get(self, req, resp, dataset, filename, snapshot='HEAD'):


This was overloaded so that a files request with a filename argument would return a file, but without it would return the listing. The change here makes filename required and moves all the listing functionality to the TreeResource class.
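
A simplified sketch of that split, assuming a plain dict in place of the real store and dropping the req/resp arguments. Only the TreeResource name and the on_get signature change come from this PR; the method bodies and the TreeResource signature are illustrative assumptions.

```python
class FilesResource:
    """Serves a single file; filename is a required argument."""

    def __init__(self, store):
        self.store = store

    def on_get(self, dataset, filename, snapshot="HEAD"):
        # No listing fallback here: a missing filename is a caller error.
        return self.store[(dataset, snapshot)][filename]


class TreeResource:
    """Serves listings only (hypothetical signature)."""

    def __init__(self, store):
        self.store = store

    def on_get(self, dataset, tree="HEAD"):
        return sorted(self.store[(dataset, tree)])


# Hypothetical store: (dataset, ref) -> {filename: content}
store = {("ds000001", "HEAD"): {"README": b"hello", "CHANGES": b"1.0.0"}}

content = FilesResource(store).on_get("ds000001", "README")
names = TreeResource(store).on_get("ds000001")
```

With the default removed, calling the file handler without a filename fails immediately with a TypeError instead of silently returning a listing.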

codecov · 2022-10-04T16:55:07Z

Codecov Report

Merging #2678 (ecf0804) into master (e449cfe) will decrease coverage by 0.10%.
The diff coverage is 59.45%.

@@            Coverage Diff             @@
##           master    #2678      +/-   ##
==========================================
- Coverage   39.66%   39.56%   -0.11%     
==========================================
  Files         561      562       +1     
  Lines       34273    34282       +9     
  Branches      910      903       -7     
==========================================
- Hits        13594    13563      -31     
- Misses      20554    20593      +39     
- Partials      125      126       +1

Impacted Files	Coverage Δ
packages/openneuro-app/src/scripts/apm.js	`66.66% <ø> (-7.25%)`	⬇️
...app/src/scripts/dataset/download/download-query.js	`15.38% <0.00%> (-1.29%)`	⬇️
...es/openneuro-app/src/scripts/types/dataset-file.ts	`0.00% <0.00%> (ø)`
packages/openneuro-cli/src/actions.js	`0.00% <0.00%> (ø)`
packages/openneuro-cli/src/datasets.js	`61.42% <0.00%> (-1.81%)`	⬇️
...ckages/openneuro-server/src/datalad/description.js	`79.47% <ø> (+0.52%)`	⬆️
packages/openneuro-server/src/datalad/draft.js	`50.00% <ø> (+3.06%)`	⬆️
...es/openneuro-server/src/graphql/resolvers/draft.js	`0.00% <0.00%> (ø)`
packages/openneuro-server/src/graphql/schema.js	`0.00% <0.00%> (ø)`
.../datalad/datalad_service/handlers/annex_objects.py	`53.33% <0.00%> (+7.50%)`	⬆️
... and 30 more

rwblair · 2022-10-05T16:17:07Z

Does this call to get_repo_files need to be amended? I wasn't sure whether snapshot there is a tree-ish:

openneuro/services/datalad/datalad_service/handlers/annex_objects.py

Line 21 in 6981624

files = get_repo_files(dataset_path, snapshot)

* The output of the above is used by pygit2 repo.index.remove_all(...), which is a recursive delete.

nellh · 2022-10-05T16:19:10Z

The snapshot value here is the tag (or the commit hash; either works), so the call returns that commit's tree.

nellh · 2022-10-05T18:38:11Z

Got it, fixed in b9e1243

effigies · 2022-10-05T20:18:27Z

services/datalad/datalad_service/common/annex.py

    if filename == '.gitattributes':
        return


This will now be covered by L53. Are we happy to also be skipping .gitmodules, .gitignore, etc?

This only affects whether OpenNeuro displays these, so do we want to display .gitignore? We don't allow uploading it, but it could exist via git push.

I think it's fine to exclude from display. We should probably reject git pushes with .gitignore files until we drop the working tree.
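
One way the display exclusion discussed here could be expressed is a prefix check on the basename. This is an assumption for illustration only: the actual rule in annex.py is not shown in this thread, and `is_displayed` is a hypothetical helper.

```python
def is_displayed(filename: str) -> bool:
    """Hide git metadata files such as .gitattributes, .gitmodules, .gitignore."""
    # Compare only the final path component so nested files are covered too.
    basename = filename.rsplit("/", 1)[-1]
    # Assumed rule: anything whose name starts with ".git" is hidden.
    return not basename.startswith(".git")


visible = [
    name
    for name in [".gitattributes", ".gitignore", ".bidsignore", "README"]
    if is_displayed(name)
]
```

Note that a prefix rule like this hides every `.git*` name, which is broader than the single `.gitattributes` check it would replace.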

nellh added 24 commits September 29, 2022 16:30

refactor(server): Use a git tree object approach for file recursion

b4a3cd3

refactor(app): Adopt git tree loading for dataset page file tree

f4a3aa7

fix(app): Convert FileTree to TypeScript

0b4a26e

fix(worker): Revert test example for full vs basename paths

2928632

fix(app): Convert Files component to TypeScript

e2d4ed9

types: Add DatasetFile type for API file listings

8b9284f

fix(app): Fix nested directory names to show only filename

c3303a2

fix(app): TypeScript build fixes for file tree components

c8ffa20

fix(worker): Avoid looking up URLs for tree objects

10f96fb

fix(api): Simplify getFiles to only return file listings

1598c57

docs(api): Add documentation for how to retrieve file listings

16a94f0

fix(worker): Sort files in a BIDS aware order

326dae2

fix(app): Remove client side sort behavior and adjust tests for new tree APIs

88c2f79

test(worker): Add test for worker sorting behavior

67e10b3

fix(api): Use bids-validator size

4bef165

fix(api): Don't return a false size on snapshot creation

b79eacf

fix(cli): Reverse tag listing during downloads (newest first)

c0419a4

fix(cli): Use new download API

f3acc9e

Adds better progress for file downloads as well.

fix(server): Make size resolver more robust to missing data

605ae7c

fix(cli): Cleanup unused GraphQL query

2413bef

fix(app): Fix APM init issue that prevents local testing of downloads

65bb342

refactor(app): Use new file trees during browser download

a18a337

fix(app): Improve progress handling for native browser downloads

1bdfd80

nellh marked this pull request as ready for review October 3, 2022 22:43

nellh requested a review from rwblair October 3, 2022 22:43

nellh added 2 commits October 3, 2022 20:37

fix(api): Remove unused cache clear from draft files

93598ab

tests(worker): Set explicit 'directory': False values on non-annexed files

5f11372

nellh force-pushed the datalad-service/tree-based-files branch from f8c1774 to 5f11372 Compare October 4, 2022 15:41

docs(api): Update examples for file listings

c51a3d0

rwblair reviewed Oct 4, 2022

nellh commented Oct 4, 2022

nellh added 2 commits October 4, 2022 09:59

fix(api): Return null size earlier for size resolvers

460dfc3

fix(app): Improve loading state for file tree directories

6981624

Drop react-spring which was effectively commented out post-redesign.

nellh mentioned this pull request Oct 5, 2022

Debug cumulative delay in advancedSearchDatasets #2673

Closed

nellh added 4 commits October 5, 2022 11:11

fix(app): Use annexed boolean for File component

4556161

tests(app): Test fileTreeLevels behavior to render files at correct depth

9550c69

fix(app): Show the filename for any failed files and suggest a user retry.

90f1fa3

fix(worker): Simplify removing annex objects to avoid get_repo_files call

b9e1243

fix(app): Fix file.size type for File leaf component

ecf0804

effigies reviewed Oct 5, 2022

rwblair approved these changes Oct 5, 2022

nellh merged commit 52fd570 into master Oct 5, 2022

nellh deleted the datalad-service/tree-based-files branch October 5, 2022 21:48
