File cache tree support #2678
Conversation
This breaking API change for the worker moves to per-git-tree file requests instead of prefix slicing the entire file tree. This allows for more efficient top-level requests and recursion by fetching smaller slices of the dataset as needed.
This also adds better progress reporting for file downloads.
One limitation here is that this returns the size reported by the validator again. I think we should address this on the validator side; it doesn't make sense for OpenNeuro to crawl the whole dataset just to report an accurate size when the validator has to do that anyway. I did try using git-annex info to get an estimate, but this is still too slow for our largest datasets given the storage performance available.
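To illustrate the per-tree approach, here is a minimal sketch assuming pygit2 on the service side (the function and field names are illustrative, not the actual implementation): rather than prefix-slicing one recursive listing of the whole repository, each request resolves a single tree object and returns only its immediate entries, and the client recurses by requesting subtree ids as directories are expanded.

import pygit2


def get_tree_entries(repo_path, tree_id):
    """List the immediate entries of one git tree object (sketch only)."""
    repo = pygit2.Repository(repo_path)
    tree = repo[tree_id]  # look up the tree object by its id
    entries = []
    for entry in tree:
        entries.append({
            'filename': entry.name,
            'id': str(entry.id),
            # a 'tree' entry is a directory; the client requests its id next
            'directory': entry.type_str == 'tree',
        })
    return entries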
await body.pipeThrough(progress).pipeTo(writable)
} else {
  apmTransaction.captureError(statusText)
  return requestFailureToast()
Is the following accurate? In earlier versions, if a file failed, the entire download would stop then and there, whereas with this version only the current directory's download is abandoned and all other paths will still be downloaded. If so, I could imagine giving downloadNative a maximum number of failed downloads before it bails out.
That's right. You can cancel the entire download by closing the toast, but otherwise I think we want to get as much as possible, so that if something is missed you can resume with a second attempt and only need a small subset to end up with a complete dataset.
That does bring up a fix here, though: this should show you at the end if any files failed due to network or other interruptions and suggest a retry. Otherwise it's easy to assume you have the whole dataset when one file is missing.
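A minimal, language-agnostic sketch of that retry idea (written in Python for consistency with the service-side examples; the real worker code is JavaScript and every name here is hypothetical): collect failed paths instead of aborting, report them at the end, and retry only what is missing.

def download_all(paths, fetch_one):
    """Attempt every file, collecting failures instead of aborting (sketch)."""
    failed = []
    for path in paths:
        try:
            fetch_one(path)  # assumed to raise on network or other errors
        except Exception:
            failed.append(path)  # keep going; other paths still download
    if failed:
        # surface this at the end (e.g. in the download toast) and suggest a retry
        print(f'{len(failed)} file(s) failed; retry needed for: {failed}')
    return failed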
@@ -16,52 +15,42 @@ def __init__(self, store):
         self.store = store
         self.logger = logging.getLogger('datalad_service.' + __name__)

-    def on_get(self, req, resp, dataset, filename=None, snapshot='HEAD'):
+    def on_get(self, req, resp, dataset, filename, snapshot='HEAD'):
This was overloaded: a files request with the filename argument would return a file, while one without it would return the listing. The change here is to make filename required and move all of the listing functionality to the TreeResource class.
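A rough sketch of the resulting split, assuming Falcon-style routing (the class names match the discussion, but the URI templates and responder bodies below are illustrative assumptions, not the actual implementation):

import falcon


class FilesResource:
    """Serves file contents; filename is now a required URI field."""

    def on_get(self, req, resp, dataset, filename, snapshot='HEAD'):
        resp.media = {'dataset': dataset, 'filename': filename, 'snapshot': snapshot}


class TreeResource:
    """Serves directory listings, one git tree object at a time."""

    def on_get(self, req, resp, dataset, tree):
        resp.media = {'dataset': dataset, 'tree': tree, 'files': []}


# Hypothetical URI templates; the real routes may differ.
app = falcon.App()
app.add_route('/datasets/{dataset}/files/{filename}', FilesResource())
app.add_route('/datasets/{dataset}/tree/{tree}', TreeResource())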
Codecov Report
@@ Coverage Diff @@
## master #2678 +/- ##
==========================================
- Coverage 39.66% 39.56% -0.11%
==========================================
Files 561 562 +1
Lines 34273 34282 +9
Branches 910 903 -7
==========================================
- Hits 13594 13563 -31
- Misses 20554 20593 +39
- Partials 125 126 +1
Drop react-spring which was effectively commented out post-redesign.
Does this call to get_repo_files need to be amended? I wasn't sure if the output of the above is used by pygit2 repo.index.remove_all(...), which is a recursive delete.
The snapshot value here is the tag (or the commit hash, either way works) and will return the commit's tree.
Got it, fixed in b9e1243
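For reference, a small pygit2 sketch of that resolution (the repository path and tag below are placeholders): revparse treats tags and commit hashes the same way, and peeling yields the commit's root tree.

import pygit2

repo = pygit2.Repository('/path/to/dataset')  # placeholder path
obj = repo.revparse_single('1.0.0')           # a tag name or a commit hash both resolve
root_tree = obj.peel(pygit2.Tree)             # peel tag -> commit -> root tree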
if filename == '.gitattributes':
    return
This will now be covered by L53. Are we happy to also be skipping .gitmodules, .gitignore, etc.?
This just affects whether OpenNeuro displays these, so do we want to display .gitignore? We don't allow uploading it, but it could exist via git push.
I think it's fine to exclude these from display. We should probably reject git pushes that include a .gitignore until we drop the working tree.
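As a sketch of that exclusion (the name set is an assumption; the diff above only special-cases .gitattributes), the listing could drop git metadata entries before returning them:

# Hypothetical exclusion set; the diff above only special-cases .gitattributes.
GIT_METADATA_FILES = {'.gitattributes', '.gitignore', '.gitmodules'}


def visible_entries(entries):
    """Drop git metadata files from a tree listing before display."""
    return [e for e in entries if e['filename'] not in GIT_METADATA_FILES]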
This replaces the commit files cache that many operations depend on with a per-git-tree object cache that can be loaded incrementally.
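A minimal sketch of what such a cache could look like (illustrative only; the real cache's location, keys, and shape may differ): listings are keyed by tree id, so each directory is fetched once and the cache fills in incrementally as the dataset is browsed.

class TreeCache:
    """Cache tree listings by tree id, filling in one tree at a time."""

    def __init__(self, fetch_tree):
        self._fetch_tree = fetch_tree  # callable: tree_id -> list of entries
        self._trees = {}

    def get(self, tree_id):
        # git tree objects are immutable, so a cached listing never goes stale
        if tree_id not in self._trees:
            self._trees[tree_id] = self._fetch_tree(tree_id)
        return self._trees[tree_id]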