This repository has been archived by the owner on Jan 31, 2024. It is now read-only.

Process all asset path components at once #5

Open
jwodder opened this issue Jan 8, 2024 · 3 comments
Labels
performance: More efficient use of time and space

Comments

@jwodder
Member

jwodder commented Jan 8, 2024

Currently, when resolving a request path of the form /dandisets/123456/draft/foo/bar/baz, wsgidav traverses the virtual file tree one level at a time. In particular, after /dandisets/123456/draft is resolved to the given Dandiset & version, wsgidav requests the member named foo, then the member of foo named bar, and then the member of bar named baz. This results in three requests to …/asset/paths/ with different path_prefixes, yet the same effect could be accomplished with just one request to …/asset/paths/ using the final path_prefix. Simplifying the path processing in this way can be done by rewriting DandiProvider.get_resource_inst().
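To make the difference concrete, here is a minimal sketch of the prefixes involved. It assumes, purely for illustration, that each level is looked up using the path accumulated so far; the function names are hypothetical and are not part of wsgidav or the Archive API.

```python
from __future__ import annotations


def per_level_prefixes(asset_path: str) -> list[str]:
    """Prefixes visited when descending one path component at a time.

    Assumption for illustration: one lookup per accumulated prefix.
    """
    parts = [p for p in asset_path.split("/") if p]
    return ["/".join(parts[: i + 1]) for i in range(len(parts))]


def final_prefix(asset_path: str) -> str:
    """The single prefix needed when the whole asset path is resolved at once."""
    return "/".join(p for p in asset_path.split("/") if p)
```

For foo/bar/baz, per_level_prefixes() yields three prefixes (one per component), while final_prefix() yields only the last of them.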

Problem: Technically, each level of the asset path needs to be retrieved separately in order to check whether the path up to that point refers to a Zarr. However, we can get the above performance improvement while still handling all "well-behaved" Zarrs by just checking whether each path component ends in a Zarr extension (.zarr or .ngff) and, if it doesn't, assuming that the path up to that point does not refer to a Zarr. As this can technically result in inaccuracies for pathological Dandisets, the use of this strategy should be opted into via a command-line option.
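A rough sketch of the extension check described above, under the assumption that the asset path is a plain "/"-separated string. The helper name split_at_zarr is hypothetical; only the two extensions come from this issue.

```python
from __future__ import annotations

# Extensions named in this issue as marking "well-behaved" Zarrs.
ZARR_EXTENSIONS = (".zarr", ".ngff")


def split_at_zarr(asset_path: str) -> tuple[str, str | None]:
    """Split an asset path at the first component with a Zarr extension.

    Returns (path_up_to_and_including_zarr, path_inside_zarr), or
    (whole_path, None) when no component looks like a Zarr.
    This is a heuristic: it can be wrong for pathological Dandisets.
    """
    parts = [p for p in asset_path.split("/") if p]
    for i, part in enumerate(parts):
        if part.endswith(ZARR_EXTENSIONS):
            return "/".join(parts[: i + 1]), "/".join(parts[i + 1 :])
    return "/".join(parts), None
```

With this split, a path with no Zarr-like component needs no per-level checks at all, and a path with one is divided into the part to look up as an asset and the part to look up inside the Zarr.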

@jwodder added the performance label on Jan 8, 2024
@yarikoptic
Member

> Problem: Technically, each level of the asset path needs to be retrieved separately in order to check whether the path up to that point refers to a Zarr.

If such a request succeeds, wouldn't that mean the path is not within a Zarr?

Then it is a matter of a heuristic optimization that chooses between two possible access scenarios, picking the order based on the case:

  • S1: full path_prefix
  • S2: path_prefix truncated at .zarr/, with the rest of the path resolved within that .zarr

If we determine up front that there is a .zarr/ within path_prefix -- try S2, and if that fails (there is no asset at that .zarr/) -- try S1; if there is no .zarr -- just go for S1 right away.
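A minimal sketch of that ordering, assuming the two scenarios are supplied as callables that return None on failure. The names resolve_asset_path, try_full_prefix (S1) and try_within_zarr (S2) are hypothetical, and treating .ngff like .zarr is carried over from the issue description rather than from this comment.

```python
from __future__ import annotations

from typing import Callable, Optional, TypeVar

T = TypeVar("T")
ZARR_EXTENSIONS = (".zarr", ".ngff")


def resolve_asset_path(
    asset_path: str,
    try_full_prefix: Callable[[str], Optional[T]],       # S1 (hypothetical)
    try_within_zarr: Callable[[str, str], Optional[T]],  # S2 (hypothetical)
) -> Optional[T]:
    """Try S2 first when a Zarr-like component precedes more path, else S1."""
    parts = [p for p in asset_path.split("/") if p]
    for i, part in enumerate(parts[:-1]):
        if part.endswith(ZARR_EXTENSIONS):
            zarr_path = "/".join(parts[: i + 1])
            inner_path = "/".join(parts[i + 1 :])
            found = try_within_zarr(zarr_path, inner_path)
            if found is not None:
                return found
            break  # only the first Zarr-like component is tried in this sketch
    # Either no Zarr-like component was seen, or S2 failed: fall back to S1.
    return try_full_prefix("/".join(parts))
```

In this sketch a path without a Zarr-like component costs one lookup, and a path with one costs at most two.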

@jwodder
Member Author

jwodder commented Jan 8, 2024

@yarikoptic

> If such a request succeeds, wouldn't that mean the path is not within a Zarr?

If you mean "the request for the entire asset path as a path_prefix," yes. However, any heuristic for identifying Zarrs based on file extension cannot be 100% accurate, as it's possible for someone using the Archive API directly to create a Zarr without a .zarr/.ngff extension and to create a blob asset or asset folder with a .zarr/.ngff extension.

@yarikoptic
Member

> as it's possible for someone using the Archive API directly to create a Zarr without a .zarr/.ngff extension

I wouldn't worry about this one for now, although we could indeed handle it by incrementally backing off path components until hitting a Zarr at that path prefix.
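The backing-off idea could look roughly like the sketch below. It assumes some way of asking whether a Zarr asset exists at a given prefix; is_zarr_at is a hypothetical callable standing in for that check, not an existing API.

```python
from __future__ import annotations

from typing import Callable


def find_zarr_by_backoff(
    asset_path: str,
    is_zarr_at: Callable[[str], bool],  # hypothetical "is there a Zarr here?" check
) -> tuple[str, str] | None:
    """Drop trailing components one at a time until a Zarr is found.

    Returns (zarr_path, path_inside_zarr), or None if no prefix is a Zarr.
    """
    parts = [p for p in asset_path.split("/") if p]
    for end in range(len(parts), 0, -1):
        prefix = "/".join(parts[:end])
        if is_zarr_at(prefix):
            return prefix, "/".join(parts[end:])
    return None
```

In the worst case this probes once per path component, so it would only make sense as a fallback after the cheaper attempts have failed.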

> to create a blob asset or asset folder with a .zarr/.ngff extension.

That is why I suggested "try S2, and if that fails (there is no asset at that .zarr/) -- try S1".
