-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: url and object splitting for local files #1007
fix: url and object splitting for local files #1007
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about this being so difficult, with so many corner cases. If I could do Uproot 3 over again (or whichever version introduced those !*$&@# colons), I wouldn't have added this feature. But once added, people use it a lot.
Another time I could have killed it was in the Uproot 3 → 4 transition, but users accustomed to the colon thought that Uproot 4 couldn't read the files because the colon wasn't implemented at first: #79.
Just make the one change described below with a test that is sensitive to it and then you can merge this PR.
src/uproot/_util.py
Outdated
path = path.strip() | ||
|
||
# split url into parts | ||
parsed_url = urlparse(path) | ||
if "://" in path: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be tighter (URI scheme syntax):
>>> fsm = re.compile("^[a-zA-Z][a-zA-Z0-9\+\.\-]*://")
>>> fsm.match("http://stuff")
<re.Match object; span=(0, 7), match='http://'>
>>> fsm.match("/tmp/filename.root://stuff")
None
>>> fsm.match("/tmp/filename.root:/subdirectory/stuff")
None
Maybe instead of matching schemes as ^[a-zA-Z][a-zA-Z0-9\+\.\-]*
, we should have a short list of supported schemes? Is there a way of asking fsspec what it supports?
Because we can't do anything about this:
>>> fsm.match("filename.root://stuff")
<re.Match object; span=(0, 16), match='filename.root://'>
unless we say that a scheme can be root
but not filename.root
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought, there's no good reason for users to put superfluous slashes in the object path. //one/two
is exactly the same as one/two
except that it's deliberately asking for trouble. Anyway, if someone runs into it, we can tell them to just remove the unnecessary //
at the beginning of their object path.
But at least do
if "://" in path: | |
if _uri_scheme.match(path): |
with
_uri_scheme = re.compile("^[a-zA-Z][a-zA-Z0-9\+\.\-]*://")
just above this function.
It's legal to have more colons and slashes inside the object path, and requiring the scheme to satisfy the URI specification would cut down on false positives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this should be more robust. I will add the uri regex and merge this as is to fix the bug but I will also make another PR to improve this.
I like the idea of using fsspec
to get the list of available protocols to match (it is possible to retrieve this). I think the logic should be as follows:
- Try to match agains the available protocols.
- If no match, try to find
://
inside the string. If found throw an exception saying it's not a valid procotol and point the user towards the available protocols (print the list). This would throw on cases such asfile.root://object
treatingfile.root
as an invalid schema. - if no schema found in the string, assume its a local file and prepend the
file://
schema. - process the fully qualified uri.
There is a subtle thing to keep in mind. I believe uproot currently supports uppercase (or any case) schemas, such as "hTtP" because the comparison is caseless. I believe fsspec
is case sensitive to the schema (and I have not seen any schema with uppercase). fsspec.filesystem("http")
is valid while fsspec.filesystem("httP")
is not. This would technically mean breaking changes to the API but I think it makes sense.
* origin/main: fix: url and object splitting for local files (#1007)
* origin/main: fix: url and object splitting for local files (#1007)
Fixes #1006.
Add additional tests to cover more cases.