Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: url and object splitting for local files #1007

Conversation

lobis
Copy link
Collaborator

@lobis lobis commented Oct 24, 2023

Fixes #1006.

Add additional tests to cover more cases.

@lobis lobis marked this pull request as ready for review October 24, 2023 20:56
@lobis lobis requested a review from jpivarski October 24, 2023 20:56
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about this being so difficult, with so many corner cases. If I could do Uproot 3 over again (or whichever version introduced those !*$&@# colons), I wouldn't have added this feature. But once added, people use it a lot.

Another time I could have killed it was in the Uproot 3 → 4 transition, but users accustomed to the colon thought that Uproot 4 couldn't read the files because the colon wasn't implemented at first: #79.

Just make the one change described below with a test that is sensitive to it and then you can merge this PR.

path = path.strip()

# split url into parts
parsed_url = urlparse(path)
if "://" in path:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be tighter (URI scheme syntax):

>>> fsm = re.compile("^[a-zA-Z][a-zA-Z0-9\+\.\-]*://")

>>> fsm.match("http://stuff")
<re.Match object; span=(0, 7), match='http://'>

>>> fsm.match("/tmp/filename.root://stuff")
None

>>> fsm.match("/tmp/filename.root:/subdirectory/stuff")
None

Maybe instead of matching schemes as ^[a-zA-Z][a-zA-Z0-9\+\.\-]*, we should have a short list of supported schemes? Is there a way of asking fsspec what it supports?

Because we can't do anything about this:

>>> fsm.match("filename.root://stuff")
<re.Match object; span=(0, 16), match='filename.root://'>

unless we say that a scheme can be root but not filename.root.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, there's no good reason for users to put superfluous slashes in the object path. //one/two is exactly the same as one/two except that it's deliberately asking for trouble. Anyway, if someone runs into it, we can tell them to just remove the unnecessary // at the beginning of their object path.

But at least do

Suggested change
if "://" in path:
if _uri_scheme.match(path):

with

_uri_scheme = re.compile("^[a-zA-Z][a-zA-Z0-9\+\.\-]*://")

just above this function.

It's legal to have more colons and slashes inside the object path, and requiring the scheme to satisfy the URI specification would cut down on false positives.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this should be more robust. I will add the uri regex and merge this as is to fix the bug but I will also make another PR to improve this.

I like the idea of using fsspec to get the list of available protocols to match (it is possible to retrieve this). I think the logic should be as follows:

  • Try to match agains the available protocols.
  • If no match, try to find :// inside the string. If found throw an exception saying it's not a valid procotol and point the user towards the available protocols (print the list). This would throw on cases such as file.root://object treating file.root as an invalid schema.
  • if no schema found in the string, assume its a local file and prepend the file:// schema.
  • process the fully qualified uri.

There is a subtle thing to keep in mind. I believe uproot currently supports uppercase (or any case) schemas, such as "hTtP" because the comparison is caseless. I believe fsspec is case sensitive to the schema (and I have not seen any schema with uppercase). fsspec.filesystem("http") is valid while fsspec.filesystem("httP") is not. This would technically mean breaking changes to the API but I think it makes sense.

@lobis lobis merged commit 3c429c6 into main Oct 24, 2023
@lobis lobis deleted the 1006-file-object-separator-colon-parsing-pr-976-must-be-fixed-before-the-next-release branch October 24, 2023 23:13
lobis added a commit that referenced this pull request Oct 25, 2023
* origin/main:
  fix: url and object splitting for local files (#1007)
lobis added a commit that referenced this pull request Oct 25, 2023
* origin/main:
  fix: url and object splitting for local files (#1007)
@lobis lobis mentioned this pull request Oct 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

File-object separator colon parsing PR #976 must be fixed before the next release
2 participants