Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should UPath.__new__ return pathlib.Path instances for local paths #90

Closed
ap-- opened this issue Mar 29, 2023 · 16 comments · Fixed by #125
Closed

Should UPath.__new__ return pathlib.Path instances for local paths #90

ap-- opened this issue Mar 29, 2023 · 16 comments · Fixed by #125

Comments

@ap--
Copy link
Collaborator

ap-- commented Mar 29, 2023

Summarizing from #84

When creating a UPath instance universal_pathlib returns a pathlib.Path instance for local filesystems, which can cause problems with type checkers:

from upath import UPath     # UPath is a subclass of pathlib.Path

pth = UPath("/local/path")  # returns i.e. a pathlib.PosixPath instance on non-windows
reveal_type(pth)            # Revealed type is "upath.core.UPath"

Possible solutions:

(1) user explicitly type annotates

No changes. We could just document it very explicitly.

pth: pathlib.Path = UPath(...)

(2) always return a UPath instance

I think the reason for returning a pathlib.Path instance for local filesystems is to guarantee that there are no changes in behavior when using UPath as a replacement.

(2A) create an intermediate subclass

We could still guarantee pathlib behavior by creating an intermediate UPath class

class UPath(pathlib.Path):
    def __new__(cls, *args, **kwargs):
        if ...:                         # is local path
            return cls(*args, **kwargs)
        else:                           # promote to a fsspec backed class
            return FSSpecUPath(*args, **kwargs)
    
    # > we could define additional properties here too `.fs`, etc...

class FSSpecUPath(UPath):
    ...  # provides all the methods current UPath provides

(2B) we always return the fsspec LocalFileSystem backed UPath instance for local paths

This would be a simple change, but I think we should then port the CPython pathlib test suite to universal_pathlib, to guarantee as good as we can that the behavior is identical. I'm worried that symlink support, the windows UNC path stuff, and other weird edge-cases will probably cause quite a few issues.

(3) apply some magic

We could provide UPath as an alias for the UPath class but cast it to type[pathlib.Path] to trick the static typecheckers. And we could fallback to some metaclass magic to make runtime typechecking work, but this would basically (partially?) revert the cleanup done here: #56

@ap-- ap-- changed the title Should UPath.__new__ return pathlib.Path instances for local paths Should UPath.__new__ return pathlib.Path instances for local paths Mar 29, 2023
@ap-- ap-- mentioned this issue Apr 25, 2023
@brl0
Copy link
Contributor

brl0 commented Apr 28, 2023

I thought, perhaps mistakenly, that this previous behavior was a good thing:

import pathlib
from upath import UPath

path = UPath(".")
assert isinstance(path, pathlib.Path)
assert isinstance(path, UPath)

The idea, or at least so I thought, was to enable tools that support pathlib objects to easily support upath objects with little or no code changes.

If needed to check for identity of the class, something like this might work:

type(path) in (pathlib.WindowsPath, pathlib.PosixPath)

I'm sure there were good reasons why this behavior was removed, but I'm not sure why, and would greatly appreciate any clarification for my own understanding.

I was never sure that my previous metaclass approach was the right way to do it because the instance check stuff felt hacky, but it did seem to achieve the behavior that I thought I wanted. For reference, here's a simplified version of what that looked like:

class UPathMeta(ABCMeta):
    def __instancecheck__(cls, instance):
        return isinstance(instance, pathlib.Path)

    def __subclasscheck__(cls, subclass):
        return issubclass(subclass, pathlib.Path)

class UPath(pathlib.Path, metaclass=UPathMeta):
    ...

On a related note, I also think it would be cleaner to put the class resolution logic into get_upath_class, it seems fragmented to have some of that logic there and some in __new__.

I'm really just curious and interested in the pros and cons of various approaches, so I'd love any feedback.

@thomasw21
Copy link

Hi! Thanks for the repository, we were trying to find a way to work with multiple schemes and use the same API as pathlib, and this seems to be perfect for our usecase. We're trying to use UPath as well, and are encountering the same issue:

>>> isinstance(UPath("."), UPath)
False

This is causing some issues as we use UPath as both the constructor and the type validator. Given the previous example, I would always expect it to be True for any python class ...

@danielgafni
Copy link

Getting back to this... Would it be possible to get the old behavior back?

@normanrz
Copy link
Collaborator

normanrz commented Jun 7, 2023

UPath subclasses Path which allows for this behavior:

assert isinstance(UPath("s3://..."), Path)
assert isinstance(UPath("."), Path) and not isinstance(UPath("."), UPath)

So, UPath can be used everywhere Path is accepted which allows for an easy upgrade path in an existing code base. There is still a way to distinguish between UPath and Path.

I don't think it is great to change the behavior through metaclasses, because that messes with static analysis tooling such as type checkers.

Perhaps the type checking issues would be resolved if the signature of https://github.com/fsspec/universal_pathlib/blob/main/upath/core.py#L152 would be changed to return Path instead of PT?

@ap--
Copy link
Collaborator Author

ap-- commented Jun 7, 2023

I don't think it is great to change the behavior through metaclasses, because that messes with static analysis tooling such as type checkers.

I agree. This implicitly rules out restoring the old behavior.

@thomasw21 @danielgafni and @brl0, have I understood correctly, that the issues you're facing all boil down to:

>>> isinstance(UPath("."), UPath)  # for local paths
False

If yes, can any of you construct a case in which an intermediate UPath class would not work for your use case? (scenario 2A)

Basically the class hierarchy would then look like this:

pathlib.Path
|
+- upath.UPath  # adds properties like `.fs`, etc...
                # and would be used for pathlib style local paths
   |
   +- FSSpecPath  # default class for fsspec filesystems without special implementation in universal_pathlib
      |
      +- HTMLPath
      +- HDF5Path
      +- ... etc           

I will work on a PR this Sunday, so we can test if it would work for everyone.

Cheers,
Andreas

@thomasw21
Copy link

I was able to fix my issue by decorrelating constructors and typing. I would say as a general case that for a given class C type(C(*args, **kwargs)) != C is a weird pattern to me.

@ap--
Copy link
Collaborator Author

ap-- commented Jun 7, 2023

I would say as a general case that for a given class C type(C(*args, **kwargs)) != C is a weird pattern to me.

I partially agree: it's confusing the way it's implemented right now, since UPath(*args, **kwargs) can currently return an instance which is not a subclass of UPath.

Though the promotion to specialized subclasses is a common pattern, also used in stdlib pathlib: https://github.com/python/cpython/blob/7279fb64089abc73c03247fb8191082ee42a9671/Lib/pathlib.py#L1154-L1157

I was able to fix my issue by decorrelating constructors and typing.

Happy to hear that you found a workaround. It'll be interesting to check if it's not needed anymore after this issue is addressed.

@thomasw21
Copy link

thomasw21 commented Jun 7, 2023

Ah it's not needed anymore (I guess it'd be a good to have, but not blocking), we used dacite to parse yaml file into a bunch of data classes: https://github.com/konradhalas/dacite

@dataclass
class MyRemoteConfig:
    path: UPath

The original issue is that dacite uses UPath as both the constructor UPath(path_str) and as a type validator. So all we had to do is:

@dataclass
class MyRemoteConfig:
    path: Path

(we still use UPath as a constructor so that we can actually pass string with different schemes)

@danielgafni
Copy link

I have similar issues with validating values in Dagster. Which are supposed to be UPath, but are actually Path for local paths.

@brl0
Copy link
Contributor

brl0 commented Jun 7, 2023

I would say as a general case that for a given class C type(C(*args, **kwargs)) != C is a weird pattern to me

the promotion to specialized subclasses is a common pattern, also used in stdlib pathlib

The usage in the standard library allows for isinstance(Path(*args, **kwargs), Path), even though that will always return a WindowsPath or PosixPath.

My current thinking is that this behavior is a bad thing:

assert isinstance(UPath("."), Path) and not isinstance(UPath("."), UPath)

My current preference would be to revert to the old behavior, but that might be because I'm not clear on how that is breaking static type checkers. But since the consensus seems to be to avoid that approach, I would probably lean toward option 2B, while acknowledging the concern that it might run into some bugs in fsspec local file handling, but it is the backbone of this project, and those issues should be fixable.

I'd be curious what @barneygale would suggest as the most appropriate approach to inheriting from and being compatible with pathlib.

@barneygale
Copy link

I'm not very familiar with this project so forgive me if I get something wrong!

For Python 3.13 (>12 months away) I'm hoping to add a pathlib.AbstractPath class. Users could then do:

pth: pathlib.AbstractPath = UPath(...)

UPath.__new__() would be free to return a pathlib.Path instance, which will be a subclass of AbstractPath.

However, Path is not a subclass of UPath, so this still feels weird to me. So I think I learn towards your option 2B.

the promotion to specialized subclasses is a common pattern, also used in stdlib pathlib

FWIW, I never liked this. If I were to design pathlib from scratch, I would remove the *PosixPath and *WindowsPath classes, and instead add a flavour argument to PurePath. This would remove the need for __new__() methods and guarantee that users get back the type they asked for.

Taking that approach in universal_pathlib would mean adding dedicated classes for HTTPPath, S3Path, etc, and probably adding a module-level function to parse a URI and return an appropriate path instance:

s3_pth: universal_pathlib.S3Path = universal_pathlib.S3Path('foo/bar', bucket='blah')
any_pth: pathlib.AbstractPath = universal_pathlib.parse_uri('...')

... but obviously that's quite different! So take it or leave it :)

@barneygale
Copy link

Oh sorry, I see that there are already dedicated HTTPPath etc classes!

@ap--
Copy link
Collaborator Author

ap-- commented Jun 7, 2023

Thanks for the feedback!

FWIW, I never liked this. If I were to design pathlib from scratch, I would remove the *PosixPath and *WindowsPath classes, and instead add a flavour argument to PurePath. This would remove the need for __new__() methods and guarantee that users get back the type they asked for.

Is pathlib going to move in that direction further down? Looking at the current 3.12 implementation, it seems to be reasonable if we'd implement a filesystem_spec-url equivalent of posixpath/ntpath to be used as the flavour. In UPath we'd then select the filesystem flavour based on protocol.
Some of this functionality is already there.

Taking that approach in universal_pathlib would mean adding dedicated classes for HTTPPath, S3Path, etc, and probably adding a module-level function to parse a URI and return an appropriate path instance:

we kind of do this via the different implementations for specific filesystems.
UPath.__new__ then returns the specialized subclass (ie. the HttpPath, or an S3Path equivalent etc...) instead of a module-level function. But I see the appeal of the module level uri parse function.


Regarding `pathlib.AbstractPath`: I would be happy to contribute to this. `universal_pathlib` could be an early adopter to check if the implementation covers all use cases.

@brl0
Copy link
Contributor

brl0 commented Jun 7, 2023

This might be a terrible idea, but another option might be to backport the latest pathlib in place of option 2A.

@ap--
Copy link
Collaborator Author

ap-- commented Jun 7, 2023

But since the consensus seems to be to avoid that approach, I would probably lean toward option 2B, while acknowledging the concern that it might run into some bugs in fsspec local file handling, but it is the backbone of this project, and those issues should be fixable.

Fully agree. It's just quite a bit of work to port the important parts of the test suite. But definitely doable.

This might be a terrible idea, but another option might be to backport the latest pathlib in place of option 2A.

We're basically doing this for python 3.7, 3.8 and 3.9 already 😄
Ultimately we have to do this, to avoid having to have different implementations on our side. We didn't do it for 3.11 because we were waiting for the new implementation in 3.12.

@barneygale
Copy link

barneygale commented Jun 7, 2023

Is pathlib going to move in that direction further down?

Unlikely, but not impossible.

When users subclass AbstractPath, they should be able to choose between POSIX, Windows, or current-OS semantics. This can be done by setting _flavour to posixpath, ntpath, or os.path respectively. But the underscore at the beginning of _flavour will need to go, so PurePath.flavour will become a public attribute.

Users may then point out that there are two ways to check if a path uses Windows semantics (for example):

  1. isinstance(pth, PureWindowsPath)
  2. pth.flavour is ntpath

The first of these will not work for subclasses of AbstractPath.

If this situation proves very confusing, we might add a flavour argument to PurePath, turn the *PosixPath and *WindowsPath classes into pure protocols that check flavour and return PurePath when instantiated, and deprecate them.

But it's hard to project that far into the future, and the change would need to be extremely well motivated.

Ticket: python/cpython#100502

Regarding pathlib.AbstractPath: I would be happy to contribute to this. universal_pathlib could be an early adopter to check if the implementation covers all use cases.

Fantastic, thank you. My plan at the moment is to add a tarfile.TarPath class, using a new pathlib._AbstractPath class under the hood. I'll ping you as soon as I have something you can play with!

Ticket: python/cpython#89812

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants