Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Heads up: pathlib.VirtualPath may be coming soon #140

Open
barneygale opened this issue Jul 6, 2023 · 11 comments
Open

Heads up: pathlib.VirtualPath may be coming soon #140

barneygale opened this issue Jul 6, 2023 · 11 comments

Comments

@barneygale
Copy link

I've logged a CPython PR that adds a private pathlib._VirtualPath class:

python/cpython#106337

If/when that lands, it's not much more work to drop the underscore and introduce pathlib.VirtualPath to the world. This would be Python 3.13 at the earliest.

Would it be suitable for use in s3path? It would be great to hear your feedback!

@liormizr
Copy link
Owner

Thanks @barneygale for letting us know
As I looking at your PR with the new _PathBase implementation it doesn't really solve the issues that we have with pathlib extensions & integrations.

Instead of basing our Accessor layer in Accessor class (the old design that was removed from the pathlib few versions ago) basing S3Path from Path or from _PathBase it's not really matter from our point of view.
When it will be in the python version we will do the change.

The issue that we do have is with how other packages react to path objects.

for example Pandas
In Pandas, this is the api for read_csv (reading a csv file).
if you are working with strings:

# for local path
pandas.read_csv('/path/to/table.csv', ...)
# or if you want to work with uri representation
pandas.read_csv('file://localhost/path/to/table.csv', ...)

And when you want S3 you can do like this:

# for S3 uri representation
pandas.read_csv('s3://path/to/table.csv', ...)

Now when you are working with Pandas and Path objects.
Pandas are using the __fspath__ method to understand what to do.
This brings two main issues.

  1. fspath from the spec doesn't return uri, it returns the file system path representation of the object
    So:
from pathlib import Path
from s3path import S3Path

local_path = Path('/path/to/table.csv')
pandas.read_csv(local_path, ...) # it's the same like passing '/path/to/table.csv'

# but in S3Path we braike
s3_path = S3Path('/path/to/table.csv')
pandas.read_csv(s3_path, ...) # it's the same like passing '/path/to/table.csv' instead of s3://path/to/table.csv
  1. the second issue is that there is multiple ways to define uri
    When we are asking from Path the uri it's different between what we are giving and when Panda expect from us
from pathlib import Path

local_path = Path('/path/to/table.csv')

pandas.read_csv(local_path.as_uri(), ...) # This will failed. file:///path/to/table.csv is not what Pandas expect
# not the same like:
pandas.read_csv('file://localhost/path/to/table.csv', ...) # This will failed. file:///path/to/table.csv is not what Pandas expect

My point is not that Pandas have issues, Pandas was just one example.

I think that we still have to think about the general purpose and design of methods like __str__, __fspath__, __repr__.
Thinking about moving us for more uri's representation and how to give packages the tools / guidns to relay on them.

make sense?
What do you think about that?

@barneygale
Copy link
Author

barneygale commented Jul 16, 2023

Thanks for the feedback, that makes perfect sense.

My hope is that APIs like pandas.read_csv() will start accepting PathBase objects, roughly like this:

def read_csv(path: pathlib.PathBase | os.PathLike):
    if not isinstance(path, pathlib.PathBase):
        path = pathlib.Path(path)
    with path.open() as f:
        ...

It will take a while for third-party packages to begin accepting PathBase, similar to how it took a while for them to begin accepting os.PathLike.

Does that make sense?

On __fspath__(): the implementation of PathBase does not define __fspath__(), because it would result in nonsense like open(ZipPath('README')) opening a file called 'README' in the current directory -- very surprising!

On URIs: the implementation of PathBase.as_uri() raises UnsupportedOperation, because the path type might not use file: URIs. You're right that file: URIs are implemented differently across different packages. I'm not sure if leaning more heavily into URIs is the right way to go - it would involve adding, I think, a new __uri__() magic method, and probably a registry of scheme handlers.

@merwok

This comment was marked as resolved.

@liormizr
Copy link
Owner

Make perfect sense @barneygale
Thanks for your answer

Regarding _PathBase we will do the move when it will be available
Will you keep the privet name "_PathBase" or will it be changed to a public class?

@barneygale
Copy link
Author

with open(path) as f:

Oh, I wasn’t expecting that for Path classes, only methods! How can the builtin open work with a TarPath for example?

Shit sorry, horrible typo on my part. Will correct.

@barneygale
Copy link
Author

Regarding _PathBase we will do the move when it will be available Will you keep the privet name "_PathBase" or will it be changed to a public class?

It'll become a public class, hopefully in time for Python 3.13 (next year)

@barneygale
Copy link
Author

barneygale commented Dec 27, 2023

Hello - I've published CPython's private PathBase class in a PyPI package: https://pypi.org/project/pathlib-abc/0.1.0/. No docs yet but I should have them ready soon.

If the PyPI package succeeds and matures, I'm hopeful we can make PathBase public in Python itself. Hope that's helpful. Grateful for any feedback you might have!

@maresb
Copy link
Contributor

maresb commented Oct 22, 2024

@barneygale, I was just reading the Python 3.13 release notes. Congrats on the inclusion of your pathlib.Path.from_uri()!!!

@liormizr
Copy link
Owner

@barneygale Why from_uri was added to the Path class and not to the PurePath class?

@barneygale
Copy link
Author

barneygale commented Oct 26, 2024

Both as_uri() and from_uri() are strictly impure because they use os.fsencode() and os.fsdecode(), which can vary from system to system. I also adjusted the documentation for as_uri() so that it's documented as part of Path rather than PurePath, with a note:

For historical reasons, this method is also available from PurePath objects. However, its use of os.fsencode() makes it strictly impure.

@liormizr
Copy link
Owner

Both as_uri() and from_uri() are strictly impure because they use os.fsencode() and os.fsdecode(), which can vary from system to system. I also adjusted the documentation for as_uri() so that it's documented as part of Path rather than PurePath, with a note:

For historical reasons, this method is also available from PurePath objects. However, its use of os.fsencode() makes it strictly impure.

In my logic all IO oriented factuality need to be in the Path level
And all string/path structure parsing need to be in the PurePath level

How ever I full understand the logic you specify :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants