Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tarfile.TarPath #89812

Open
FFY00 opened this issue Oct 28, 2021 · 12 comments
Open

Add tarfile.TarPath #89812

FFY00 opened this issue Oct 28, 2021 · 12 comments
Labels
3.11 only security fixes stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@FFY00
Copy link
Member

FFY00 commented Oct 28, 2021

BPO 45649
Nosy @jaraco, @ethanfurman, @FFY00, @barneygale

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2021-10-28.15:59:21.866>
labels = ['type-feature', 'library', '3.11']
title = 'Add tarinfo.Path'
updated_at = <Date 2022-01-02.00:01:26.196>
user = 'https://github.com/FFY00'

bugs.python.org fields:

activity = <Date 2022-01-02.00:01:26.196>
actor = 'barneygale'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2021-10-28.15:59:21.866>
creator = 'FFY00'
dependencies = []
files = []
hgrepos = []
issue_num = 45649
keywords = []
message_count = 6.0
messages = ['405194', '405206', '405210', '409479', '409481', '409482']
nosy_count = 4.0
nosy_names = ['jaraco', 'ethan.furman', 'FFY00', 'barneygale']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue45649'
versions = ['Python 3.11']

Linked PRs

@FFY00
Copy link
Member Author

FFY00 commented Oct 28, 2021

It would be helpful to have a pathlib-compatible object in tarfile, similarly to zipfile.Path.

@FFY00 FFY00 added 3.11 only security fixes stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Oct 28, 2021
@jaraco
Copy link
Member

jaraco commented Oct 28, 2021

I vaguely recall exploring this concept and finding that tarfiles don’t supply the requisite interface because they’re not random access. I’m only 10% confident in that recollection, so worth exploring.

@FFY00
Copy link
Member Author

FFY00 commented Oct 28, 2021

That is good to know. This isn't very high on my priority list, but I will try to explore when I have some time.

@barneygale
Copy link
Mannequin

barneygale mannequin commented Jan 1, 2022

It's possible to do, but will be a little slow due to the nature of tar files. They're a big linked list of files, so you need to do a bunch of reads/seeks from the start to the end to enumerate all files.

I'd ask that we try to get bpo-24132 solved first. That would let us write:

    # tarfile.py
    class Path(pathlib.AbstractPath):
        def iterdir(self):
            ...
        def stat(self):
            ...

We'd fill in a smallish number of abstract methods to get a full Path-compatible class with read_text(), is_symlink() etc methods.

@jaraco
Copy link
Member

jaraco commented Jan 1, 2022

I'd recommend not to block on bpo-24132. It's not obvious to me that subclassing would be valuable. It depends on how it's implemented, but in my experience, zipfile.Path doesn't and cannot implement the full interface of pathlib.Path. Instead zipfile.Path attempts to implement a protocol. At the time, the protocol was undefined, but now there exists importlib.resources.abc.Traversable (https://docs.python.org/3/library/importlib.html#importlib.abc.Traversable), the interface needed by importlib.resources. I'd honestly just create a standalone class, see if it can implement Traversable, and only then consider if it should implement a more complicated interface (such as something with symlink support or perhaps even later subclassing from pathlib.Path).

@barneygale
Copy link
Mannequin

barneygale mannequin commented Jan 2, 2022

If you're only aiming for Traversable compatibility, sure.

The original bug description asks for something that's pathlib-compatible and similar to zipfile.Path, which goes beyond the Traversable interface in attempting to emulate pathlib.Path.

The pathlib.Path interface is a good one - I see no reason it can't apply to zip and tar archives in full. Methods of Path objects already raise NotImplementedError if operations aren't supported (e.g. creating symlinks)

Some prototyping from a couple years back, including a tar path implementation: https://github.com/barneygale/pathlab/tree/master/pathlab

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
@encukou
Copy link
Member

encukou commented May 3, 2023

Anyone implementing this should be aware that tar members don't have unique names.
A path should generally refer to the last member of the given name (which would overwrite previous ones when extracting -- unless the previous members do shenanigans with symlinks/directories, but API for accessing individual members isn't expected to take that into account).
TarFile API like getmember will take care of most of this, but not all: for example, a “listdir” operation should probably de-duplicate its result.

@barneygale barneygale changed the title Add tarinfo.Path Add tarfile.TarPath May 5, 2023
barneygale added a commit to barneygale/cpython that referenced this issue May 6, 2023
Re-arrange `pathlib.Path` methods in source code. No other changes.

The methods are arranged as follows:

1. `stat()` and dependants (`exists()`, `is_dir()`, etc)
2. `open()` and dependants (`read_text()`, `write_bytes()`, etc)
3. `iterdir()` and dependants (`glob()`, `walk()`, etc)
4. All other `Path` methods

This patch prepares the ground for a new `_AbstractPath` class, which will
support the methods in groups 1, 2 and 3 above. By churning the methods
here, subsequent patches will be easier to review and less likely to break
things.
barneygale added a commit that referenced this issue May 7, 2023
Re-arrange `pathlib.Path` methods in source code. No other changes.

The methods are arranged as follows:

1. `stat()` and dependants (`exists()`, `is_dir()`, etc)
2. `open()` and dependants (`read_text()`, `write_bytes()`, etc)
3. `iterdir()` and dependants (`glob()`, `walk()`, etc)
4. All other `Path` methods

This patch prepares the ground for a new `_AbstractPath` class, which will
support the methods in groups 1, 2 and 3 above. By churning the methods
here, subsequent patches will be easier to review and less likely to break
things.
jbower-fb pushed a commit to jbower-fb/cpython-jbowerfb that referenced this issue May 8, 2023
Re-arrange `pathlib.Path` methods in source code. No other changes.

The methods are arranged as follows:

1. `stat()` and dependants (`exists()`, `is_dir()`, etc)
2. `open()` and dependants (`read_text()`, `write_bytes()`, etc)
3. `iterdir()` and dependants (`glob()`, `walk()`, etc)
4. All other `Path` methods

This patch prepares the ground for a new `_AbstractPath` class, which will
support the methods in groups 1, 2 and 3 above. By churning the methods
here, subsequent patches will be easier to review and less likely to break
things.
carljm added a commit to carljm/cpython that referenced this issue May 9, 2023
* main: (47 commits)
  pythongh-97696 Remove unnecessary check for eager_start kwarg (python#104188)
  pythonGH-104308: socket.getnameinfo should release the GIL (python#104307)
  pythongh-104310: Add importlib.util.allowing_all_extensions() (pythongh-104311)
  pythongh-99113: A Per-Interpreter GIL! (pythongh-104210)
  pythonGH-104284: Fix documentation gettext build (python#104296)
  pythongh-89550: Buffer GzipFile.write to reduce execution time by ~15% (python#101251)
  pythongh-104223: Fix issues with inheriting from buffer classes (python#104227)
  pythongh-99108: fix typo in Modules/Setup (python#104293)
  pythonGH-104145: Use fully-qualified cross reference types for the bisect module (python#104172)
  pythongh-103193: Improve `getattr_static` test coverage (python#104286)
  Trim trailing whitespace and test on CI (python#104275)
  pythongh-102500: Remove mention of bytes shorthand (python#104281)
  pythongh-97696: Improve and fix documentation for asyncio eager tasks (python#104256)
  pythongh-99108: Replace SHA3 implementation HACL* version (python#103597)
  pythongh-104273: Remove redundant len() calls in argparse function (python#104274)
  pythongh-64660: Don't hardcode Argument Clinic return converter result variable name (python#104200)
  pythongh-104265 Disallow instantiation of `_csv.Reader` and `_csv.Writer` (python#104266)
  pythonGH-102613: Improve performance of `pathlib.Path.rglob()` (pythonGH-104244)
  pythongh-103650: Fix perf maps address format (python#103651)
  pythonGH-89812: Churn `pathlib.Path` methods (pythonGH-104243)
  ...
barneygale added a commit to barneygale/cpython that referenced this issue May 23, 2023
This internal class excludes the `__fspath__()`, `__bytes__()` and
`as_uri()` methods, which must not be inherited by a future
`tarfile.TarPath` class.
barneygale added a commit to barneygale/cpython that referenced this issue May 23, 2023
This internal class excludes the `__fspath__()`, `__bytes__()` and
`as_uri()` methods, which must not be inherited by a future
`tarfile.TarPath` class.
barneygale added a commit to barneygale/cpython that referenced this issue May 25, 2023
barneygale added a commit to barneygale/cpython that referenced this issue May 29, 2023
barneygale added a commit to barneygale/cpython that referenced this issue Jun 24, 2023
Slightly expand the test coverage of `is_junction()`. The existing test
only checks that `os.path.isjunction()` is called under-the-hood.
barneygale added a commit to barneygale/cpython that referenced this issue Jun 24, 2023
- Split out dedicated test for unbuffered `open()`
- Split out dedicated test for `is_mount()` at the filesystem root
- Avoid `os.stat()` when checking that empty paths point to '.'
- Remove unnecessary `rmtree()` call
barneygale added a commit to barneygale/cpython that referenced this issue Jun 24, 2023
Make assertions about the `st_mode`, `st_ino` and `st_dev` attributes of
the stat results from two files and a directory, rather than checking if
`chmod()` affects `st_mode` (which is already tested elsewhere).
barneygale added a commit to barneygale/cpython that referenced this issue Jun 26, 2023
barneygale added a commit that referenced this issue Jun 30, 2023
Remove `PathTest.dirlink()` function. Symlinks in `PathTest.setUp()` are
created using `os.symlink()` directly; symlinks in test functions use
`Path.symlink_to()` in order to make the tests applicable to a
(near-)future `AbstractPath` class.
barneygale added a commit that referenced this issue Jul 1, 2023
)

Adjust the pathlib tests to add a new `PathTest.can_symlink` class
attribute, which allows us to enable or disable symlink support in tests.
A (near-)future commit will add an `AbstractPath` class; its tests will
hard-code the value to `True` or `False` depending on a stub subclass's
capabilities.
barneygale added a commit that referenced this issue Jul 1, 2023
…6062)

Slightly expand the test coverage of `is_junction()`. The existing test
only checks that `os.path.isjunction()` is called under-the-hood.
barneygale added a commit to barneygale/cpython that referenced this issue Jul 1, 2023
barneygale added a commit that referenced this issue Jul 1, 2023
- Split out dedicated test for unbuffered `open()`
- Split out dedicated test for `is_mount()` at the filesystem root
- Avoid `os.stat()` when checking that empty paths point to '.'
- Remove unnecessary `rmtree()` call
- Remove unused `assertSame()` method
barneygale added a commit that referenced this issue Jul 1, 2023
Make assertions about the `st_mode`, `st_ino` and `st_dev` attributes of
the stat results from two files and a directory, rather than checking if
`chmod()` affects `st_mode` (which is already tested elsewhere).
manosriram pushed a commit to manosriram/cpython that referenced this issue Jul 1, 2023
…onGH-106061)

Remove `PathTest.dirlink()` function. Symlinks in `PathTest.setUp()` are
created using `os.symlink()` directly; symlinks in test functions use
`Path.symlink_to()` in order to make the tests applicable to a
(near-)future `AbstractPath` class.
barneygale added a commit to barneygale/cpython that referenced this issue Jul 3, 2023
@barneygale
Copy link
Contributor

barneygale commented Jul 3, 2023

PR available that adds tarfile.TarPath! #106337

@barneygale
Copy link
Contributor

barneygale commented Jul 9, 2023

What should tarfile.TarPath do with symlinks in archives? In the PR above they're followed in the same circumstances as in pathlib.PosixPath, but that might not be desirable/safe default. We could add an initialiser argument that disabled following symlinks entirely - either omit them or represent them as regular files, somehow?

If the answer isn't clear then I'd like to incubate this in a PyPI project for a while first

@encukou I saw you were working on tarfile recently. Pinging in case this interests you!

@encukou
Copy link
Member

encukou commented Jul 10, 2023

IMO, symlinks to files that are in the archive should continue to work.
Perhaps relative symlinks to files within the archive root should work too, though these become dangerous if there are any directory symlinks pointing outside.
In general, if an archive contains directory symlinks, you can only get the final destination of a file by extracting the archive (or by doing a very elaborate dry run, for which tarfile isn't suited).

It's very much not clear-cut. A tarball is, essentially, a collection of procedural instructions that allow merging contents with an existing filesystem in ”interesting” ways, while TarPath would like to work with a final tree structure.
IMO, to answer questions like this, we first need a good conceptual model of what TarPath represents.

A PyPI project that handles the simple cases sounds like a good start.

@barneygale
Copy link
Contributor

Thank you, @encukou! I'll move the TarPath implementation in #106337 into an experimental PyPI package then, leaving just the (private) pathlib._VirtualPath in that PR.

barneygale added a commit to barneygale/cpython that referenced this issue Jul 12, 2023
@encukou
Copy link
Member

encukou commented Jul 13, 2023

we first need a conceptual model of what TarPath represents

FWIW, I won't make one any time soon -- I'm taking a break from tarfile after a few months of dealing with it.
But if you want to bounce ideas off me, I'm here.

barneygale added a commit to barneygale/cpython that referenced this issue Jul 19, 2023
barneygale added a commit to barneygale/cpython that referenced this issue Aug 28, 2023
barneygale added a commit to barneygale/cpython that referenced this issue Sep 2, 2023
barneygale added a commit to barneygale/cpython that referenced this issue Sep 26, 2023
barneygale added a commit that referenced this issue Sep 30, 2023
Add private `pathlib._PathBase` class. This will be used by an experimental PyPI package to incubate a `tarfile.TarPath` class.

Co-authored-by: Adam Turner <[email protected]>
Glyphack pushed a commit to Glyphack/cpython that referenced this issue Sep 2, 2024
Add private `pathlib._PathBase` class. This will be used by an experimental PyPI package to incubate a `tarfile.TarPath` class.

Co-authored-by: Adam Turner <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.11 only security fixes stdlib Python modules in the Lib dir type-feature A feature request or enhancement
Projects
Status: No status
Development

No branches or pull requests

4 participants