Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using pandas-path on the right side of / with cloudpathlib #12

Open
ejm714 opened this issue Oct 21, 2024 · 8 comments
Open

Using pandas-path on the right side of / with cloudpathlib #12

ejm714 opened this issue Oct 21, 2024 · 8 comments
Labels
question Further information is requested

Comments

@ejm714
Copy link

ejm714 commented Oct 21, 2024

I'm trying to create a dataframe with two column, one that has the original filepath and one that has the destination filepath so that I can iterate over the dataframe and do something like

for row in df.itertuples():
   row.original_filepath.copy(row.final_filepath)

These paths will be s3 paths.

However, when I go to set up the filepaths with chaining, the second slash in s3:// gets dropped, making the result an invalid s3path

> S3_DATA_DIR / "raw" 
S3Path('s3://mybucket/data/raw')

> str(S3_DATA_DIR / "raw")
's3://mybucket/data/raw'

> str(S3_DATA_DIR / "raw") / df.audio_string.path
0        s3:/mybucket/data/r...
1        s3:/mybucket/data/r...
@ejm714 ejm714 added the bug Something isn't working label Oct 21, 2024
@pjbull
Copy link
Member

pjbull commented Oct 21, 2024

What is df.audio_string.path? Are both s3 paths and does this happen just using cloudpathlib paths?

This comment may be helpful in showing why that happens:
drivendataorg/cloudpathlib#469 (comment)

@jayqi
Copy link
Member

jayqi commented Oct 21, 2024

@ejm714 pandas-path doesn't support cloudpathlib (edit:) out-of-the-box, and the behavior you're observing is correct behavior for regular paths. When it converts the singleton string "s3://mybucket/data/raw", it resolves the repeated slash into a single one, which is expected behavior for Posix paths.

from pathlib import Path

Path("s3://mybucket/data/raw")
#> PosixPath('s3:/mybucket/data/raw')

@pjbull
Copy link
Member

pjbull commented Oct 21, 2024

Also, see here for cloupathlib support: https://github.com/drivendataorg/pandas-path?tab=readme-ov-file#custom-path-accessors

@jayqi jayqi added question Further information is requested and removed bug Something isn't working labels Oct 21, 2024
@jayqi jayqi closed this as completed Oct 21, 2024
@ejm714
Copy link
Author

ejm714 commented Oct 21, 2024

What is df.audio_string.path?

df.audio_string is a series with dtype object. I was trying to join an s3path with a series and following the instructions in limitation 3 because we have a path object on the left hand side of a join

The same behavior happens if S3_DATA_DIR is a CloudPath instead of S3Path

I don't see a way that the custom path accessor helps here but let me know if I'm missing something. I think a join with an s3 path on the left side is just an unsupported case

@jayqi jayqi reopened this Oct 21, 2024
@pjbull
Copy link
Member

pjbull commented Oct 21, 2024

I think this is a cloudpathlib bug where we we should raise NotImplemented rather than TypeError so __rtruediv__ gets tried as well see Python docs for explanation.

import pandas as pd
from pandas_path import path
from cloudpathlib import CloudPath

CloudPath("s3://bucket/path") / pd.Series(["a", "b", "c"]).path
#> Traceback (most recent call last):
#>   File "<string>", line 1, in <module>
#>   File "/Users/bull/miniconda3/envs/sandbox/lib/python3.11/site-packages/cloudpathlib/cloudpath.py", line 891, in __truediv__
#>     raise TypeError(f"Can only join path {repr(self)} with strings or posix paths.")
#> TypeError: Can only join path S3Path('s3://bucket/path') with strings or posix paths.

@jayqi
Copy link
Member

jayqi commented Oct 21, 2024

Ironic because the bug in Limitation 3 has been fixed for many years as of Python 3.8 and we made a mistake of a similar conceptual vein in cloudpathlib.

@jayqi
Copy link
Member

jayqi commented Oct 21, 2024

I would expect the below to work portably (if cloudpathlib was working correctly). @pjbull's example would still break on a Windows computer.

edited per below comment to use PurePosixPath instead of PosixPath

from pathlib import PurePosixPath

import pandas as pd
from pandas_path import path, register_path_accessor
from cloudpathlib import S3Path

register_path_accessor("pure_posix_path", PurePosixPath)

S3Path("s3://bucket/dir") / pd.Series(['a', 'b', 'c']).pure_posix_path
#> Traceback (most recent call last):
#>   File "<string>", line 1, in <module>
#>   File "/Users/jqi/Downloads/pp-test/.venv/lib/python3.12/site-packages/cloudpathlib/cloudpath.py", line 891, in __truediv__
#>     raise TypeError(f"Can only join path {repr(self)} with strings or posix paths.")
#> TypeError: Can only join path S3Path('s3://bucket/dir') with strings or posix paths.

@pjbull
Copy link
Member

pjbull commented Oct 21, 2024

Good point. I believe that would also break on windows so should be PurePosixPath.

@jayqi jayqi changed the title Chaining operation drops second slash in s3:// Using pandas-path on the right side of / with cloudpathlib Oct 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants