optional file_index:int column when using source:list[str] | list[Path] with scan_* #13455

Open
gszep opened this issue Jan 5, 2024 · 3 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@gszep

gszep commented Jan 5, 2024

Description

Filtering rows by the file they came from is easy when explicitly iterating through the paths:

dfs = [ pl.scan_csv(path) for path in Path(data_dir).glob(pattern) ]
dfs[file_index].collect()

Suppose we have 100k files. Does this scale well? Would the following be more efficient if the file_index column was implemented?

df = pl.scan_csv([ path for path in Path(data_dir).glob(pattern) ], file_index=True)
df.filter( pl.col("file_index") == file_index ).collect()

where file_index is an integer column giving, for each row, the position of its source file in the input list. I would expect other expressions, such as is_in, to work on it as well.

@gszep gszep added the enhancement New feature or an improvement of an existing feature label Jan 5, 2024
@gszep gszep changed the title optional file_id column when using source:list[str] | list[Path] with scan_csv optional file_id column when using source:list[str] | list[Path] with scan_* Jan 5, 2024
@gszep gszep changed the title optional file_id column when using source:list[str] | list[Path] with scan_* optional file_id:int column when using source:list[str] | list[Path] with scan_* Jan 5, 2024
@gszep
Author

gszep commented Jan 5, 2024

A current workaround is:

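from pathlib import Path
import polars as pl

dfs = []
for i, path in enumerate(Path(data_dir).glob(pattern)):
    # Tag every row of this file with the file's position in the glob result.
    df = pl.scan_csv(path).with_columns(file_index=pl.lit(i, dtype=pl.Int32))
    dfs.append(df)
df = pl.concat(dfs)  # still lazy; nothing is read until collect()
df.filter(pl.col("file_index") == file_index).collect()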

@gszep gszep changed the title optional file_id:int column when using source:list[str] | list[Path] with scan_* optional file_index:int column when using source:list[str] | list[Path] with scan_* Jan 5, 2024
@cmdlineluser
Contributor

Regarding the functionality: a variation of this that came up previously was adding the file path as a column.

Wouldn't this be useful to have in the readers as it could also be used when passing a remote-glob directly? e.g. pl.scan_csv("http://.../*/foo*csv")

(DuckDB has filename=true which seems to do this.)

@cmdlineluser
Contributor

include_file_paths="column_name" has just been added, which probably closes this as well.


2 participants