optional `file_index:int` column when using `source:list[str] | list[Path]` with `scan_*` #13455

gszep · 2024-01-05T10:17:22Z

Description

Filtering rows based on file origin is easy when explicitly iterating through paths:

dfs = [ pl.scan_csv(path) for path in Path(data_dir).glob(pattern) ]
dfs[file_index].collect()

Suppose we have 100k files. Does this scale well? Would the following be more efficient if a file_index column were implemented?

df = pl.scan_csv([ path for path in Path(data_dir).glob(pattern) ], file_index=True)
df.filter( pl.col("file_index") == file_index ).collect()

where file_index is a column of integers that does what it says on the tin: for each row, it holds the position of the source file in the list passed to scan_csv. I would expect all other logic such as is_in to work well too.


gszep · 2024-01-05T10:35:24Z

A current workaround is:

dfs = [ ]
for i,path in enumerate(Path(data_dir).glob(pattern)):
    df = pl.scan_csv(path).with_columns(file_index=pl.lit(i, dtype=pl.Int32))
    dfs.append(df)
df = pl.concat(dfs)
df.filter( pl.col("file_index") == file_index ).collect()

cmdlineluser · 2024-01-21T11:58:27Z

Just with regards to the functionality: a variation of this that came up previously was adding the filepath as a column.

Add a "filename" column option when reading multiple CSVs with globbing #9096

Wouldn't this be useful to have in the readers as it could also be used when passing a remote-glob directly? e.g. pl.scan_csv("http://.../*/foo*csv")

(DuckDB has filename=true which seems to do this.)

cmdlineluser · 2024-07-12T10:18:54Z

include_file_paths="column_name" has just been added, which probably closes this as well.

feat: Add option to include file path for Parquet, IPC, CSV scans #17563

gszep added the enhancement New feature or an improvement of an existing feature label Jan 5, 2024

gszep changed the title ~~optional file_id column when using source:list[str] | list[Path] with scan_csv~~ optional file_id column when using source:list[str] | list[Path] with scan_* Jan 5, 2024

gszep changed the title ~~optional file_id column when using source:list[str] | list[Path] with scan_*~~ optional file_id:int column when using source:list[str] | list[Path] with scan_* Jan 5, 2024

gszep changed the title ~~optional file_id:int column when using source:list[str] | list[Path] with scan_*~~ optional file_index:int column when using source:list[str] | list[Path] with scan_* Jan 5, 2024
