Add add_filename to pl.read_csv (and other read operations) #19266

Open
mkleinbort-wl opened this issue Oct 16, 2024 · 5 comments
Labels
enhancement (New feature or an improvement of an existing feature), good first issue (Good for newcomers)

Comments

@mkleinbort-wl

mkleinbort-wl commented Oct 16, 2024

Description

Occasionally, the filename of a data file is itself critical information.

For example:

Users/
   Alice.csv
   Bob.csv
   Charlie.csv

When using glob patterns to read this data, the file name itself is lost - which all but forces the user to loop over the files and read them manually.

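import polars as pl
from glob import glob

# This does not preserve which row belongs to which user
df = pl.read_csv('Users/*.csv')

# This works, but is a bit long: read each file separately and
# tag its rows with the file path before concatenating
df = pl.concat([
    pl.read_csv(file).with_columns(filename=pl.lit(file))
    for file in glob('Users/*.csv')
])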

A parameter that adds a column with the source file name when reading data via a glob pattern would be nice to have.

mkleinbort-wl added the enhancement label Oct 16, 2024
@cmdlineluser
Contributor

cmdlineluser commented Oct 16, 2024

include_file_paths was added for most of the formats: #17563

>>> pl.scan_csv("*.csv", include_file_paths="filename").collect()
shape: (2, 4)
┌─────┬─────┬─────┬──────────┐
│ a   ┆ b   ┆ c   ┆ filename │
│ --- ┆ --- ┆ --- ┆ ---      │
│ i64 ┆ i64 ┆ i64 ┆ str      │
╞═════╪═════╪═════╪══════════╡
│ 1   ┆ 2   ┆ 3   ┆ a.csv    │
│ 4   ┆ 5   ┆ 6   ┆ b.csv    │
└─────┴─────┴─────┴──────────┘

Seems it just needs to be exposed via read_csv

@mkleinbort-wl
Author

Thank you, I was on an old version of Polars and had not noticed.
Adding it to the eager methods would be nice.

@ritchie46
Member

Yes, let's expose this to the eager methods as well.

ritchie46 added the good first issue label Oct 17, 2024
@mcrumiller
Contributor

This would greatly benefit from using a categorical for the include_file_paths columns, no? Presumably the number of records is typically much greater than the number of files.

@alonme
Contributor

alonme commented Oct 17, 2024

Trying to tackle this one
