
Support scientific notation in write_csv() #11929

Closed
Wainberg opened this issue Oct 22, 2023 · 7 comments · Fixed by #17111
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature

Comments

@Wainberg
Contributor

Wainberg commented Oct 22, 2023

Description

Currently you can do e.g. write_csv(..., float_precision=6), but this is like f'{value:.6f}' not f'{value:.6g}'. This means that small floating-point numbers will tend to get rounded to 0 when using float_precision. Ideally it would be possible to specify float_format='.6g' like in pandas's to_csv(), but any way of supporting scientific notation would help. I don't think this is addressed by #7475 but could be wrong.

The current work-around I've been using is to add .with_columns(pl.selectors.float().map_elements('{:.12g}'.format)) before the write_csv().
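For readers unfamiliar with the two format types, here is a minimal pure-Python sketch of the difference described above. It uses only Python's built-in format specifiers and does not call Polars; the sample values are made up for illustration.

```python
# Compare fixed-point (.6f) with general (.6g) formatting.
# .6f keeps six digits after the decimal point, so tiny values collapse to zero.
# .6g keeps six significant figures and switches to scientific notation when needed.
values = [123456.789, 0.5, 1.2345e-9]

fixed = [f"{v:.6f}" for v in values]    # analogous to float_precision=6
general = [f"{v:.6g}" for v in values]  # what float_format='.6g' would give in pandas

for v, f, g in zip(values, fixed, general):
    print(f"{v!r:>12}  .6f -> {f:<16} .6g -> {g}")
```

The last value is the problematic case: it prints as 0.000000 under .6f but keeps its digits under .6g.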

@Wainberg Wainberg added the enhancement New feature or an improvement of an existing feature label Oct 22, 2023
@stinodego
Member

It would be ideal if we could specify a float_format just like date_format. This would definitely be nice to have.

@stinodego stinodego added the accepted Ready for implementation label Dec 1, 2023
@github-project-automation github-project-automation bot moved this to Ready in Backlog Dec 1, 2023
@orlp
Collaborator

orlp commented Dec 1, 2023

@stinodego I definitely think this particular request is useful, but I would be wary of adding an overly comprehensive float_format.

Float formatting/parsing is actually a surprisingly large bottleneck in reading/writing CSV files, and if you add options to the formatting you either:

  1. Have to generate a separate function for each possible format (this is fine if it's just normal / exponential, but with each extra option you double the number of functions).
  2. Have to dynamically evaluate and apply the format options for each value (this is expensive).

Remember that in Rust format! is a macro, so each call of it is actually the first option, done at compile time.
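As a rough Python analogue of the trade-off above (this is not how Polars is implemented, and `make_formatter` and `write_column` are hypothetical names), the sketch below resolves the options once per column and returns a specialised formatter, instead of re-interpreting the options for every value. In Rust the specialisation would happen at compile time; here it only hoists the work out of the per-value loop.

```python
from typing import Callable, Iterable, List


def make_formatter(scientific: bool, precision: int) -> Callable[[float], str]:
    """Resolve the formatting options once and return a specialised formatter."""
    # The format spec is built a single time, not once per value.
    spec = f".{precision}{'e' if scientific else 'f'}"
    return lambda value: format(value, spec)


def write_column(values: Iterable[float], scientific: bool = False,
                 precision: int = 6) -> List[str]:
    """Format one column; the options are evaluated outside the hot loop."""
    fmt = make_formatter(scientific, precision)
    return [fmt(v) for v in values]


print(write_column([1.5e-7, 2.0], scientific=True, precision=3))
print(write_column([1.5e-7, 2.0], scientific=False, precision=3))
```

With only a normal/exponential switch there are two specialised variants; each additional boolean option would double that count, which is the concern raised in point 1.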

@stinodego
Member

stinodego commented Dec 1, 2023

Maybe it should be a separate parameter scientific_notation=True, then? I think we can agree the current behavior of mixed formats is undesirable:

import polars as pl

df = pl.DataFrame({"float": [1.0e10, 1.0e15, 1.0e20]})
df.write_csv("float.csv")

Resulting float.csv:

float
10000000000.0
1000000000000000.0
1e20
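For comparison, here is a standard-library sketch of what uniformly scientific output could look like for the same values. It uses Python's `csv` module and the `e` format type, not Polars, and `scientific_notation=True` remains only a proposal at this point in the thread.

```python
import csv
import io

values = [1.0e10, 1.0e15, 1.0e20]

# Write to an in-memory buffer so the result can be inspected directly.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["float"])
for v in values:
    # The 'e' format type always uses scientific notation (six decimals by default).
    writer.writerow([f"{v:e}"])

csv_text = buffer.getvalue()
print(csv_text)
```

Every row then has the same shape, whatever the magnitude of the value.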

@Wainberg
Contributor Author

Wainberg commented Dec 1, 2023

@stinodego would you accept a pull request for an efficient implementation of float_format that matches the behavior of pandas, i.e. taking a Python-style format string? Having the ability to set the precision is vital; the implementation wouldn't be complete without it.

@orlp
Collaborator

orlp commented Dec 1, 2023

@Wainberg I'm not a huge fan of accepting a format string.

What features would you be looking for besides precision and scientific-ness? float_precision is already an option on write_csv.

@Wainberg
Contributor Author

Wainberg commented Dec 1, 2023

I guess most generally it would be nice to control:

  • the magnitude threshold at which you switch from regular to scientific notation (e.g. 1e-4 for 'g' format specifiers in Python, infinity for 'e', and 0 for 'f')
  • the number of decimal places OR the number of significant figures for scientific notation, for numbers with magnitudes below the threshold
  • the number of decimal places for non-scientific notation, for numbers with magnitudes above the threshold

Ideally, you want to be able to do this separately for different columns.

The nice thing about C/Python/Rust-style format specifiers is that support for them is already being planned for Expr.format() in #7133, so you'd be able to reuse the implementation and reduce code duplication.

Of course, that wouldn't give you fine-grained control over the magnitude threshold for switching from regular to scientific notation, but 'g' vs 'f' vs 'e' is usually enough choice. In cases where it's not, it's straightforward to use a custom when/then/otherwise expression and convert the float columns to strings before writing.
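The three controls listed above can be sketched in plain Python as follows. The function name `format_float` and its parameters are hypothetical and only illustrate the idea; this is not a Polars API, and it specifies scientific precision as decimal places of the mantissa, not significant figures.

```python
def format_float(value: float, threshold: float = 1e-4,
                 sci_digits: int = 5, fixed_decimals: int = 6) -> str:
    """Use scientific notation below the magnitude threshold, fixed-point otherwise."""
    # Zero is kept in fixed-point form (an assumption; it has no meaningful exponent).
    if value != 0 and abs(value) < threshold:
        return f"{value:.{sci_digits}e}"
    return f"{value:.{fixed_decimals}f}"


print(format_float(5e-6))                      # below the default threshold
print(format_float(0.5))                       # above the default threshold
print(format_float(123.0, threshold=float("inf"), sci_digits=2))  # always scientific
print(format_float(5e-6, threshold=0.0))       # never scientific
```

Setting the threshold to infinity mimics 'e' and setting it to 0 mimics 'f', matching the first bullet. Applying this per column would mean choosing different arguments for each float column before converting it to strings.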

@d-reynol

@Wainberg Similarly, having the ability to specify NOT to use scientific notation would be helpful as well. Some tools understand 5e-6, but others can only parse 0.000005.
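Until such an option exists, one possible pure-Python way to get exponent-free output for finite floats is to go through `decimal.Decimal`. The helper name `to_plain` is hypothetical, and this sketch says nothing about how Polars would implement the feature.

```python
from decimal import Decimal


def to_plain(value: float) -> str:
    """Render a finite float without an exponent, keeping its shortest round-trip digits."""
    # repr() gives the shortest string that round-trips; Decimal's 'f' format
    # then prints those digits in positional notation.
    return format(Decimal(repr(value)), "f")


print(to_plain(5e-6))
print(to_plain(1e20))
print(to_plain(0.1))
```

Unlike a fixed `.6f`, this does not pad or truncate digits, so small values are not rounded to zero.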
