Support scientific notation in `write_csv()` #11929

Wainberg · 2023-10-22T00:49:27Z

Description

Currently you can do e.g. write_csv(..., float_precision=6), but this is like f'{value:.6f}' not f'{value:.6g}'. This means that small floating-point numbers will tend to get rounded to 0 when using float_precision. Ideally it would be possible to specify float_format='.6g' like in pandas's to_csv(), but any way of supporting scientific notation would help. I don't think this is addressed by #7475 but could be wrong.

The current work-around I've been using is to add .with_columns(pl.selectors.float().map_elements('{:.12g}'.format)) before the write_csv().
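To make the difference concrete, here is a minimal pure-Python sketch of the two format specifiers mentioned above. It does not call polars at all; the sample values are made up for illustration.

```python
# Compare fixed-point ('f') and general ('g') formatting at the same precision.
# The sample values are illustrative only.
values = [1.5e-9, 0.000123456789, 12345.678]

# 'f' always prints 6 digits after the decimal point, so tiny values collapse to zero.
fixed = [f"{v:.6f}" for v in values]

# 'g' keeps 6 significant digits and switches to scientific notation when needed.
general = [f"{v:.6g}" for v in values]

for v, f_str, g_str in zip(values, fixed, general):
    print(f"{v!r:>18}  .6f -> {f_str:<14}  .6g -> {g_str}")
```

With `.6f`, the first value is written as `0.000000`, which is the information loss described in this issue; `.6g` writes it as `1.5e-09`.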


stinodego · 2023-12-01T04:30:43Z

It would be ideal if we could specify a float_format just like date_format. This would definitely be nice to have.

orlp · 2023-12-01T10:45:01Z

@stinodego I definitely think this particular request is useful, but I would be wary of adding an overly comprehensive float_format.

Float formatting/parsing is actually a surprisingly large bottleneck in reading/writing CSV files, and if you add options to the formatting you either:

1. Have to generate a separate function for each possible format (this is fine if it's just normal / exponential, but with each extra option you double the number of functions).
2. Have to dynamically evaluate and apply the format options for each value (this is expensive).

Remember that in Rust format! is a macro, so each call of it is actually the first option, done at compile time.
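As a rough illustration of the trade-off described above, the following Python sketch resolves the formatting options once per column instead of once per value. The names `make_formatter` and `write_column` are hypothetical and are not part of polars; Python also re-parses the format spec on every `format()` call, so this only shows the structure of the idea, not the performance characteristics of the Rust writer.

```python
from typing import Callable, Iterable, List


def make_formatter(scientific: bool, precision: int) -> Callable[[float], str]:
    """Hypothetical helper: pick the format spec once, up front."""
    spec = f".{precision}e" if scientific else f".{precision}f"
    return lambda value: format(value, spec)


def write_column(
    values: Iterable[float], scientific: bool = False, precision: int = 6
) -> List[str]:
    """Hypothetical helper: format a whole column with one pre-built formatter."""
    fmt = make_formatter(scientific, precision)  # options evaluated once
    return [fmt(v) for v in values]  # hot loop only applies the formatter


sci = write_column([1.0e10, 1.5e-7], scientific=True, precision=2)
plain = write_column([0.5], precision=2)
```

With only two switches (scientific or not, plus a precision), the number of distinct code paths stays small; each additional independent option would multiply them.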

stinodego · 2023-12-01T11:26:10Z

Maybe it should be a separate parameter scientific_notation=True, then? I think we can agree the current behavior of mixed formats is undesirable:

import polars as pl

df = pl.DataFrame({"float": [1.0e10, 1.0e15, 1.0e20]})
df.write_csv("float.csv")

float
10000000000.0
1000000000000000.0
1e20

Wainberg · 2023-12-01T16:26:10Z

@stinodego would you accept a pull request for an efficient implementation of float_format that matches the behavior of pandas, i.e. taking a Python-style format string? Having the ability to set the precision is vital; the implementation wouldn't be complete without it.

orlp · 2023-12-01T18:26:01Z

@Wainberg I'm not a huge fan of accepting a format string.

What features would you be looking for besides precision and scientific-ness? float_precision is already an option on write_csv.

Wainberg · 2023-12-01T21:32:28Z

I guess most generally it would be nice to control:

- the magnitude threshold at which you switch from regular to scientific notation (e.g. 1e-4 for 'g' format specifiers in Python, infinity for 'e', and 0 for 'f')
- the number of decimal places OR the number of significant figures for scientific notation, for numbers with magnitudes below the threshold
- the number of decimal places for non-scientific notation, for numbers with magnitudes above the threshold

Ideally, you want to be able to do this separately for different columns.

The nice thing about C/Python/Rust-style format specifiers is that support for them is already being planned for Expr.format() in #7133, so you'd be able to reuse the implementation and reduce code duplication.

Of course, that wouldn't give you fine-grained customization over the magnitude threshold for switching from regular to scientific notation, but usually 'g' vs 'f' vs 'e' is enough choice, and in cases where it's not, it's straightforward to use a custom when/then/otherwise expression and convert the float columns to string before writing.
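The custom-threshold idea can be sketched in plain Python as below. This is only the per-value logic, under the assumption that small magnitudes should go to scientific notation and everything else to fixed-point; `format_float` and its parameter names are hypothetical, and in polars the same branching would be expressed with when/then/otherwise on the column rather than a Python function.

```python
def format_float(
    value: float,
    threshold: float = 1e-4,
    sci_digits: int = 3,
    fixed_decimals: int = 6,
) -> str:
    """Hypothetical helper: scientific notation below `threshold`, fixed-point otherwise."""
    # Zero is kept in fixed-point form so it does not render as 0.000e+00.
    if value != 0 and abs(value) < threshold:
        return f"{value:.{sci_digits}e}"
    return f"{value:.{fixed_decimals}f}"


small = format_float(5e-6)
regular = format_float(0.5)
zero = format_float(0.0)
```

The three controls listed earlier in this comment map directly onto the three keyword arguments.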

d-reynol · 2024-02-29T11:55:19Z

@Wainberg similarly, having the ability to specify NOT to use scientific notation would be helpful as well. Some tooling understands 5e-6 but others can only parse 0.000005
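For the opposite case, a minimal pure-Python sketch of forcing plain decimal notation is shown below. It goes through `decimal.Decimal` so that no fixed number of decimal places has to be chosen; `to_plain` is a hypothetical helper name, and this is a pre-formatting step outside polars, not a `write_csv()` option.

```python
from decimal import Decimal


def to_plain(value: float) -> str:
    """Hypothetical helper: render a float without an exponent."""
    # repr() gives the shortest string that round-trips the float;
    # Decimal's 'f' format then expands any exponent into plain digits.
    return format(Decimal(repr(value)), "f")


tiny = to_plain(5e-6)
ordinary = to_plain(1.5)
huge = to_plain(1e20)
```

Here `5e-6` is rendered as `0.000005`, the form that exponent-unaware tooling can parse.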

Wainberg added the enhancement (New feature or an improvement of an existing feature) label Oct 22, 2023

stinodego added the accepted (Ready for implementation) label Dec 1, 2023

github-project-automation bot added this to Backlog Dec 1, 2023

github-project-automation bot moved this to Ready in Backlog Dec 1, 2023

d-reynol mentioned this issue Feb 29, 2024

Support for Column-Specific Float Precision #14766

Open

lukeshingles mentioned this issue Jun 21, 2024

feat: Add float_scientific option to write_csv/sink_csv #17111

Merged

ritchie46 closed this as completed in #17111 Jun 24, 2024

github-project-automation bot moved this from Ready to Done in Backlog Jun 24, 2024
