read_csv returns different float values for same number #17154

Closed · chrisyeh96 opened this issue Aug 2, 2017 · 11 comments · Fixed by #36228
Labels: IO CSV (read_csv, to_csv)
@chrisyeh96 (Contributor) opened this issue Aug 2, 2017:

Code Sample (copy-pastable example)

test.csv

-15.361
-15.361000
>>> import pandas as pd
>>> x = pd.read_csv('test.csv', header=None)
>>> x.loc[0, 0] == x.loc[1, 0]
False

Problem description / Expected output

The expected output of the code above is

>>> x.loc[0, 0] == x.loc[1, 0]
True

We should expect both -15.361 and -15.361000 to be converted to the same np.float64 representation. However, they are converted to different float values, differing in exactly the last bit of their floating-point representation. For some reason, -15.361 gets converted incorrectly to 0xC02EB8D4FDF3B645, whereas -15.361000 is correctly converted to 0xC02EB8D4FDF3B646.

For completeness, here are some more comparisons:

- x.loc[1, 0] is equal (==) to np.float64('-15.361'), np.float64('-15.361000'), and float('-15.361000').
- x.loc[0, 0] is not equal to any of those.
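
One way to see the single-bit difference directly is to dump each parsed value's raw IEEE 754 bit pattern (a minimal sketch of my own; the struct-based check is not part of the original report, but the two hex values are):

>>> import struct
>>> import pandas as pd
>>> x = pd.read_csv('test.csv', header=None)
>>> for i in range(2):
...     # bytes.hex() gives the big-endian IEEE 754 bit pattern
...     print('0x' + struct.pack('>d', x.loc[i, 0]).hex().upper())
0xC02EB8D4FDF3B645
0xC02EB8D4FDF3B646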

Output of pd.show_versions()


INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-73-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.2.4
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

@chris-b1 (Contributor) commented Aug 2, 2017:

I suspect this is expected behavior: by default we use a 'fast' strtod implementation. If you use the argument for the high-precision version, the numbers parse identically.

In [13]: from io import StringIO  # needed for the inline CSV below

In [14]: df = pd.read_csv(StringIO("""-15.361
    ...: -15.361000"""), header=None, float_precision='high')

In [15]: df.iloc[0, 0] == df.iloc[1, 0]
Out[15]: True

That said, you are welcome to take a look at our implementation to see if this can be fixed in the fast one without a performance impact - xstrtod
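
For intuition, here is a simplified Python rendition (my own sketch, not pandas' actual C code) of why a fast-path parser can land one ulp off: the digits are accumulated exactly, but the power-of-ten scaling is applied through a square-and-multiply loop of floating-point divisions, and each division can round. Two spellings of the same number take different division paths:

def xstrtod_sketch(s):
    # Hypothetical simplification of a fast strtod's scaling loop.
    neg = s.startswith('-')
    if neg:
        s = s[1:]
    int_part, _, frac = s.partition('.')
    number = float(int(int_part + frac))  # digit accumulation, exact here
    n = len(frac)                         # power of ten to divide out
    p10 = 10.0
    while n:                              # square-and-multiply scaling
        if n & 1:
            number /= p10                 # each division may round
        n >>= 1
        p10 *= p10
    return -number if neg else number

# '-15.361' divides by 10 then 100; '-15.361000' divides by 100 then
# 10000, so the two results can differ in the last bit:
print(xstrtod_sketch('-15.361') == xstrtod_sketch('-15.361000'))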

@gfyoung added the IO CSV (read_csv, to_csv) label on Aug 2, 2017
@gfyoung (Member) commented Aug 2, 2017:

@chris-b1 : I think that is the reason as well, though three decimal places does seem a little short for saying "not high float precision". Seems like it could be either a "bug fix" or an enhancement in which we allow for higher float precision without much performance impact.

@chrisyeh96 (Contributor, Author) commented Aug 2, 2017:

In fact, adding even a single trailing 0 can trigger the discrepancy.

test.csv

-15.361
-15.3610
>>> import pandas as pd
>>> x = pd.read_csv('test.csv', header=None)
>>> x.loc[0, 0] == x.loc[1, 0]
False

@chris-b1 (Contributor) commented Aug 2, 2017:

Trying to get exact equality out of floating-point values is generally a losing battle, doubly so with a lossy format like CSV. Do use one of the float_precision options if it's important.
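
If bit-exact equality isn't actually required, a tolerance-based comparison sidesteps the parser difference entirely (standard NumPy API; the inputs are the test.csv from the report):

>>> import numpy as np
>>> import pandas as pd
>>> x = pd.read_csv('test.csv', header=None)
>>> np.isclose(x.loc[0, 0], x.loc[1, 0])  # tolerant of a 1-ulp difference
True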

@chrisyeh96 (Contributor, Author) commented Aug 2, 2017:

@chris-b1 : Where is xstrtod called? When I search for xstrtod across the entire repo, I see many function definitions, but nowhere does there seem to be a call to it.

@chrisyeh96 (Contributor, Author) commented:

Also, my apologies if this is a dumb question, but why not just use NumPy's conversion functions? After all, the data gets stored as NumPy dtypes anyway.

e.g., if x is the string we want to convert to a float, pandas could just call numpy.float64(x).
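
For what it's worth, the NumPy scalar constructor does parse both spellings to the same value, since it goes through Python's correctly-rounded string-to-float conversion:

>>> import numpy as np
>>> np.float64('-15.361') == np.float64('-15.361000')
True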

@chris-b1 (Contributor) commented Aug 2, 2017:

xstrtod gets assigned to an instance variable here:

self.parser.double_converter_nogil = xstrtod

and is ultimately called here:

data[0] = double_converter(word, &p_end, parser.decimal,

Not a dumb question, but you might answer it yourself by looking at the code above: the pandas read_csv parser is a heavily optimized path, calling almost entirely C functions, and that particular call site doesn't hold the Python GIL. So for performance reasons we use our own.

@chrisyeh96 (Contributor, Author) commented:

I see. Is there any reason why float_precision isn't set to 'high' by default? Is the performance penalty really that large? Just looking at the code, it seems like precise_xstrtod() shouldn't be much slower than xstrtod. It may even be faster, because it uses only one division operation (https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/src/parser/tokenizer.c#L1778) instead of multiple divisions in a loop (https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/src/parser/tokenizer.c#L1616).
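
Continuing the simplified sketch from earlier (again my own Python rendition, not the actual C code): the single-division idea keeps a table of exact powers of ten and divides once, so the result is rounded only once regardless of how the number was spelled:

# powers of ten up to 1e22 are exactly representable as doubles
E = [10.0 ** i for i in range(23)]

def precise_sketch(s):
    neg = s.startswith('-')
    if neg:
        s = s[1:]
    int_part, _, frac = s.partition('.')
    number = float(int(int_part + frac))  # exact digit accumulation
    number /= E[len(frac)]                # one division, one rounding
    return -number if neg else number

print(precise_sketch('-15.361') == precise_sketch('-15.361000'))  # True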

@gfyoung (Member) commented Aug 2, 2017:

@chrisyeh96 : You can benchmark against some large datasets and see what happens. That would be a useful step for determining whether we should change the default.

@jorisvandenbossche (Member) commented:

A small test seems to suggest there is no difference in performance between default and high:

In [7]: df.to_csv('__temp.csv')

In [8]: %timeit pd.read_csv('__temp.csv', float_precision=None)
2.36 s ± 71.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %timeit pd.read_csv('__temp.csv', float_precision='high')
2.35 s ± 54.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit pd.read_csv('__temp.csv', float_precision='round_trip')
4.98 s ± 130 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

but this will need some more extensive testing to confirm.
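
The df used in the timings above isn't shown in the comment; a hypothetical construction of a comparable float-heavy file (shape is my assumption, not from the comment) would be something like:

In [1]: import numpy as np; import pandas as pd

In [2]: df = pd.DataFrame(np.random.randn(10**6, 10))  # assumed size

In [3]: df.to_csv('__temp.csv')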

@chrisyeh96 (Contributor, Author) commented:

Thanks @jreback! Glad this is finally resolved 👍
