read_csv returns different float values for same number #17154

Closed · chrisyeh96 opened this issue Aug 2, 2017 · 11 comments · Fixed by #36228
Labels: IO CSV (read_csv, to_csv)
@chrisyeh96 (Contributor) opened this issue Aug 2, 2017:

Code Sample (copy-pastable example)

test.csv

-15.361
-15.361000
>>> import pandas as pd
>>> x = pd.read_csv('test.csv', header=None)
>>> x.loc[0, 0] == x.loc[1, 0]
False

Problem description / Expected output

The expected output of the code above is

>>> x.loc[0, 0] == x.loc[1, 0]
True

We should expect both -15.361 and -15.361000 to be converted to the same np.float64 representation. However, they are converted to different float values, differing in exactly the last bit of their floating-point representation. For some reason, -15.361 gets converted incorrectly to 0xC02EB8D4FDF3B645, whereas -15.361000 is correctly converted to 0xC02EB8D4FDF3B646.

For completeness, here are some more comparisons:

- x.loc[1, 0] is equal (==) to np.float64('-15.361'), np.float64('-15.361000'), and float('-15.361000').
- x.loc[0, 0] is not equal to any of those.
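
One way to see the single-bit difference directly is to dump each parsed value's raw IEEE 754 bit pattern (a minimal sketch of my own; the struct-based check is not part of the original report, but the two hex values are):

>>> import struct
>>> import pandas as pd
>>> x = pd.read_csv('test.csv', header=None)
>>> for i in range(2):
...     # bytes.hex() gives the big-endian IEEE 754 bit pattern
...     print('0x' + struct.pack('>d', x.loc[i, 0]).hex().upper())
0xC02EB8D4FDF3B645
0xC02EB8D4FDF3B646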

Output of pd.show_versions()


INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-73-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.2.4
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

@chris-b1 (Contributor) commented Aug 2, 2017:

I suspect this is expected behavior: by default we use a 'fast' strtod implementation. If you use the argument for the high-precision version, the numbers parse identically.

In [13]: from io import StringIO  # needed for the inline CSV below

In [14]: df = pd.read_csv(StringIO("""-15.361
    ...: -15.361000"""), header=None, float_precision='high')

In [15]: df.iloc[0, 0] == df.iloc[1, 0]
Out[15]: True

That said, you are welcome to take a look at our implementation to see if this can be fixed in the fast one without a performance impact - xstrtod
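
For intuition, here is a simplified Python rendition (my own sketch, not pandas' actual C code) of why a fast-path parser can land one ulp off: the digits are accumulated exactly, but the power-of-ten scaling is applied through a square-and-multiply loop of floating-point divisions, and each division can round. Two spellings of the same number take different division paths:

def xstrtod_sketch(s):
    # Hypothetical simplification of a fast strtod's scaling loop.
    neg = s.startswith('-')
    if neg:
        s = s[1:]
    int_part, _, frac = s.partition('.')
    number = float(int(int_part + frac))  # digit accumulation, exact here
    n = len(frac)                         # power of ten to divide out
    p10 = 10.0
    while n:                              # square-and-multiply scaling
        if n & 1:
            number /= p10                 # each division may round
        n >>= 1
        p10 *= p10
    return -number if neg else number

# '-15.361' divides by 10 then 100; '-15.361000' divides by 100 then
# 10000, so the two results can differ in the last bit:
print(xstrtod_sketch('-15.361') == xstrtod_sketch('-15.361000'))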

@gfyoung added the IO CSV (read_csv, to_csv) label on Aug 2, 2017
@gfyoung (Member) commented Aug 2, 2017:

@chris-b1 : I think that is the reason as well, though three decimal places does seem a little short for saying "not high float precision". Seems like it could be either a "bug fix" or an enhancement in which we allow for higher float precision without much performance impact.

@chrisyeh96 (Contributor, Author) commented Aug 2, 2017:

In fact, adding even a single trailing 0 can trigger the discrepancy.

test.csv

-15.361
-15.3610
>>> import pandas as pd
>>> x = pd.read_csv('test.csv', header=None)
>>> x.loc[0, 0] == x.loc[1, 0]
False

@chris-b1 (Contributor) commented Aug 2, 2017:

Trying to get exact equality out of floating-point values is generally a losing battle, doubly so with a lossy format like CSV. Do use one of the float_precision options if it's important.
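
If bit-exact equality isn't actually required, a tolerance-based comparison sidesteps the parser difference entirely (standard NumPy API; the inputs are the test.csv from the report):

>>> import numpy as np
>>> import pandas as pd
>>> x = pd.read_csv('test.csv', header=None)
>>> np.isclose(x.loc[0, 0], x.loc[1, 0])  # tolerant of a 1-ulp difference
True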

@chrisyeh96 (Contributor, Author) commented Aug 2, 2017:

@chris-b1 : Where is xstrtod called? When I search for xstrtod across the entire repo, I see many function definitions, but nowhere does there seem to be a call to it.

@chrisyeh96 (Contributor, Author) commented:

Also, my apologies if this is a dumb question, but why not just use NumPy's conversion functions? After all, the data gets stored as NumPy dtypes anyway.

e.g., if x is the string we want to convert to a float, pandas could just call numpy.float64(x).
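
For what it's worth, the NumPy scalar constructor does parse both spellings to the same value, since it goes through Python's correctly-rounded string-to-float conversion:

>>> import numpy as np
>>> np.float64('-15.361') == np.float64('-15.361000')
True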

@chris-b1 (Contributor) commented Aug 2, 2017:

xstrtod gets assigned to an instance variable here:

self.parser.double_converter_nogil = xstrtod

and is ultimately called here:

data[0] = double_converter(word, &p_end, parser.decimal,

Not a dumb question, but you might answer it yourself by looking at the code above: the pandas read_csv parser is a heavily optimized path, calling almost entirely C functions, and that particular call site doesn't hold the Python GIL. So for performance reasons we use our own.

@chrisyeh96 (Contributor, Author) commented:

I see. Is there any reason why float_precision isn't set to 'high' by default? Is the performance penalty really that large? Just looking at the code, it seems like precise_xstrtod() shouldn't be much slower than xstrtod. It may even be faster, because it uses only one division operation (https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/src/parser/tokenizer.c#L1778) instead of multiple divisions in a loop (https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/src/parser/tokenizer.c#L1616).
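
Continuing the simplified sketch from earlier (again my own Python rendition, not the actual C code): the single-division idea keeps a table of exact powers of ten and divides once, so the result is rounded only once regardless of how the number was spelled:

# powers of ten up to 1e22 are exactly representable as doubles
E = [10.0 ** i for i in range(23)]

def precise_sketch(s):
    neg = s.startswith('-')
    if neg:
        s = s[1:]
    int_part, _, frac = s.partition('.')
    number = float(int(int_part + frac))  # exact digit accumulation
    number /= E[len(frac)]                # one division, one rounding
    return -number if neg else number

print(precise_sketch('-15.361') == precise_sketch('-15.361000'))  # True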

@gfyoung (Member) commented Aug 2, 2017:

@chrisyeh96 : You can benchmark against some large datasets and see what happens. That would be a useful step for determining whether we should change the default.

@jorisvandenbossche (Member) commented:

A small test seems to suggest there is no difference in performance between default and high:

In [7]: df.to_csv('__temp.csv')

In [8]: %timeit pd.read_csv('__temp.csv', float_precision=None)
2.36 s ± 71.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %timeit pd.read_csv('__temp.csv', float_precision='high')
2.35 s ± 54.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit pd.read_csv('__temp.csv', float_precision='round_trip')
4.98 s ± 130 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

but this will need some more extensive testing to confirm.
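
The df used in the timings above isn't shown in the comment; a hypothetical construction of a comparable float-heavy file (shape is my assumption, not from the comment) would be something like:

In [1]: import numpy as np; import pandas as pd

In [2]: df = pd.DataFrame(np.random.randn(10**6, 10))  # assumed size

In [3]: df.to_csv('__temp.csv')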

@chrisyeh96 (Contributor, Author) commented:

Thanks @jreback! Glad this is finally resolved 👍
