read_csv returns different float values for same number #17154
Comments
I suspect this is expected behavior - by default we use a 'fast' (but less precise) float parser. With `float_precision='high'` the two values compare equal:

```
In [14]: df = pd.read_csv(StringIO("""-15.361
    ...: -15.361000"""), header=None, float_precision='high')

In [15]: df.iloc[0, 0] == df.iloc[1, 0]
Out[15]: True
```

That said, you are welcome to take a look at our implementation to see if this can be fixed in the fast one without a performance impact.
@chris-b1 : I think that is the reason as well, though three decimal places does seem a little short for saying "not high float precision". Seems like it could be either a "bug fix" or an enhancement in which we allow for higher float precision without much performance impact.
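For anyone reproducing this, the three `float_precision` modes can be compared side by side on the values from this issue; this is a small sketch (which modes disagree depends on the pandas version - the mismatch reported here was on pandas 0.20.3's default parser):

```python
# Compare read_csv float_precision modes on the values from this issue.
# None selects the default ('fast') parser; 'high' and 'round_trip' use
# more precise converters.
from io import StringIO

import pandas as pd

DATA = "-15.361\n-15.361000"

for mode in (None, "high", "round_trip"):
    df = pd.read_csv(StringIO(DATA), header=None, float_precision=mode)
    print(mode, df.iloc[0, 0] == df.iloc[1, 0])
```

Per the comment above, `'high'` (and `'round_trip'`) produce `True` for this pair.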
In fact, even adding just a single 0 can result in errors: test.csv
Trying to get exact equality out of floating points is generally a losing battle, doubly so with a lossy format like CSV - do use one of the `float_precision` options if exactness matters.
@chris-b1 : Where is the float conversion implemented?
Also, my apologies if this is a dumb question, but why not just use numpy's conversion functions? After all, the data gets stored as numpy dtypes anyway.
The converter is defined here: `pandas/pandas/_libs/parsers.pyx`, line 488 (at 8e6b09f), and is ultimately called here: `pandas/pandas/_libs/parsers.pyx`, line 1782 (at 8e6b09f).
Not a dumb question, but you might answer it yourself by looking at the above code - the pandas read_csv parser is a heavily optimized path, calling almost entirely C functions, and at that particular call site doesn't hold the Python GIL. So for performance reasons we use our own.
I see. Is there any reason why `float_precision='high'` isn't the default?
@chrisyeh96 : You can benchmark against some large datasets and see what happens. That would be a useful step for determining whether we should change the default.
A small test seems to suggest there is no difference in performance between the default and 'high', but this will need some more extensive testing to confirm.
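The small test itself did not survive this capture; a micro-benchmark along these lines could be used to compare the modes (the synthetic data, row count, and repeat count here are illustrative choices, not from the original comment):

```python
# Rough micro-benchmark of read_csv float_precision modes on synthetic
# float data. Sizes and repeat counts are arbitrary illustrative values.
import timeit
from io import StringIO

import numpy as np
import pandas as pd

values = np.random.default_rng(0).normal(size=100_000)
csv_text = "\n".join(f"{v:.6f}" for v in values)

for mode in (None, "high", "round_trip"):
    t = timeit.timeit(
        lambda: pd.read_csv(StringIO(csv_text), header=None, float_precision=mode),
        number=5,
    )
    print(f"{mode!s:>10}: {t:.3f} s for 5 reads")
```

As usual with micro-benchmarks, results depend on data shape and hardware, so larger and more varied datasets would be needed before drawing conclusions about the default.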
Thanks @jreback! Glad this is finally resolved 👍
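The hexadecimal bit patterns quoted in the report below can be reproduced by dumping the raw IEEE-754 encoding of a parsed double; a small standalone sketch:

```python
# Dump the IEEE-754 bit pattern of a Python float as hex, to inspect
# last-bit differences like the one reported in this issue.
import struct

def float_bits(x: float) -> str:
    """Big-endian hex of the 64-bit IEEE-754 encoding of x."""
    return "0x" + struct.pack(">d", x).hex().upper()

# Python's own string-to-float conversion is correctly rounded, so both
# spellings give the value the report calls correct:
print(float_bits(float("-15.361")))     # 0xC02EB8D4FDF3B646
print(float_bits(float("-15.361000")))  # 0xC02EB8D4FDF3B646
```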
Code Sample, a copy-pastable example if possible
test.csv
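The original snippet did not survive this capture; based on the attached test.csv and the discussion above, a minimal reproduction would look something like the following (the variable name `x` matches the comparisons quoted below):

```python
# Hypothetical reconstruction of the reporter's example: the CSV holds
# the two strings -15.361 and -15.361000, which should parse to the same
# float but did not under the default ('fast') parser in pandas 0.20.3.
from io import StringIO

import pandas as pd

x = pd.read_csv(StringIO("-15.361\n-15.361000"), header=None)
print(x.loc[0, 0] == x.loc[1, 0])  # reported False on pandas 0.20.3
```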
Problem description / Expected output
We should expect both `-15.361` and `-15.361000` to be converted to the same `np.float64` representation. However, they are converted to different float values, differing in exactly the last bit of their floating-point representation. For some reason, `-15.361` gets converted incorrectly to `0xC02EB8D4FDF3B645`, whereas `-15.361000` is correctly converted to `0xC02EB8D4FDF3B646`.
For completeness, here are some more comparisons: `x.loc[1, 0]` is equal (`==`) to `np.float64('-15.361')`, `np.float64('-15.361000')`, and `float('-15.361000')`; `x.loc[0, 0]` is not equal to any of those.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-73-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.2.4
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None