Make to_numeric default to correct precision #36149

Merged: 8 commits merged into pandas-dev:master on Sep 8, 2020

Conversation

@Dr-Irv (Contributor) commented Sep 5, 2020

This relates to a very old issue, #8002, where the default float precision for CSV files could produce answers that are wrong in the last bit. @jreback was involved in the PR review for that issue, which set the default to not be the high-precision parser.

For to_numeric(), I switched it to use precise_xstrtod instead of xstrtod by default.
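
For intuition (a conceptual Python sketch, not pandas's actual C code; naive_parse is a hypothetical helper), an xstrtod-style parser accumulates digits into a double, rounding a little at each step, so inputs with ~17 significant digits can land one bit away from Python's correctly rounded float():

import numpy as np

def naive_parse(s: str) -> float:
    # Accumulate digits into a double, as an xstrtod-style parser does.
    # Once the running value exceeds 2**53, each multiply/add can round,
    # and the final scaling division rounds once more.
    neg = s.startswith("-")
    whole, _, frac = s.lstrip("-").partition(".")
    val = 0.0
    for d in whole + frac:
        val = val * 10.0 + int(d)
    val /= 10.0 ** len(frac)
    return -val if neg else val

# Avoid exponent notation (e.g. '5e-05'), which this sketch doesn't handle.
vals = [str(x) for x in np.random.randn(1000) if abs(x) > 1e-3]
bad = sum(naive_parse(v) != float(v) for v in vals)
print(f"{bad} of {len(vals)} digit-accumulated parses differ from float()")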

It is unknown whether there are performance implications for to_numeric(), although this comment #17154 (comment) from @jorisvandenbossche suggests that maybe we should consider switching to the higher-precision parser by default in read_csv() anyway.

It is an open question whether performance analysis of this PR is needed beyond what is shown below, which seems to indicate that using precise_xstrtod is actually faster than xstrtod.

The other possibility is to use a keyword argument in to_numeric that would allow the higher precision parser.
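
To see the effect end to end (a hedged sketch, not code from this PR), one can count how many values parsed by to_numeric disagree with Python's float() over a batch of random inputs; the count is typically nonzero under the old xstrtod default and should be zero, or very nearly so, with precise_xstrtod:

import numpy as np
import pandas as pd

vals = [str(x) for x in np.random.randn(100_000)]
parsed = pd.to_numeric(pd.Series(vals, dtype="object"))
# Compare each parsed value against Python's correctly rounded float().
mismatches = sum(p != float(v) for p, v in zip(parsed, vals))
print(f"{mismatches} of {len(vals)} parsed values differ from float()")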

@Dr-Irv (Contributor, Author) commented Sep 6, 2020

So I did a performance analysis of to_numeric comparing the current parser, xstrtod, with the new one, precise_xstrtod. The penalty is minimal, which indicates that we can make the switch here and probably change the default in read_csv as well:
Version 1.1.1:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: from pandas._libs import lib

In [4]: nl=np.array(list(str(i) for i in np.random.randn(1000)),dtype='O')

In [5]: nl[:5]
Out[5]:
array(['-1.1021762980719185', '0.3910582729233139',
       '-0.24666167786443186', '-0.9074814529022244',
       '0.3332612417219568'], dtype=object)

In [6]: %timeit s=lib.maybe_convert_numeric(nl, set(), coerce_numeric=False)
210 µs ± 3.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [7]: pd.__version__
Out[7]: '1.1.1'

In [8]: nl=np.array(list(str(i) for i in np.random.randn(100000)),dtype='O')

In [9]: %timeit s=lib.maybe_convert_numeric(nl, set(), coerce_numeric=False)
20.7 ms ± 856 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

With this PR:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: from pandas._libs import lib

In [4]: nl=np.array(list(str(i) for i in np.random.randn(1000)),dtype='O')

In [5]: nl[:5]
Out[5]:
array(['-0.3997469965485634', '0.5214931351322711',
       '-0.37588201420103207', '0.00999602150055729', '0.633129133079011'],
      dtype=object)

In [6]:  %timeit s=lib.maybe_convert_numeric(nl, set(), coerce_numeric=False)
221 µs ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [7]: pd.__version__
Out[7]: '1.2.0.dev0+275.gea2b93c47.dirty'

In [8]: nl=np.array(list(str(i) for i in np.random.randn(100000)),dtype='O')

In [9]:  %timeit s=lib.maybe_convert_numeric(nl, set(), coerce_numeric=False)
21.5 ms ± 223 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

@Dr-Irv (Contributor, Author) commented Sep 6, 2020

I realized that comparing 1.1.1 to my patch doesn't account for compiler differences, so I compared against master instead.
With master:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '1.2.0.dev0+272.gcf1aa9eb0.dirty'

In [3]: import numpy as np

In [4]: from pandas._libs import lib

In [5]: nl=np.array(list(str(i) for i in np.random.randn(100000)),dtype='O')

In [6]: %timeit -r 25 -n 100 s=lib.maybe_convert_numeric(nl, set(), coerce_numeric=False)
22.2 ms ± 276 µs per loop (mean ± std. dev. of 25 runs, 100 loops each)

With the patch:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '1.2.0.dev0+275.gea2b93c47.dirty'

In [3]: from pandas._libs import lib

In [4]: import numpy as np

In [5]: nl=np.array(list(str(i) for i in np.random.randn(100000)),dtype='O')

In [6]: %timeit -r 25 -n 100 s=lib.maybe_convert_numeric(nl, set(), coerce_numeric=False)
21.8 ms ± 447 µs per loop (mean ± std. dev. of 25 runs, 100 loops each)

So this illustrates that using precise_xstrtod instead of xstrtod is faster!

@jreback added the Compat (pandas objects compatibility with Numpy or Python functions) and IO CSV (read_csv, to_csv) labels Sep 6, 2020
@jreback added this to the 1.1.2 milestone Sep 6, 2020
@simonjayhawkins (Member) commented:
Could this type of change wait until 1.2?

@jorisvandenbossche (Member) left a comment:

Change looks fine to me (no expert in xstrtod though), but I agree with @simonjayhawkins that such a non-regression fix (which actually substantially changes the implementation), although probably safe, is better targeted for 1.2.

BTW, I think you added a file by accident (maybe from running the docs? in which case we should actually gitignore it): doc/example.feather

@jreback (Contributor) left a comment:

pls move the whatsnew note to 1.2 & you have an extra file (doc/example.feather) included; remove it if you can

ping on green.

@@ -42,6 +42,7 @@ Bug fixes
- Bug in :class:`DataFrame` indexing returning an incorrect :class:`Series` in some cases when the series has been altered and a cache not invalidated (:issue:`33675`)
- Bug in :meth:`DataFrame.corr` causing subsequent indexing lookups to be incorrect (:issue:`35882`)
- Bug in :meth:`import_optional_dependency` returning incorrect package names in cases where package name is different from import name (:issue:`35948`)
- Bug in :func:`to_numeric` where float precision was incorrect (:issue:`31364`)
A contributor replied on the diff:

yeah let's move to 1.2

@simonjayhawkins modified the milestones: 1.1.2, 1.2 Sep 8, 2020
@simonjayhawkins (Member) commented:

> Change looks fine to me (no expert in xstrtod though), but I agree with @simonjayhawkins that such a non-regression fix (which actually substantially changes the implementation), although probably safe, is better targeted for 1.2.

Sorry for this. If we weren't so close to the release, then it would have sat in master for a while.

@jorisvandenbossche (Member) commented:

No need to be sorry ;) Regardless of how close a 1.1.x bugfix release would be, there is no need to include this PR in a bugfix release.

@Dr-Irv (Contributor, Author) commented Sep 8, 2020

/azp run

@azure-pipelines commented:

Azure Pipelines successfully started running 1 pipeline(s).

@jreback merged commit 8c7efd1 into pandas-dev:master Sep 8, 2020
@jreback (Contributor) commented Sep 8, 2020

thanks @Dr-Irv

@Dr-Irv (Contributor, Author) commented Sep 8, 2020

@jreback @jorisvandenbossche Based on the tests I did, I'd like to suggest that we change the default float_precision argument for read_csv and read_table to be "high" rather than None. This would address issue #17154.

If you agree, I can create a PR for that.
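
For context, a minimal sketch of the documented float_precision choices (the CSV value here is illustrative):

import io
import pandas as pd

csv = "x\n0.3066101993807095\n"
pd.read_csv(io.StringIO(csv))                                # None: xstrtod (default at the time)
pd.read_csv(io.StringIO(csv), float_precision="high")        # precise_xstrtod
pd.read_csv(io.StringIO(csv), float_precision="round_trip")  # Python's float()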

@jreback (Contributor) commented Sep 8, 2020

can u run some perf tests for csv to see how much that impacts things?

jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Sep 8, 2020
@Dr-Irv (Contributor, Author) commented Sep 8, 2020

> can u run some perf tests for csv to see how much that impacts things?

@jreback It shouldn't affect things: the performance tests above show that the "high" precision parser is slightly faster than the current default, and the parser is the only thing changed for to_numeric. Nevertheless, I ran the appropriate asv test:

Assuming I did the asv tests right, using asv continuous -f 1.1 upstream/master issue17154 -b ^gil.ParallelReadCSV:

[ 50.00%] · For pandas commit acffef2e <issue17154> (round 2/2):
[ 50.00%] ·· Building for conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
........................................
[ 50.00%] ·· Benchmarking conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· gil.ParallelReadCSV.time_read_csv                              ok
[ 75.00%] ··· ========== ============
                dtype
              ---------- ------------
                float      116±4ms
                object    15.3±0.7ms
               datetime    123±3ms
              ========== ============

[ 75.00%] · For pandas commit 8c7efd1c <master> (round 2/2):
[ 75.00%] ·· Building for conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
..
[ 75.00%] ·· Benchmarking conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· gil.ParallelReadCSV.time_read_csv                              ok
[100.00%] ··· ========== ============
                dtype
              ---------- ------------
                float      124±8ms
                object    15.6±0.3ms
               datetime    125±3ms
              ========== ============


BENCHMARKS NOT SIGNIFICANTLY CHANGED.

So I'll create a separate PR, OK?

@jreback (Contributor) commented Sep 8, 2020

yep sounds good

@jreback (Contributor) commented Sep 8, 2020

also can likely just deprecate the option entirely now

@Dr-Irv (Contributor, Author) commented Sep 8, 2020

> also can likely just deprecate the option entirely now

Not sure we can deprecate it because of the round_trip option, which corresponds to the Python string-to-double converter, so I will just change the default and document it. PR coming soon...
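
To make the round_trip distinction concrete (a minimal sketch, using only the documented float_precision values): "round_trip" delegates to Python's float(), which guarantees that repr(x) parses back to exactly x, a promise the C parsers do not formally make:

import io
import pandas as pd

x = 0.1 + 0.2  # repr(x) is '0.30000000000000004'
df = pd.read_csv(io.StringIO(f"x\n{x!r}\n"), float_precision="round_trip")
assert df["x"].iloc[0] == x  # exact round trip via Python's parser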

Labels
Compat (pandas objects compatibility with Numpy or Python functions), IO CSV (read_csv, to_csv)

Successfully merging this pull request may close these issues:

to_numeric with errors = "coerce" is adding digits at the end