Make to_numeric default to correct precision #36149

Merged: 8 commits merged into pandas-dev:master on Sep 8, 2020

Conversation

@Dr-Irv (Contributor) commented Sep 5, 2020

This relates to a very old issue, #8002, where the default float precision for CSV files could produce answers that are wrong in the last bit. @jreback was involved in the PR review for that issue, which set the default to not be the high-precision parser.

For to_numeric(), I switched it to use precise_xstrtod instead of xstrtod by default.
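
For intuition (a conceptual Python sketch, not pandas's actual C code; naive_parse is a hypothetical helper), an xstrtod-style parser accumulates digits into a double, rounding a little at each step, so inputs with ~17 significant digits can land one bit away from Python's correctly rounded float():

import numpy as np

def naive_parse(s: str) -> float:
    # Accumulate digits into a double, as an xstrtod-style parser does.
    # Once the running value exceeds 2**53, each multiply/add can round,
    # and the final scaling division rounds once more.
    neg = s.startswith("-")
    whole, _, frac = s.lstrip("-").partition(".")
    val = 0.0
    for d in whole + frac:
        val = val * 10.0 + int(d)
    val /= 10.0 ** len(frac)
    return -val if neg else val

# Avoid exponent notation (e.g. '5e-05'), which this sketch doesn't handle.
vals = [str(x) for x in np.random.randn(1000) if abs(x) > 1e-3]
bad = sum(naive_parse(v) != float(v) for v in vals)
print(f"{bad} of {len(vals)} digit-accumulated parses differ from float()")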

It is unknown whether there are performance implications for to_numeric(), although this comment #17154 (comment) from @jorisvandenbossche suggests that maybe we should consider switching to the higher-precision parser by default in read_csv() anyway.

It is an open question whether performance analysis of this PR is needed beyond what is shown below, which seems to indicate that using precise_xstrtod is actually faster than xstrtod.

The other possibility is to use a keyword argument in to_numeric that would allow the higher precision parser.
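
To see the effect end to end (a hedged sketch, not code from this PR), one can count how many values parsed by to_numeric disagree with Python's float() over a batch of random inputs; the count is typically nonzero under the old xstrtod default and should be zero, or very nearly so, with precise_xstrtod:

import numpy as np
import pandas as pd

vals = [str(x) for x in np.random.randn(100_000)]
parsed = pd.to_numeric(pd.Series(vals, dtype="object"))
# Compare each parsed value against Python's correctly rounded float().
mismatches = sum(p != float(v) for p, v in zip(parsed, vals))
print(f"{mismatches} of {len(vals)} parsed values differ from float()")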

@Dr-Irv (Contributor, Author) commented Sep 6, 2020

So I did a performance analysis of to_numeric comparing the current parser, xstrtod, with the new one, precise_xstrtod. The penalty is minimal, which indicates that we can make the switch here and probably change the default in read_csv as well:
Version 1.1.1:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: from pandas._libs import lib

In [4]: nl=np.array(list(str(i) for i in np.random.randn(1000)),dtype='O')

In [5]: nl[:5]
Out[5]:
array(['-1.1021762980719185', '0.3910582729233139',
       '-0.24666167786443186', '-0.9074814529022244',
       '0.3332612417219568'], dtype=object)

In [6]: %timeit s=lib.maybe_convert_numeric(nl, set(), coerce_numeric=False)
210 µs ± 3.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [7]: pd.__version__
Out[7]: '1.1.1'

In [8]: nl=np.array(list(str(i) for i in np.random.randn(100000)),dtype='O')

In [9]: %timeit s=lib.maybe_convert_numeric(nl, set(), coerce_numeric=False)
20.7 ms ± 856 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

With this PR:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: from pandas._libs import lib

In [4]: nl=np.array(list(str(i) for i in np.random.randn(1000)),dtype='O')

In [5]: nl[:5]
Out[5]:
array(['-0.3997469965485634', '0.5214931351322711',
       '-0.37588201420103207', '0.00999602150055729', '0.633129133079011'],
      dtype=object)

In [6]:  %timeit s=lib.maybe_convert_numeric(nl, set(), coerce_numeric=False)
221 µs ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [7]: pd.__version__
Out[7]: '1.2.0.dev0+275.gea2b93c47.dirty'

In [8]: nl=np.array(list(str(i) for i in np.random.randn(100000)),dtype='O')

In [9]:  %timeit s=lib.maybe_convert_numeric(nl, set(), coerce_numeric=False)
21.5 ms ± 223 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

@Dr-Irv (Contributor, Author) commented Sep 6, 2020

I realized that comparing 1.1.1 to my patch doesn't account for compiler differences, so I compared against master instead.
With master:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '1.2.0.dev0+272.gcf1aa9eb0.dirty'

In [3]: import numpy as np

In [4]: from pandas._libs import lib

In [5]: nl=np.array(list(str(i) for i in np.random.randn(100000)),dtype='O')

In [6]: %timeit -r 25 -n 100 s=lib.maybe_convert_numeric(nl, set(), coerce_numeric=False)
22.2 ms ± 276 µs per loop (mean ± std. dev. of 25 runs, 100 loops each)

With the patch:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '1.2.0.dev0+275.gea2b93c47.dirty'

In [3]: from pandas._libs import lib

In [4]: import numpy as np

In [5]: nl=np.array(list(str(i) for i in np.random.randn(100000)),dtype='O')

In [6]: %timeit -r 25 -n 100 s=lib.maybe_convert_numeric(nl, set(), coerce_numeric=False)
21.8 ms ± 447 µs per loop (mean ± std. dev. of 25 runs, 100 loops each)

So this illustrates that using precise_xstrtod instead of xstrtod is faster!

@jreback added the Compat (pandas objects compatibility with Numpy or Python functions) and IO CSV (read_csv, to_csv) labels Sep 6, 2020
@jreback added this to the 1.1.2 milestone Sep 6, 2020
@simonjayhawkins (Member) commented:
Could this type of change wait until 1.2?

@jorisvandenbossche (Member) left a comment:

Change looks fine to me (no expert in xstrtod though), but I agree with @simonjayhawkins that such a non-regression fix (which actually substantially changes the implementation), although probably safe, is better targeted for 1.2.

BTW, I think you added a file by accident (maybe from running the docs? in which case we should actually gitignore it): doc/example.feather

@jreback (Contributor) left a comment:

pls move the whatsnew note to 1.2 & you have an extra file (doc/example.feather) included; remove it if you can

ping on green.

@@ -42,6 +42,7 @@ Bug fixes
- Bug in :class:`DataFrame` indexing returning an incorrect :class:`Series` in some cases when the series has been altered and a cache not invalidated (:issue:`33675`)
- Bug in :meth:`DataFrame.corr` causing subsequent indexing lookups to be incorrect (:issue:`35882`)
- Bug in :meth:`import_optional_dependency` returning incorrect package names in cases where package name is different from import name (:issue:`35948`)
- Bug in :func:`to_numeric` where float precision was incorrect (:issue:`31364`)
A contributor replied on the diff:

yeah let's move to 1.2

@simonjayhawkins modified the milestones: 1.1.2, 1.2 Sep 8, 2020
@simonjayhawkins (Member) commented:

> Change looks fine to me (no expert in xstrtod though), but I agree with @simonjayhawkins that such a non-regression fix (which actually substantially changes the implementation), although probably safe, is better targeted for 1.2.

Sorry for this. If we weren't so close to the release, then it would have sat in master for a while.

@jorisvandenbossche (Member) commented:

No need to be sorry ;) Regardless of how close a 1.1.x bugfix release would be, there is no need to include this PR in a bugfix release.

@Dr-Irv (Contributor, Author) commented Sep 8, 2020

/azp run

@azure-pipelines commented:

Azure Pipelines successfully started running 1 pipeline(s).

@jreback merged commit 8c7efd1 into pandas-dev:master Sep 8, 2020
@jreback (Contributor) commented Sep 8, 2020

thanks @Dr-Irv

@Dr-Irv (Contributor, Author) commented Sep 8, 2020

@jreback @jorisvandenbossche Based on the tests I did, I'd like to suggest that we change the default float_precision argument for read_csv and read_table to be "high" rather than None. This would address issue #17154.

If you agree, I can create a PR for that.
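
For context, a minimal sketch of the documented float_precision choices (the CSV value here is illustrative):

import io
import pandas as pd

csv = "x\n0.3066101993807095\n"
pd.read_csv(io.StringIO(csv))                                # None: xstrtod (default at the time)
pd.read_csv(io.StringIO(csv), float_precision="high")        # precise_xstrtod
pd.read_csv(io.StringIO(csv), float_precision="round_trip")  # Python's float()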

@jreback (Contributor) commented Sep 8, 2020

can u run some perf tests for csv to see how much that impacts things?

jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Sep 8, 2020
@Dr-Irv (Contributor, Author) commented Sep 8, 2020

> can u run some perf tests for csv to see how much that impacts things?

@jreback It shouldn't affect things: the performance tests above show that the "high" precision parser is slightly faster than the current default, and the parser is the only thing changed for to_numeric. Nevertheless, I ran the appropriate asv test:

Assuming I did the asv tests right, using asv continuous -f 1.1 upstream/master issue17154 -b ^gil.ParallelReadCSV:

[ 50.00%] · For pandas commit acffef2e <issue17154> (round 2/2):
[ 50.00%] ·· Building for conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
........................................
[ 50.00%] ·· Benchmarking conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· gil.ParallelReadCSV.time_read_csv                              ok
[ 75.00%] ··· ========== ============
                dtype
              ---------- ------------
                float      116±4ms
                object    15.3±0.7ms
               datetime    123±3ms
              ========== ============

[ 75.00%] · For pandas commit 8c7efd1c <master> (round 2/2):
[ 75.00%] ·· Building for conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
..
[ 75.00%] ·· Benchmarking conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· gil.ParallelReadCSV.time_read_csv                              ok
[100.00%] ··· ========== ============
                dtype
              ---------- ------------
                float      124±8ms
                object    15.6±0.3ms
               datetime    125±3ms
              ========== ============


BENCHMARKS NOT SIGNIFICANTLY CHANGED.

So I'll create a separate PR, OK?

@jreback (Contributor) commented Sep 8, 2020

yep sounds good

@jreback (Contributor) commented Sep 8, 2020

also can likely just deprecate the option entirely now

@Dr-Irv (Contributor, Author) commented Sep 8, 2020

> also can likely just deprecate the option entirely now

Not sure we can deprecate it because of the round_trip option, which corresponds to the Python string-to-double converter, so I will just change the default and document it. PR coming soon...
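
To make the round_trip distinction concrete (a minimal sketch, using only the documented float_precision values): "round_trip" delegates to Python's float(), which guarantees that repr(x) parses back to exactly x, a promise the C parsers do not formally make:

import io
import pandas as pd

x = 0.1 + 0.2  # repr(x) is '0.30000000000000004'
df = pd.read_csv(io.StringIO(f"x\n{x!r}\n"), float_precision="round_trip")
assert df["x"].iloc[0] == x  # exact round trip via Python's parser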

Labels
Compat (pandas objects compatibility with Numpy or Python functions), IO CSV (read_csv, to_csv)

Successfully merging this pull request may close these issues:

to_numeric with errors = "coerce" is adding digits at the end