Make to_numeric default to correct precision #36149
Conversation
So I did a performance analysis of `lib.maybe_convert_numeric`. With pandas 1.1.1:

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: from pandas._libs import lib
In [4]: nl=np.array(list(str(i) for i in np.random.randn(1000)),dtype='O')
In [5]: nl[:5]
Out[5]:
array(['-1.1021762980719185', '0.3910582729233139',
'-0.24666167786443186', '-0.9074814529022244',
'0.3332612417219568'], dtype=object)
In [6]: %timeit s=lib.maybe_convert_numeric(nl, set(), coerce_numeric=False)
210 µs ± 3.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [7]: pd.__version__
Out[7]: '1.1.1'
In [8]: nl=np.array(list(str(i) for i in np.random.randn(100000)),dtype='O')
In [9]: %timeit s=lib.maybe_convert_numeric(nl, set(), coerce_numeric=False)
20.7 ms ± 856 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

With this PR:

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: from pandas._libs import lib
In [4]: nl=np.array(list(str(i) for i in np.random.randn(1000)),dtype='O')
In [5]: nl[:5]
Out[5]:
array(['-0.3997469965485634', '0.5214931351322711',
'-0.37588201420103207', '0.00999602150055729', '0.633129133079011'],
dtype=object)
In [6]: %timeit s=lib.maybe_convert_numeric(nl, set(), coerce_numeric=False)
221 µs ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [7]: pd.__version__
Out[7]: '1.2.0.dev0+275.gea2b93c47.dirty'
In [8]: nl=np.array(list(str(i) for i in np.random.randn(100000)),dtype='O')
In [9]: %timeit s=lib.maybe_convert_numeric(nl, set(), coerce_numeric=False)
21.5 ms ± 223 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Realized that comparing 1.1.1 to my patch doesn't account for compiler differences, so compared against master instead:

In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '1.2.0.dev0+272.gcf1aa9eb0.dirty'
In [3]: import numpy as np
In [4]: from pandas._libs import lib
In [5]: nl=np.array(list(str(i) for i in np.random.randn(100000)),dtype='O')
In [6]: %timeit -r 25 -n 100 s=lib.maybe_convert_numeric(nl, set(), coerce_numeric=False)
22.2 ms ± 276 µs per loop (mean ± std. dev. of 25 runs, 100 loops each)

With the patch:

In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '1.2.0.dev0+275.gea2b93c47.dirty'
In [3]: from pandas._libs import lib
In [4]: import numpy as np
In [5]: nl=np.array(list(str(i) for i in np.random.randn(100000)),dtype='O')
In [6]: %timeit -r 25 -n 100 s=lib.maybe_convert_numeric(nl, set(), coerce_numeric=False)
21.8 ms ± 447 µs per loop (mean ± std. dev. of 25 runs, 100 loops each)

So this illustrates that using `precise_xstrtod` has no measurable performance cost.
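The timings above only cover speed; a minimal sketch of a matching correctness check, assuming the same `lib.maybe_convert_numeric` call used in the sessions above and that `str()` of a float gives its shortest round-trip representation:

```python
import numpy as np
from pandas._libs import lib

# Same style of input as the benchmarks: string forms of random doubles.
vals = np.random.randn(100_000)
strs = np.array([str(v) for v in vals], dtype="O")

converted = lib.maybe_convert_numeric(strs, set(), coerce_numeric=False)

# With the old xstrtod path the last bit can be off for some inputs; with
# precise_xstrtod (this PR's default for to_numeric) every value should
# round-trip exactly.
mismatches = int(np.count_nonzero(converted != vals))
print(f"{mismatches} of {len(vals)} values did not round-trip exactly")
```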
Could this type of change wait until 1.2?
Change looks fine to me (no expert in xstrtod though), but agree with @simonjayhawkins that such a non-regression fix (which actually substantially changes the implementation), although probably safe, is better targeted for 1.2.
BTW, I think you accidentally added a file, doc/example.feather (maybe from running the docs? in which case we should actually gitignore it).
pls move the whatsnew note to 1.2; also there is an extra file (doc/example.feather) included, remove it if you can.
ping on green.
doc/source/whatsnew/v1.1.2.rst
Outdated
@@ -42,6 +42,7 @@ Bug fixes
- Bug in :class:`DataFrame` indexing returning an incorrect :class:`Series` in some cases when the series has been altered and a cache not invalidated (:issue:`33675`)
- Bug in :meth:`DataFrame.corr` causing subsequent indexing lookups to be incorrect (:issue:`35882`)
- Bug in :meth:`import_optional_dependency` returning incorrect package names in cases where package name is different from import name (:issue:`35948`)
- Bug in :func:`to_numeric` where float precision was incorrect (:issue:`31364`)
yeah let's move to 1.2
sorry for this. if we weren't so close to the release, it would have sat in master for a while.
No need to be sorry ;) Regardless of how close a 1.1.x bugfix release would be, there is no need to include this PR in a bugfix release.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
thanks @Dr-Irv
@jreback @jorisvandenbossche Based on the tests I did, I'd like to suggest that we change the default `float_precision` in `read_csv()` to use the higher precision parser. If you agree, I can create a PR for that.
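For concreteness, a small sketch of what that suggestion would mean for users, using the existing `float_precision` keyword of `read_csv`; the sample values are taken from the session above, and whether they actually differ between the two parsers on a given platform is not guaranteed:

```python
import io
import pandas as pd

data = "x\n-1.1021762980719185\n0.3910582729233139\n"

# Today float_precision=None selects the fast xstrtod parser; "high"
# selects precise_xstrtod.  The suggestion is to make "high" the default.
df_default = pd.read_csv(io.StringIO(data))
df_high = pd.read_csv(io.StringIO(data), float_precision="high")

# On affected inputs the two parsers can disagree in the last bit.
print((df_default["x"] == df_high["x"]).all())
```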
can u run some perf tests for csv to see how much that impacts things?
@jreback It shouldn't affect things, and the performance tests above show that the "high" precision is slightly faster than the current default, since switching the parser is the only thing that would change. Assuming I did the asv tests right, the csv benchmarks bear that out.
So I'll create a separate PR, OK?
yep sounds good |
also can likely just deprecate the option entirely now
Not sure we can deprecate because of the `round_trip` option.
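Assuming the option in question is `float_precision="round_trip"`: that mode delegates parsing to Python's own `float()`, behaviour that neither `xstrtod` nor `precise_xstrtod` guarantees, so a rough sketch of why it can't simply be dropped:

```python
import io
import pandas as pd

# A decimal expansion much longer than 17 significant digits.
text = "0.30661019938070954715669813595013692975044250488281250001"
data = f"x\n{text}\n"

# round_trip parses with the same algorithm as Python's float(), so the
# result matches float(text) by construction, whatever the input.
df = pd.read_csv(io.StringIO(data), float_precision="round_trip")
assert df["x"].iloc[0] == float(text)
```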
- `black pandas`
- `git diff upstream/master -u -- "*.py" | flake8 --diff`
This relates to a very old issue #8002 where the default precision for CSV files could create wrong answers in the last bit. @jreback was involved in the PR review for that, which set the default to not be the high-precision parser.

For `to_numeric()`, I switched it to use `precise_xstrtod` instead of `xstrtod` by default. Unknown whether there are performance implications for `to_numeric()`, although this comment #17154 (comment) from @jorisvandenbossche indicates that maybe we should consider switching to the higher precision parser by default in `read_csv()` anyway.

Open question as to whether performance analysis of this PR is needed beyond what is shown below, which seems to indicate that using `precise_xstrtod` is faster than `xstrtod`.

The other possibility is to use a keyword argument in `to_numeric` that would allow the higher precision parser.
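The same idea at the user level, as a minimal sketch (assuming `str()` of a float yields its shortest round-trip representation): with the `precise_xstrtod` default proposed here, `to_numeric` should reproduce every such value exactly, whereas the old `xstrtod` default could be off in the last bit for some inputs.

```python
import numpy as np
import pandas as pd

# String forms of random doubles, as in the benchmarks above.
vals = np.random.randn(10_000)
s = pd.Series([str(v) for v in vals], dtype="object")

converted = pd.to_numeric(s)

# Count values that fail to round-trip exactly; with the higher
# precision parser this should be zero.
bad = int((converted.to_numpy() != vals).sum())
print(f"{bad} of {len(vals)} values did not round-trip exactly")
```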