-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: Cythonize from_nested_dict
#33485
Conversation
I don’t think the OrderedDict is needed here anymore. It’s likely a leftover from when Pandas supported python < 3.6 when normal dicts weren’t ordered. Can you try to convert the use of OrderedDict to a normal dict? That will probably give some speedup on your Cython impl. |
ASV benchmarks:
Not sure if it's related to the change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments. I'm not super versed in Cython, but hopefully my comments ae correct.
pandas/core/frame.py
Outdated
@@ -1266,7 +1266,7 @@ def from_dict(cls, data, orient="columns", dtype=None, columns=None) -> "DataFra | |||
if len(data) > 0: | |||
# TODO speed up Series case | |||
if isinstance(list(data.values())[0], (Series, dict)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not your PR, but this line above is potentially very expensive. Can you change it to:
first_val = next(iter((data.values())), None)
if isinstance(first_val, (Series, dict)):
to avoid creating that list. Does this make a difference in your ASVs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! the ASV results are even better:
before after ratio
[c6c53671] [bcb25b91]
<master> <TODO-cythonize>
- 1.03±0.1ms 748±6μs 0.72 series_methods.NanOps.time_func('sum', 1000000, 'int8')
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
@@ -1266,7 +1266,7 @@ def from_dict(cls, data, orient="columns", dtype=None, columns=None) -> "DataFra | |||
if len(data) > 0: | |||
# TODO speed up Series case | |||
if isinstance(list(data.values())[0], (Series, dict)): | |||
data = _from_nested_dict(data) | |||
data = lib.from_nested_dict(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you change the above to
data = dict(data) if not type(data) is dict else data # convert OrderedDict
data = lib.from_nested_dict(data)
you can change the function interface in lib
to def from_nested_dict(dict data) -> dict:
, which will simplify things and make it faster too.
pandas/_libs/lib.pyx
Outdated
@cython.boundscheck(False) | ||
def from_nested_dict(object data) -> dict: | ||
cdef: | ||
object new_data = collections.defaultdict(dict) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you set new_data
to type dict new_data
and make related changes below (use dict.setdefault
etc.)? I think that should make fewer calls into Python, making this faster.
pandas/_libs/lib.pyx
Outdated
data_dct = dict(data) | ||
|
||
for index, dict_iterator in data_dct.items(): | ||
nested_dict = dict(dict_iterator) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If dict_iterator is a Series, this conversion will be slow. Probably best to just accept the user's choice IMO and accept a slowdown if dict_iterator is not a dict...
Agree I started removing our usage of OrderedDict in #30469 think I have a local branch with a bunch more fixed. Will submit soon. |
pandas/_libs/lib.pyx
Outdated
for index, dict_or_series in data.items(): | ||
for column, value in dict_or_series.items(): | ||
if column in new_data: | ||
new_data[column].update(dict([(index, value)])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dict([(index, value)])
seems like I am doing an unneeded round trip, but I could not found another way to do this, any ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just do the loop more explicit:
for index, dict_or_series in data.items():
for column, value in dict_or_series.items():
if column not in new_data:
new_data[column] = {index: value}
else:
new_data[column][index] = value
?
Aparty from that I got no more comments and LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I think a little strange to implement this in lib since it relies on frame methods that are implemented in frame.py - not sure if that's a blocker yet just thinking out loud
object index, column, value, dict_or_series | ||
dict new_data = {} | ||
|
||
for index, dict_or_series in data.items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you keep with using the defaultdict? Should be more performant as it is implemented already in C
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, one thing I do have a concern and it's the return type, if it's a defaultdict
, the return type must be an object
, unless we return it as a dict
, e.g
return dict(new_data)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make a difference whether typed as object or dict? I think both are just mapped to PyObject anyway so maybe doesn’t matter?
I don't really know where to put it, what you had in mind? |
this is a strange comment lib holds cython code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am not sure this is worth it given the increase in complexity
but will lllk soon
Yea to clarify my point is that the code moved is just really looping over |
@MomIsBestFriend if we cant find a performance bump here, better to just remove the TODO comment and move on. |
Closing. @MomIsBestFriend if you want to re-open to remove the comment let us know. |
I did try to cythonize this function, but the problem is that it also accepting a
collections.OrderedDict
, which cannot becdef
asdict
, but have tocdef
asobject
.I don't see much of a performance boost, I have ran the full benchmark suite and
asv
says that theBENCHMARKS NOT SIGNIFICANTLY CHANGED.
Maybe just better to remove the
TODO
note, thoughts?