Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Cythonize from_nested_dict #33485

Closed
wants to merge 15 commits into from

Conversation

ShaharNaveh
Copy link
Member

I did try to cythonize this function, but the problem is that it also accepting a collections.OrderedDict, which cannot be cdef as dict, but have to cdef as object.


I don't see much of a performance boost, I have ran the full benchmark suite and asv says that the BENCHMARKS NOT SIGNIFICANTLY CHANGED.

Maybe just better to remove the TODO note, thoughts?

@topper-123
Copy link
Contributor

I don’t think the OrderedDict is needed here anymore. It’s likely a leftover from when Pandas supported python < 3.6 when normal dicts weren’t ordered.

Can you try to convert the use of OrderedDict to a normal dict? That will probably give some speedup on your Cython impl.

@topper-123 topper-123 added Performance Memory or execution speed performance Constructors Series/DataFrame/Index/pd.array Constructors labels Apr 11, 2020
@topper-123 topper-123 added this to the 1.1 milestone Apr 11, 2020
@ShaharNaveh ShaharNaveh requested a review from topper-123 April 11, 2020 21:35
@ShaharNaveh
Copy link
Member Author

ASV benchmarks:

       before           after         ratio
     [c6c53671]       [94977b2b]
     <master>         <TODO-cythonize>
-        1.52±0ms      1.22±0.08ms     0.80  series_methods.NanOps.time_func('prod', 1000000, 'int8')
-     1.11±0.01ms         833±70μs     0.75  series_methods.NanOps.time_func('sum', 1000000, 'int8')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Not sure if it's related to the change.

Copy link
Contributor

@topper-123 topper-123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments. I'm not super versed in Cython, but hopefully my comments ae correct.

@@ -1266,7 +1266,7 @@ def from_dict(cls, data, orient="columns", dtype=None, columns=None) -> "DataFra
if len(data) > 0:
# TODO speed up Series case
if isinstance(list(data.values())[0], (Series, dict)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not your PR, but this line above is potentially very expensive. Can you change it to:

first_val = next(iter((data.values())), None)
if isinstance(first_val, (Series, dict)):

to avoid creating that list. Does this make a difference in your ASVs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! the ASV results are even better:

       before           after         ratio
     [c6c53671]       [bcb25b91]
     <master>         <TODO-cythonize>
-      1.03±0.1ms          748±6μs     0.72  series_methods.NanOps.time_func('sum', 1000000, 'int8')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@@ -1266,7 +1266,7 @@ def from_dict(cls, data, orient="columns", dtype=None, columns=None) -> "DataFra
if len(data) > 0:
# TODO speed up Series case
if isinstance(list(data.values())[0], (Series, dict)):
data = _from_nested_dict(data)
data = lib.from_nested_dict(data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you change the above to

data = dict(data) if not type(data) is dict else data  # convert OrderedDict
data = lib.from_nested_dict(data)

you can change the function interface in lib to def from_nested_dict(dict data) -> dict:, which will simplify things and make it faster too.

@cython.boundscheck(False)
def from_nested_dict(object data) -> dict:
cdef:
object new_data = collections.defaultdict(dict)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you set new_data to type dict new_data and make related changes below (use dict.setdefault etc.)? I think that should make fewer calls into Python, making this faster.

data_dct = dict(data)

for index, dict_iterator in data_dct.items():
nested_dict = dict(dict_iterator)
Copy link
Contributor

@topper-123 topper-123 Apr 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If dict_iterator is a Series, this conversion will be slow. Probably best to just accept the user's choice IMO and accept a slowdown if dict_iterator is not a dict...

@alimcmaster1
Copy link
Member

I don’t think the OrderedDict is needed here anymore. It’s likely a leftover from when Pandas supported python < 3.6 when normal dicts weren’t ordered.

Can you try to convert the use of OrderedDict to a normal dict? That will probably give some speedup on your Cython impl.

Agree I started removing our usage of OrderedDict in #30469 think I have a local branch with a bunch more fixed. Will submit soon.

for index, dict_or_series in data.items():
for column, value in dict_or_series.items():
if column in new_data:
new_data[column].update(dict([(index, value)]))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dict([(index, value)]) seems like I am doing an unneeded round trip, but I could not found another way to do this, any ideas?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just do the loop more explicit:

for index, dict_or_series in data.items():
    for column, value in dict_or_series.items():
    	if column not in new_data:
    		new_data[column] = {index: value}
    	else:
    		new_data[column][index] = value

?

Aparty from that I got no more comments and LGTM.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I think a little strange to implement this in lib since it relies on frame methods that are implemented in frame.py - not sure if that's a blocker yet just thinking out loud

object index, column, value, dict_or_series
dict new_data = {}

for index, dict_or_series in data.items():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you keep with using the defaultdict? Should be more performant as it is implemented already in C

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, one thing I do have a concern and it's the return type, if it's a defaultdict, the return type must be an object, unless we return it as a dict, e.g

return dict(new_data)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make a difference whether typed as object or dict? I think both are just mapped to PyObject anyway so maybe doesn’t matter?

@ShaharNaveh
Copy link
Member Author

Thanks. I think a little strange to implement this in lib since it relies on frame methods that are implemented in frame.py - not sure if that's a blocker yet just thinking out loud

I don't really know where to put it, what you had in mind?

@jreback
Copy link
Contributor

jreback commented Apr 13, 2020

Thanks. I think a little strange to implement this in lib since it relies on frame methods that are implemented in frame.py - not sure if that's a blocker yet just thinking out loud

this is a strange comment

lib holds cython code

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am not sure this is worth it given the increase in complexity

but will lllk soon

@WillAyd
Copy link
Member

WillAyd commented Apr 14, 2020

this is a strange comment

lib holds cython code

Yea to clarify my point is that the code moved is just really looping over df.items(). I think this just fragments the code without really offering an improvement (at least from latest ASV results posted)

@jbrockmendel
Copy link
Member

@MomIsBestFriend if we cant find a performance bump here, better to just remove the TODO comment and move on.

@jbrockmendel
Copy link
Member

Closing. @MomIsBestFriend if you want to re-open to remove the comment let us know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Constructors Series/DataFrame/Index/pd.array Constructors Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants