PERF: Cythonize `from_nested_dict` #33485

ShaharNaveh · 2020-04-11T18:00:50Z

I tried to cythonize this function, but the problem is that it also accepts a collections.OrderedDict, which cannot be declared with cdef as dict and has to be declared as object.

I don't see much of a performance boost. I ran the full benchmark suite and asv reports BENCHMARKS NOT SIGNIFICANTLY CHANGED.

Maybe it's better to just remove the TODO note. Thoughts?

topper-123 · 2020-04-11T19:06:00Z

I don’t think the OrderedDict is needed here anymore. It’s likely a leftover from when Pandas supported python < 3.6 when normal dicts weren’t ordered.

Can you try to convert the use of OrderedDict to a normal dict? That will probably give some speedup on your Cython impl.
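For readers following along, a minimal sketch of why the OrderedDict may no longer be needed: since Python 3.7 the language guarantees that a built-in dict preserves insertion order, so converting an OrderedDict to a plain dict should keep the keys in the same order. The names `ordered` and `plain` and the sample keys are illustrative only, not taken from pandas.

```python
from collections import OrderedDict

# Illustrative data; the keys stand in for hypothetical column names.
ordered = OrderedDict([("b", 1), ("a", 2), ("c", 3)])

# A plain dict built from it keeps the same insertion order on Python 3.7+.
plain = dict(ordered)

assert list(plain) == list(ordered) == ["b", "a", "c"]
assert type(plain) is dict
```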

xref: pandas-dev#33485 (comment)

ShaharNaveh · 2020-04-11T22:12:19Z

ASV benchmarks:

       before           after         ratio
     [c6c53671]       [94977b2b]
     <master>         <TODO-cythonize>
-        1.52±0ms      1.22±0.08ms     0.80  series_methods.NanOps.time_func('prod', 1000000, 'int8')
-     1.11±0.01ms         833±70μs     0.75  series_methods.NanOps.time_func('sum', 1000000, 'int8')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Not sure if it's related to the change.
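As a reading aid for the table above: the ratio column is the "after" time divided by the "before" time, so values below 1 mean the branch ran faster. The sketch below recomputes the two ratios from the rounded timings shown; `asv_ratio` is a hypothetical helper written for this note, not part of asv.

```python
def asv_ratio(before_ms: float, after_ms: float) -> float:
    """Return after/before, rounded to two decimals as in the table."""
    return round(after_ms / before_ms, 2)


# Timings copied from the table above, converted to milliseconds.
prod_ratio = asv_ratio(1.52, 1.22)   # 'prod' benchmark
sum_ratio = asv_ratio(1.11, 0.833)   # 'sum' benchmark (833 microseconds)

print(prod_ratio, sum_ratio)  # prints: 0.8 0.75
```

Both rows are series_methods.NanOps benchmarks, which by their names time Series reductions rather than DataFrame.from_dict, so the doubt about whether the change caused the speedup seems reasonable.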

topper-123

A few comments. I'm not super versed in Cython, but hopefully my comments are correct.

topper-123 · 2020-04-11T22:06:42Z

pandas/core/frame.py

@@ -1266,7 +1266,7 @@ def from_dict(cls, data, orient="columns", dtype=None, columns=None) -> "DataFra
            if len(data) > 0:
                # TODO speed up Series case
                if isinstance(list(data.values())[0], (Series, dict)):


Not your PR, but this line above is potentially very expensive. Can you change it to:

    first_val = next(iter(data.values()), None)
    if isinstance(first_val, (Series, dict)):

to avoid creating that list. Does this make a difference in your ASVs?
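A small sketch of the suggestion above, using only the standard library: `next(iter(...), None)` reads the first value lazily, whereas `list(data.values())[0]` copies every value into a new list first. The helper name `first_value` and the sample data are assumptions made for illustration.

```python
def first_value(data: dict):
    """Return the first value of ``data`` without copying all values into a list."""
    return next(iter(data.values()), None)


# Hypothetical nested input, shaped like the data passed to from_dict.
data = {"row1": {"a": 1}, "row2": {"a": 2}}

# Same result as the list-based lookup, without building the list.
assert first_value(data) == list(data.values())[0]

# The default of None avoids StopIteration on an empty mapping.
assert first_value({}) is None
```

In the diff above the call already sits under `if len(data) > 0:`, so the `None` default acts only as a safety net there.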

Nice! the ASV results are even better:

       before           after         ratio
     [c6c53671]       [bcb25b91]
     <master>         <TODO-cythonize>
-      1.03±0.1ms          748±6μs     0.72  series_methods.NanOps.time_func('sum', 1000000, 'int8')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

topper-123 · 2020-04-11T22:13:22Z

pandas/core/frame.py

@@ -1266,7 +1266,7 @@ def from_dict(cls, data, orient="columns", dtype=None, columns=None) -> "DataFra
            if len(data) > 0:
                # TODO speed up Series case
                if isinstance(list(data.values())[0], (Series, dict)):
-                    data = _from_nested_dict(data)
+                    data = lib.from_nested_dict(data)


If you change the above to

    data = dict(data) if not type(data) is dict else data  # convert OrderedDict
    data = lib.from_nested_dict(data)

you can change the function interface in lib to def from_nested_dict(dict data) -> dict:, which will simplify things and make it faster too.
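To illustrate why the suggestion checks the exact type rather than using isinstance: OrderedDict is a subclass of dict, so an isinstance check would let it through unchanged, while the opening comment in this thread reports that it cannot be declared as dict in Cython. The sketch below is plain Python; `as_plain_dict` is a hypothetical helper, and it spells the check as `type(data) is dict`, which is equivalent to the `not type(data) is dict` form above.

```python
from collections import OrderedDict


def as_plain_dict(data):
    """Return ``data`` unchanged if it is exactly a dict, else a shallow dict copy."""
    return data if type(data) is dict else dict(data)


ordered = OrderedDict(a=1, b=2)
assert isinstance(ordered, dict)   # the subclass check passes ...
assert type(ordered) is not dict   # ... but the exact type differs

plain = as_plain_dict(ordered)
assert type(plain) is dict and plain == {"a": 1, "b": 2}

# An exact dict is passed through without copying.
same = {"x": 1}
assert as_plain_dict(same) is same
```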

topper-123 · 2020-04-11T22:16:25Z

pandas/_libs/lib.pyx

+@cython.boundscheck(False)
+def from_nested_dict(object data) -> dict:
+    cdef:
+        object new_data = collections.defaultdict(dict)


Can you declare new_data as dict new_data and make the related changes below (use dict.setdefault etc.)? I think that should make fewer calls into Python, making this faster.
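A pure-Python sketch of the two approaches being compared, written only to show that they produce the same inverted mapping; it says nothing about their relative speed once compiled with Cython. The function names and the sample data are assumptions for illustration, not the code in the PR.

```python
from collections import defaultdict


def invert_with_defaultdict(data):
    """Swap the nesting {index: {column: value}} -> {column: {index: value}}."""
    new_data = defaultdict(dict)
    for index, row in data.items():
        for column, value in row.items():
            new_data[column][index] = value
    return new_data


def invert_with_setdefault(data):
    """Same transformation using a plain dict and dict.setdefault."""
    new_data = {}
    for index, row in data.items():
        for column, value in row.items():
            new_data.setdefault(column, {})[index] = value
    return new_data


data = {"r1": {"a": 1, "b": 2}, "r2": {"a": 3}}
expected = {"a": {"r1": 1, "r2": 3}, "b": {"r1": 2}}

assert invert_with_setdefault(data) == expected
assert invert_with_defaultdict(data) == expected
```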

topper-123 · 2020-04-11T22:19:39Z

pandas/_libs/lib.pyx

+    data_dct = dict(data)
+
+    for index, dict_iterator in data_dct.items():
+        nested_dict = dict(dict_iterator)


If dict_iterator is a Series, this conversion will be slow. Probably best to just accept the user's choice IMO and accept a slowdown if dict_iterator is not a dict...

ref: pandas-dev#33485 (comment)

alimcmaster1 · 2020-04-12T12:10:50Z

I don’t think the OrderedDict is needed here anymore. It’s likely a leftover from when Pandas supported python < 3.6 when normal dicts weren’t ordered.

Can you try to convert the use of OrderedDict to a normal dict? That will probably give some speedup on your Cython impl.

Agreed. I started removing our usage of OrderedDict in #30469, and I think I have a local branch with a bunch more fixed. Will submit soon.

xref: pandas-dev#33485 (comment)

ShaharNaveh · 2020-04-13T09:38:30Z

pandas/_libs/lib.pyx

+    for index, dict_or_series in data.items():
+        for column, value in dict_or_series.items():
+            if column in new_data:
+                new_data[column].update(dict([(index, value)]))


The dict([(index, value)]) looks like an unneeded round trip, but I could not find another way to do this. Any ideas?

Maybe just make the loop more explicit:

    for index, dict_or_series in data.items():
        for column, value in dict_or_series.items():
            if column not in new_data:
                new_data[column] = {index: value}
            else:
                new_data[column][index] = value

?

Apart from that, I have no more comments. LGTM.
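A short sketch of the round trip discussed in this exchange: updating with a freshly built one-item dict and assigning the key directly leave the inner mapping in the same state, so the temporary dict can be dropped. The variable names and values are illustrative only.

```python
# Hypothetical inner mapping for one column: {index: value}.
column_data = {"r1": 1}

# Original approach: build a temporary one-item dict, then merge it in.
via_update = dict(column_data)
via_update.update(dict([("r2", 3)]))

# Explicit approach: assign the key directly, with no temporary object.
via_assign = dict(column_data)
via_assign["r2"] = 3

assert via_update == via_assign == {"r1": 1, "r2": 3}
```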

xref: pandas-dev#33485 (comment)

WillAyd

Thanks. I think it's a little strange to implement this in lib since it relies on frame methods that are implemented in frame.py. Not sure if that's a blocker yet, just thinking out loud.

WillAyd · 2020-04-13T17:59:21Z

pandas/_libs/lib.pyx

+        object index, column, value, dict_or_series
+        dict new_data = {}
+
+    for index, dict_or_series in data.items():


Can you keep using the defaultdict? It should be more performant, as it is already implemented in C.

Sure. My one concern is the return type: if it's a defaultdict, the return type must be object, unless we convert it to a dict before returning, e.g.

    return dict(new_data)

Does it make a difference whether typed as object or dict? I think both are just mapped to PyObject anyway so maybe doesn’t matter?
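Setting the Cython typing question aside, the following plain-Python sketch shows one observable difference between returning the defaultdict and returning `dict(new_data)`: a defaultdict is a dict subclass that inserts a key on a missing lookup, while the converted plain dict raises KeyError. The sample keys are made up for illustration.

```python
from collections import defaultdict

new_data = defaultdict(dict)
new_data["a"]["r1"] = 1

# A defaultdict is a dict subclass, but not exactly a dict.
assert isinstance(new_data, dict)
assert type(new_data) is not dict

# Converting yields a plain dict with the same contents.
plain = dict(new_data)
assert type(plain) is dict and plain == {"a": {"r1": 1}}

# Looking up a missing key silently inserts it into the defaultdict ...
_ = new_data["missing"]
assert "missing" in new_data

# ... while the plain dict raises KeyError instead.
try:
    plain["missing"]
    raised = False
except KeyError:
    raised = True
assert raised
```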

ShaharNaveh · 2020-04-13T18:43:56Z

Thanks. I think it's a little strange to implement this in lib since it relies on frame methods that are implemented in frame.py. Not sure if that's a blocker yet, just thinking out loud.

I don't really know where to put it. What did you have in mind?

jreback · 2020-04-13T23:47:20Z

Thanks. I think it's a little strange to implement this in lib since it relies on frame methods that are implemented in frame.py. Not sure if that's a blocker yet, just thinking out loud.

this is a strange comment

lib holds cython code

jreback

i am not sure this is worth it given the increase in complexity

but will look soon

WillAyd · 2020-04-14T00:06:23Z

this is a strange comment

lib holds cython code

Yeah, to clarify: my point is that the code being moved is really just looping over df.items(). I think this just fragments the code without really offering an improvement (at least judging from the latest ASV results posted).

jbrockmendel · 2020-04-16T17:17:43Z

@MomIsBestFriend if we can't find a performance bump here, better to just remove the TODO comment and move on.

jbrockmendel · 2020-04-23T16:34:57Z

Closing. @MomIsBestFriend if you want to re-open to remove the comment let us know.

PERF: Cythonize from_nested_dict

6d44e55

topper-123 added the Performance (memory or execution speed performance) and Constructors (Series/DataFrame/Index/pd.array constructors) labels Apr 11, 2020

topper-123 added this to the 1.1 milestone Apr 11, 2020

MomIsBestFriend added 4 commits April 11, 2020 23:39

Merge remote-tracking branch 'upstream/master' into TODO-cythonize

78ae5ab

List issues

61c841d

Added wrappers

1c7ffbb

Converting the object dict inside the function

94977b2

xref: pandas-dev#33485 (comment)

ShaharNaveh requested a review from topper-123 April 11, 2020 21:35

topper-123 requested changes Apr 11, 2020

Avoiding expensive call

bcb25b9

ref: pandas-dev#33485 (comment)

MomIsBestFriend added 5 commits April 13, 2020 12:23

Merge remote-tracking branch 'upstream/master' into TODO-cythonize

b3ebae6

Converting the data to builtin dict, in the python space

7dfbca8

xref: pandas-dev#33485 (comment)

Going less offen to the python space

c8e515f

xref: pandas-dev#33485 (comment)

Got rid of unneeded vars

4829f78

Better perf if we have nested series

5faa02b

xref: pandas-dev#33485 (comment)

ShaharNaveh commented Apr 13, 2020

MomIsBestFriend added 4 commits April 13, 2020 13:09

Assigning the new value into a variable

d349b4f

Remove the if statement

4f8afed

Merge remote-tracking branch 'upstream/master' into TODO-cythonize

babaf64

Suggestion by @topper-123

e22fbda

xref: pandas-dev#33485 (comment)

WillAyd requested changes Apr 13, 2020

jreback requested changes Apr 13, 2020

jbrockmendel closed this Apr 23, 2020
