[FEA] preserve_order for apply_rows and merge #4997

lmeyerov · 2020-04-23T02:12:06Z

Is your feature request related to a problem? Please describe.

x.apply_rows and x.merge(how='left') do not preserve the order of x. This is often bug-inducing, and the workarounds cause performance drops. We'll often want to take an output col and append it to another df, and thus have to undo the damage. Presumably, it'd be way faster to do the op deeper within the lib.

The current merge sort option follows some really twisted sort semantics from pandas, but that has nothing to do with any of our actual uses across a lot of code, and at this point it appears to be just sucking up developer time for everyone.

Describe the solution you'd like

Add a kwarg preserve_order with default True. Speed demons can explicitly set preserve_order=False if they're OK with non-deterministic row order in their output.

Sorted:

x.merge(y, how='left', on='id')

Unsorted:

x.merge(y, how='left', on='id', preserve_order=False)

Same thing for apply_rows

Describe alternatives you've considered

Working around. Slow and bug-prone.
Something like sort_by, but harder to capture order-preserving.
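
For reference, the "working around" option above might look roughly like the following sketch. It is written against pandas so it runs anywhere; whether the same calls behave identically in cuDF is an assumption, and `merge_preserving_left_order` and the `_row` helper column are hypothetical names, not part of either library.

```python
import pandas as pd

def merge_preserving_left_order(left, right, **merge_kwargs):
    # Hypothetical helper: tag each left row with its original position.
    # `_row` is an assumed column name and must not clash with real columns.
    tagged = left.assign(_row=list(range(len(left))))
    merged = tagged.merge(right, **merge_kwargs)
    # mergesort is stable, so rows sharing a left position keep the
    # relative order the join produced for them.
    restored = merged.sort_values("_row", kind="mergesort")
    return restored.drop(columns="_row").reset_index(drop=True)

x = pd.DataFrame({"id": [2, 1, 1], "l1": ["2", "1a", "1b"]})
y = pd.DataFrame({"id": [0, 2, 1, 3], "l2": [0, 2, 1, 3]})

out = merge_preserving_left_order(x, y, how="left", on="id")
```

The extra column plus the post-merge sort is the overhead the request is trying to avoid by handling order inside the library.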


kkraus14 · 2020-04-23T02:59:02Z

We've had numerous conversations about sorting in joins and every time we've come to the conclusion that we do not want to sort by default. This will not change without strong community feedback from a large group of users, so if you require sorted output from joins I would suggest you pass sort=True. For additional context, Pandas does not sort by default either (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) but they have determinism due to the serial nature of their algorithm.

apply_rows already preserves order; if you have a reproducer that indicates it doesn't, please share it.

jrhemstad · 2020-04-23T03:48:07Z

See #1781

lmeyerov · 2020-04-23T04:20:25Z

re:merge, see my notes on pandas merge(sort=True) in the issue: it doesn't match any reasonable use case afaict, and definitely not what any programmer would tell you sort=True ought to do. The issue's resolution was asking the pandas people what the point of it was, afaict just drifting since then..

I'll keep an eye out for apply_rows; I ended up backing out the most recent code that had the issue again (0.13).

kkraus14 · 2020-04-23T04:25:13Z

> The issue's resolution was asking the pandas people what the point of it was, afaict just drifting since then..

That is not true... What was asked of the Pandas developers was what the expected behavior is when specifying both join column keys and index level(s), a combination for which Pandas seemed to have very strange behavior. The sorting has always consistently been on the join keys, in the order you specified.

Given there's no reproducer, and I don't see a way that the order is not being preserved with apply_rows, I'm going to close this issue. Will reopen if there's a reproducer for apply_rows.

lmeyerov · 2020-04-23T04:36:54Z

Just did a quick test for the outer case, confirming expected behavior is happening (at least in pd):

import pandas as pd

left = pd.DataFrame({
    'id': [2, 1, 1],
    'l1': [2, '1a', '1b']
})

right = pd.DataFrame({
    'id': [0, 2, 1, 3],
    'l2': [0, 2, 1, 3]
})

left.merge(right, how='outer', sort=True)[['id']]

=>

For some reason I thought it was doing => 1 2 0 3 (left in order, then right in order).
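
A minimal pandas-only sketch of the two behaviors discussed in this thread, reusing the frames above (with `l1` made all-string, which is a simplification added here): `sort=True` orders the result by the join key, while a plain left merge keeps the left frame's key order. This only demonstrates pandas; it makes no claim about cuDF's row order.

```python
import pandas as pd

left = pd.DataFrame({"id": [2, 1, 1], "l1": ["2", "1a", "1b"]})
right = pd.DataFrame({"id": [0, 2, 1, 3], "l2": [0, 2, 1, 3]})

# Outer join with sort=True: rows come back ordered by the join key.
sorted_ids = left.merge(right, how="outer", on="id", sort=True)["id"].tolist()
print(sorted_ids)  # [0, 1, 1, 2, 3]

# Left join without sort: pandas keeps the left frame's key order.
left_ids = left.merge(right, how="left", on="id")["id"].tolist()
print(left_ids)  # [2, 1, 1]
```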

lmeyerov added the "Needs Triage" (Need team to review and classify) and "feature request" (New feature or request) labels Apr 23, 2020

kkraus14 added the "Python" (Affects Python cuDF API.) label and removed the "Needs Triage" label Apr 23, 2020

kkraus14 closed this as completed Apr 23, 2020

philtrade mentioned this issue May 26, 2020

[BUG] cudf.DataFrame.drop_duplicates() alters original row order, pandas doesn't. #5286

Closed
