Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] preserve_order for apply_rows and merge #4997

Closed
lmeyerov opened this issue Apr 23, 2020 · 5 comments
Closed

[FEA] preserve_order for apply_rows and merge #4997

lmeyerov opened this issue Apr 23, 2020 · 5 comments
Labels
feature request New feature or request Python Affects Python cuDF API.

Comments

@lmeyerov
Copy link

lmeyerov commented Apr 23, 2020

Is your feature request related to a problem? Please describe.

x.apply_rows and x.merge(how=left) do not preserve the order of x. This is often bug incurring and workarounds cause performance drops. We'll often want to take an output col and append to another df, and thus have to undo the damage. Presumably, it'd be way faster to do the op deeper within the lib.

Current merge does sort some really twisted sort semantics from pandas, but it has nothing to do with any of our actual uses in a lot of code, and appears to be just sucking up developer time for everyone at this point.

Describe the solution you'd like

Add a kw arg preserve_order with default True. Speed demons can explicitly flip preserve_order=False if they're ok with adding non-determinism to their output.

Sorted:

x.merge(y, how='left', on='id')

Unsorted:

x.merge(y, how='left', on='id', preserve_order=False

Same thing for apply_rows

Describe alternatives you've considered

  • Working around. Slow and bug-prone.
  • Something like sort_by, but harder to capture order-preserving.
@lmeyerov lmeyerov added Needs Triage Need team to review and classify feature request New feature or request labels Apr 23, 2020
@kkraus14
Copy link
Collaborator

We've had numerous conversations about sorting in joins and every time we've come to the conclusion that we do not want to sort by default. This will not change without strong community feedback from a large group of users, so if you require sorted output from joins I would suggest you pass sort=True. For additional context, Pandas does not sort by default either (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) but they have determinism due to the serial nature of their algorithm.

apply_rows already preserves order where if you have a reproducer that indicates it doesn't please share it.

@kkraus14 kkraus14 added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Apr 23, 2020
@jrhemstad
Copy link
Contributor

See #1781

@lmeyerov
Copy link
Author

lmeyerov commented Apr 23, 2020

re:merge, see my notes on pandas merge(sort=True) in the issue: it doesn't match any reasonable use case afaict, and definitely not what any programmer would tell you sort=True ought to do. The issue's resolution was asking the pandas people what the point of it was, afaict just drifting since then..

I'll keep an eye open for apply_rows, ended up backing out the most recent code that had the issue again (0.13).

@kkraus14
Copy link
Collaborator

The issue's resolution was asking the pandas people what the point of it was, afaict just drifting since then..

That is not true... What was asked of the Pandas developers is what was the expected behavior of combining defining both join column keys as well as index level(s) which Pandas seemed to have very strange behavior for. The sorting has always consistently been on the join keys in the order you specified.

Given there's no reproducer and I don't see a way that the order is not being preserved with apply_rows I'm going to close this issue. Will reopen if there's a reproducer for apply_rows.

@lmeyerov
Copy link
Author

Just did a quick test for the outer case, confirming expected behavior is happening (at least in pd):

left = pd.DataFrame({
    'id': [2, 1,  1],
    'l1': [2, '1a', '1b']
})

right = pd.DataFrame({
    'id': [0, 2, 1, 3],
    'l2': [0, 2, 1, 3]
})

left.merge(right, how='outer', sort=True)[['id']]

=>

	id
0	0
1	1
2	1
3	2
4	3

Some reason I thought it was doing => 1 2 0 3 (left in-order then right in-order).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request Python Affects Python cuDF API.
Projects
None yet
Development

No branches or pull requests

3 participants