[BUG] cuDF inner join and iloc seemingly broken #13254

Closed
danieldemarch opened this issue Apr 30, 2023 · 7 comments
Labels
Python Affects Python cuDF API.

Comments

@danieldemarch

danieldemarch commented Apr 30, 2023

Assume ticks is a list of cuDF DataFrames. If it matters, these cuDF DataFrames were made by converting pandas DataFrames to cuDF.

dfa = ticks[i].copy()
dfb = ticks[j].copy()

dfc = dfa.join(dfb, how='inner', rsuffix="2").copy()

indy = int(len(dfc) * 0.65)

dummy = dfc.iloc[:indy].copy()

print(dummy.tail(5))

The .copy() calls are optional; I get the same behavior with or without them.

This behavior produces two errors. First, the dataframes are all sorted in chronological order before the inner join. Post-inner join, the sort is wrong. It's like halfway sorted but with random insertions - e.g. a bunch of 04/04/2023 timestamps in a row and then a random 04/03/2023, before back to 04/04/2023.

Second, the line "dummy = dfc.iloc[:indy].copy()" produces inconsistent results when I run the same block of code twice. It just doesn't work. It seems to cycle between 5-6 different outcomes if I repeatedly run the cell.

Sticking with vanilla pandas does not produce either of the above errors. It's purely the act of converting to cuDF DataFrames that produces this bug.

Very disappointed. This is the sort of thing unit testing is supposed to catch. If it's some bizarre setting on the cuDF DataFrame - no setting should produce this outcome.

@danieldemarch danieldemarch added Needs Triage Need team to review and classify bug Something isn't working labels Apr 30, 2023
@shwina
Contributor

shwina commented May 1, 2023

Thanks for reporting, @danieldemarch. Is this reproducible with any data or specific to a certain dataset?

@shwina
Contributor

shwina commented May 1, 2023

It's worth noting that by default, cuDF's merge/join operations do not guarantee output ordering or sortedness (https://docs.rapids.ai/api/cudf/legacy/api_docs/api/cudf.dataframe.merge).

Does using sort=True in the call to join help?
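
A minimal sketch of that suggestion follows. The two frames and their values are made up for illustration, integer index values stand in for timestamps, and the pandas fallback is an assumption added only so the snippet can run on a machine without a GPU.

```python
try:
    import cudf as xdf  # GPU DataFrame library discussed in this thread
except Exception:
    import pandas as xdf  # assumption: CPU fallback so the sketch runs without a GPU

# Hypothetical stand-ins for two of the `ticks` frames; the index plays the
# role of the timestamp and both frames share the column name "price".
dfa = xdf.DataFrame({"price": [1.0, 2.0, 3.0, 4.0]}, index=[1, 2, 3, 4])
dfb = xdf.DataFrame({"price": [40.0, 10.0, 30.0]}, index=[4, 1, 3])

# sort=True asks join to order the result by the join key (the index here)
# instead of leaving the row order unspecified.
dfc = dfa.join(dfb, how="inner", rsuffix="2", sort=True)

print(dfc)
```

With sort=True the joined index should come back in ascending order, so a later positional slice sees the same rows on every run.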

@danieldemarch
Author

Pardon - so by default, arrays will be essentially randomly reordered after a merge?

@shwina
Contributor

shwina commented May 1, 2023

Correct. Please see this section of our documentation that provides more details about result ordering: https://docs.rapids.ai/api/cudf/stable/user_guide/pandas-comparison/#result-ordering
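
One way to get a specific order back after a merge, sketched here under stated assumptions, is to record the row position in a helper column before merging and sort on it afterwards. The column name `row_order` and the sample data are hypothetical, and the pandas fallback is there only so the snippet runs without a GPU.

```python
try:
    import cudf as xdf  # GPU DataFrame library discussed in this thread
except Exception:
    import pandas as xdf  # assumption: CPU fallback so the sketch runs without a GPU

left = xdf.DataFrame({"key": [3, 1, 2, 1], "a": [30, 10, 20, 11]})
right = xdf.DataFrame({"key": [1, 2, 3], "b": [100, 200, 300]})

# Record the original row position of `left` before merging
# (`row_order` is a hypothetical helper column).
left["row_order"] = list(range(len(left)))

merged = left.merge(right, on="key", how="inner")

# Restore the left-hand order explicitly rather than relying on
# whatever order the merge happens to return.
merged = merged.sort_values("row_order").reset_index(drop=True)

print(merged)
```

The helper column can be dropped once the rows are back in place.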

@shwina shwina added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels May 1, 2023
@vyasr
Contributor

vyasr commented May 1, 2023

There has also been a lot of discussion on this in #1781 (and #5286, although that's referring to a different function but the same sorting-related behavior).

@danieldemarch
Author

Thanks much!

@wence- wence- added not a bug and removed bug Something isn't working labels May 2, 2023
@wence-
Contributor

wence- commented May 2, 2023

I'll close this in favour of the broader discussions, but just to summarise: this is "working as expected"; it's just that the expectations are a bit different.

If there are cases where you would expect to be able to recreate the original order from the index but can't, please do open issues, since those would be actual bugs.
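
For the positional slice in the original report, a hedged sketch of recreating the order from the index before slicing is shown below. The frame stands in for an unsorted join result with made-up values, and the pandas fallback is an assumption so the snippet runs without a GPU.

```python
try:
    import cudf as xdf  # GPU DataFrame library discussed in this thread
except Exception:
    import pandas as xdf  # assumption: CPU fallback so the sketch runs without a GPU

# Hypothetical join result whose rows came back in an arbitrary order.
dfc = xdf.DataFrame(
    {"price": [4.0, 1.0, 3.0, 2.0], "price2": [40.0, 10.0, 30.0, 20.0]},
    index=[4, 1, 3, 2],
)

# Put the rows back into index order so a positional slice is well defined.
dfc = dfc.sort_index()

indy = int(len(dfc) * 0.65)  # 65% split point, as in the original report
dummy = dfc.iloc[:indy]

print(dummy.tail(5))
```

Because iloc selects by position, slicing only gives repeatable results once the row order itself is fixed.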

@wence- wence- closed this as completed May 2, 2023
rapids-bot bot pushed a commit that referenced this issue Apr 12, 2024
I noticed when answering #13254 that the code example in this section of our documentation was incorrect and the text itself could use some improving.

Authors:
  - Ashwin Srinath (https://github.com/shwina)
  - Lawrence Mitchell (https://github.com/wence-)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #13255