[BUG] cuDF inner join and iloc seemingly broken #13254
Comments
Thanks for reporting, @danieldemarch. Is this reproducible with any data or specific to a certain dataset?
It's worth noting that by default, cuDF's merge/join operations do not guarantee output ordering or sortedness (https://docs.rapids.ai/api/cudf/legacy/api_docs/api/cudf.dataframe.merge). Does using
Pardon - so by default, arrays will be essentially randomly reordered after a merge?
Correct. Please see this section of our documentation that provides more details about result ordering: https://docs.rapids.ai/api/cudf/stable/user_guide/pandas-comparison/#result-ordering
Thanks much!
I'll close this in favour of the broader discussions, but just to summarise: this is "working as expected", it's just that the expectations are a bit different. If there are cases where you might expect to be able to recreate the original order from the index but can't, please do open issues, since those would be actual bugs.
I noticed when answering #13254 that the code example in this section of our documentation was incorrect and the text itself could use some improving. Authors: - Ashwin Srinath (https://github.com/shwina) - Lawrence Mitchell (https://github.com/wence-) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #13255
Assume ticks is a list of cuDF DataFrames. If it matters, these cuDF DataFrames were made by converting pandas DataFrames to cuDF.
The .copy() calls are optional; I get the same behavior with or without them.
This behavior produces two errors. First, the dataframes are all sorted in chronological order before the inner join, but after the inner join the order is wrong. It's like halfway sorted but with random insertions, e.g. a bunch of 04/04/2023 timestamps in a row and then a random 04/03/2023, before going back to 04/04/2023.
Second, the "dummy = dfc.iloc[:indy].copy()" produces inconsistent results when running the same block of code twice. It just doesn't work. It seems to cycle between 5-6 different outcomes if I repeatedly run the cell.
Sticking with vanilla pandas does not produce either of the above errors. It's purely the act of converting to cuDF DataFrames that produces this bug.
Very disappointed. This is the sort of thing unit testing is supposed to catch. If it's some bizarre setting on the cuDF DataFrame - no setting should produce this outcome.
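On the second point, the changing `iloc` results are consistent with the ordering explanation given in the comments: a positional slice such as `iloc[:indy]` picks whichever rows happen to come first, so if the merged row order differs between runs, the slice differs too. A possible workaround, sketched below under assumptions, is to sort by the key before slicing. The helper name `head_in_key_order`, the column names, and the data are hypothetical and not taken from the original report; the import falls back to pandas when cuDF is unavailable.

```python
try:
    import cudf as xdf  # GPU path, used when cuDF is installed
except ImportError:
    import pandas as xdf  # CPU fallback; the calls below exist in both libraries


def head_in_key_order(df, key, n):
    """Hypothetical helper: return the first n rows of df in ascending order of key."""
    ordered = df.sort_values(key).reset_index(drop=True)
    return ordered.iloc[:n]


# Hypothetical example data; "ts" stands in for a timestamp key.
a = xdf.DataFrame({"ts": [3, 1, 2, 4], "x": [30, 10, 20, 40]})
b = xdf.DataFrame({"ts": [1, 2, 3, 4], "y": [1, 2, 3, 4]})

joined = a.merge(b, on="ts", how="inner")

# The positional slice is taken only after an explicit sort, so repeated runs
# select the same rows regardless of the order the join produced.
dummy = head_in_key_order(joined, "ts", 2)
print(dummy)
```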