Intermittent out-of-memory performing join after 0.20.6 update #14201
Comments
I experience the same, with lazy execution.
I have also seen an endless running process. Polars 0.20.5 is fine. I have tried to make a minimal example, but was not able to reproduce it with something simple.
My colleagues and I experience similar issues. Lazy reads (parquet) with filters and selects seemed to read much more data into memory after the upgrade than before. Some of us had success after downgrading.
I might add that I am not sure whether it is the join or the read_csv functions that are slow.
I'm seeing out-of-memory crashes also. @david-waterworth are you using the conda-forge build of polars?
@antonl no, I'm using PyPI (via poetry).
Can you put some numbers and specifics behind this?

- How long does it usually take? How long does it take in the bad cases?
- Does it get stuck in an endless loop and never stop, or is it just slower than you'd like?
- What is the CPU activity like in the good and bad cases?
- What is the memory usage in the good and bad cases (for example, by looking at htop or the Windows task manager)?
- You mention you have nested types: are you talking about just structs, or structs with lists? lists with structs?
- Do you make any mutations to your key variable before the join (if so, how many and of what sort)?
- If you cast to categorical before the join, do you still have the issue? (A sketch of that cast follows this list.)
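For illustration, a cast-to-categorical join might look roughly like the sketch below; the frames and column names are hypothetical, and the `pl.StringCache()` context keeps the categorical encodings compatible across both sides.

```python
import polars as pl

with pl.StringCache():  # keep categorical encodings consistent across frames
    left = pl.DataFrame({"key": ["a", "b", "c"], "v": [1, 2, 3]})
    right = pl.DataFrame({"key": ["a", "b"], "w": [10, 20]})

    out = left.with_columns(pl.col("key").cast(pl.Categorical)).join(
        right.with_columns(pl.col("key").cast(pl.Categorical)),
        on="key",
        how="left",
    )
print(out)
```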
For reference, I can run this standalone simulation with no issue.
@deanm0000 yes, I can also run a similar stand-alone example with the exact same data that causes the issue in my script (i.e. I added a breakpoint and saved the two DataFrames as parquet, then loaded them in a notebook). If I run with … If I run with …

I have multiple struct fields, plus list[str], list[int] and list[struct] columns. There are several steps, as I'm migrating a bunch of manually run notebooks, and I more or less replicated the original steps, so at this point it's not overly optimised or refactored.

First I construct the main dataframe from a list (~90k elements) of dicts, roughly as sketched below.
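A minimal sketch of that construction step; the record contents are hypothetical stand-ins for the real dicts, which carry struct and list fields.

```python
import polars as pl

# Hypothetical records standing in for the real ~90k dicts.
records = [
    {"key": "a", "meta": {"source": "s1"}, "ids": [1, 2]},
    {"key": "b", "meta": {"source": "s2"}, "ids": [3]},
]
df = pl.from_dicts(records)  # infers struct and list dtypes from the nesting
print(df.schema)
```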
Then I add a rank. I wasn't sure how to do this as a single operation, so I generated the rank over a subset and joined it back (the filter is also over a list[int] column); a rough sketch follows.
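A guess at the shape of that step, with hypothetical column names (`key`, `ids`, `score`); the original code was not preserved.

```python
import polars as pl

df = pl.DataFrame({
    "key": ["a", "b", "c"],
    "ids": [[1], [], [2, 3]],
    "score": [0.3, 0.1, 0.9],
})

# Rank only the rows that pass a filter over a list[int] column, then join
# the rank back onto the full frame by key.
ranked = (
    df.filter(pl.col("ids").list.len() > 0)
    .with_columns(pl.col("score").rank("dense").alias("rank"))
    .select("key", "rank")
)
df = df.join(ranked, on="key", how="left")
print(df)
```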
Then there's a series of other joins, which is where the issue usually surfaces, i.e. repeated lookup joins along the lines of the sketch below.
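Hypothetically, those repeated joins have roughly this shape (the lookup frames and names are invented for illustration):

```python
import polars as pl

df = pl.DataFrame({"key": ["a", "b", "c"]})
lookup1 = pl.DataFrame({"key": ["a", "b"], "col1": [1, 2]})
lookup2 = pl.DataFrame({"key": ["b", "c"], "col2": ["x", "y"]})

# Each join appends one looked-up column; the OOM surfaced at random in one
# of these steps.
df = df.join(lookup1, on="key", how="left")
df = df.join(lookup2, on="key", how="left")
print(df)
```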
The only mutation of the key variable, I think, is a rename, sketched below (which I assume isn't technically a mutation?). But one of the joins that has definitely failed does a …
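That rename would be something like the following (names hypothetical):

```python
import polars as pl

df = pl.DataFrame({"old_key": ["a", "b"], "v": [1, 2]})
df = df.rename({"old_key": "key"})  # the only "mutation" of the join key
```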
If there's any tracing I can run, I'll create a …
Could you try your joins with your strings cast to categorical before joining? That will strengthen the theory that the new string implementation is the cause.
Yeah, I think your theory is correct. I reproduced the issue in my project's 0.20.6 branch. Then I ran again with the key cast to categorical and could no longer reproduce it.

To be 100% sure, I ran the original code and broke on the actual line where it was crashing, then performed the join in the debug window using the categorical cast. So it seems fairly likely that the new string implementation is the cause.
We had to partially pin the production pipeline to an earlier version.
Can someone please get a reproduction? How should I fix it if I cannot cause it? @david-waterworth can you send me your script perhaps (with the parts that don't influence this behavior removed)? I really want to fix this, so help is greatly appreciated. @detrin do you have something that we can use to reproduce this?
I have a relatively free Sunday; I will be glad to prepare a minimal example.
@ritchie46 I don't know how to profile it well; from Python I am seeing too much variance to confirm the issue. I know for a fact that at this point of the script I was getting an OOM. Here I extracted for you two intermediate tables that have 1000 lines for the sake of an example; in production they have millions of rows.

```python
import polars as pl
from memory_profiler import memory_usage

program = pl.read_csv("https://eu2.contabostorage.com/62824c32198b4d53a08054da7a8b4df1:polarsissue14201/program_example.csv")
source = pl.read_csv("https://eu2.contabostorage.com/62824c32198b4d53a08054da7a8b4df1:polarsissue14201/source_example.csv")

def main():
    program_csfd = program.lazy().join(
        source.lazy(), how="left", left_on="program_title_norm", right_on="title_norm"
    ).collect(streaming=True)

if __name__ == "__main__":
    mem_usage = memory_usage(main)
    # print(f"polars version: {pl.__version__}")
    print("Maximum memory usage: %s MB" % max(mem_usage))
```
Have you got some more info? E.g. what you did up to this point? And I likely need more rows; 1000 rows is nothing. (I cannot concatenate that into a large dataset, as that will lead to duplicates which will explode the join output.)
I can't provide the whole tables. I could perhaps reduce the number of columns and mask the titles so that I would still see the increase in RAM used, but that requires a lot of trial-and-error attempts, and I'm afraid I don't have that kind of time today or in the coming days. However, I might prepare a bigger reproducible example during a workday; it will most likely be sometime next week.
Thanks @detrin. If I can reproduce, I can fix it. So I will wait until then.
@ritchie46 Turns out mine is not related to a …

https://storage.googleapis.com/endless-dialect-336512-ew6/14201.parquet

The last line causes it to go out of memory with 0.20.6, but not with <0.20.6.
Thank you @jsarbach. I can do something with this. :)
Hi. I found some time to look into this, and found that read_csv behaves a little differently between 0.20.5 and 0.20.6. I had a csv input with every second line empty. In 0.20.5 the empty lines are just dropped, but in 0.20.6 they are included in the data frame as rows of all nulls.

This gave me a follow-up problem when I used rle_id() on a column: it gives a group for each row, since the rows alternate value, null, value, null and so on, so there are never two consecutive rows with the same value. When I later group_by on my rle_id column, I of course don't get any grouping. Finally, I add a Box to a plotly figure for each group. In principle it is then the plotting of the figure that hangs the system, as I add a huge number of boxes, making it extremely slow to draw. I hope that makes some sense.

A small example of the changed reading is sketched below.
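This is a reconstruction, not the original snippet (which, along with its printed outputs, was lost); the CSV content is hypothetical.

```python
import io
import polars as pl

# A CSV where every second line is empty.
csv = "a,b\n1,x\n\n2,y\n\n3,z\n"
df = pl.read_csv(io.StringIO(csv))
print(df)
# Per the report above: 0.20.5 drops the empty lines entirely, while 0.20.6
# includes them as all-null rows.
```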
Best regards, Kristian
@hagsted this is unrelated to the issue, and it happens because a bug was fixed: whitespace belongs to the value and is no longer ignored in the value. You can argue that in this case the schema inference is incorrect, as these should all be string columns. Can you open a new issue for that? Then we can keep this issue on topic.
Thanks @ritchie46, this seems to have fixed the issue on the dataset where I originally observed the problem!
I think I found a reproducible example: #15615 |
Checks
Reproducible example
I've not found a reliable way of reproducing.
Log output
No response
Issue description
After upgrading to 0.20.6 I started experiencing high memory/CPU usage when using the `join` operator. I have a large script, with several steps, that I'm in the process of migrating from a jupyter/pandas implementation, so I've not been able to reliably reproduce. The issue occurs randomly (i.e. on different lines in the script), but the general symptoms are:

- The `left` table has ~90k rows x 15 columns. The `right` table is very small (i.e. 500 rows, 2 columns) and is a lookup (i.e. match on a key column and return a value), so the code is simply a left join on the key (see the sketch after this list). The "key" column is `pl.String`; both left and right tables contain nested types.
- This type of join is repeated multiple times (basically appending columns by performing lookups) and usually takes <1s to execute. But randomly one of them will start using excessive CPU, and memory will start increasing linearly until my swap is exhausted. Which operation fails seems random.
- I tried saving the two tables to parquet to reproduce, but no luck; it only happens when I run my script end to end (and not reliably in the same place).
- I downgraded to 0.20.5 and the issue hasn't re-occurred.
- It's happening on 2 machines (my workstation and in a docker container on AWS).
- I'm not using lazy execution; that's on my todo list for after I've finished migrating the original code.

I'm not really sure what else I can do to help?
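A minimal sketch of the lookup-style join described above; the column names are illustrative, not from the original script.

```python
import polars as pl

# `left` stands in for the large frame (~90k rows, nested columns); `right`
# is a small key -> value lookup.
left = pl.DataFrame({
    "key": ["a", "b", "c"],
    "tags": [["x"], ["y", "z"], []],  # nested list[str] column
})
right = pl.DataFrame({"key": ["a", "b"], "value": [1, 2]})

# Append the looked-up column via a left join on the string key.
out = left.join(right, on="key", how="left")
print(out)
```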
Expected behavior
Stable memory usage
Installed versions