
Intermittent out-of-memory performing join after 0.20.6 update #14201

Closed
2 tasks done
david-waterworth opened this issue Feb 1, 2024 · 24 comments · Fixed by #14264
Labels
  • A-dtype (Area: data types in general)
  • accepted (Ready for implementation)
  • bug (Something isn't working)
  • P-high (Priority: high)
  • performance (Performance issues or improvements)
  • python (Related to Python Polars)
  • regression (Issue introduced by a new release)

Comments

@david-waterworth

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

I've not found a reliable way of reproducing.

Log output

No response

Issue description

After upgrading to 0.20.6 I started experiencing high memory/CPU usage when using the join operator. I have a large script with several steps that I'm in the process of migrating from a Jupyter/pandas implementation, so I've not been able to reproduce this reliably. The issue occurs randomly (i.e. on different lines in the script), but the general symptoms are:

The left table has ~90k rows x 15 columns. The right table is very small (i.e. 500 rows, 2 columns) and is a lookup (i.e. match on a key column and return a value). So the code is simply:

df = left.join(right, on="key", how="left")

The "key" column is pl.String, both left and right tables contain nested types.

This type of join is repeated multiple times (basically appending columns by performing lookups) and usually takes <1s to execute. But randomly one of them will start using excessive CPU, and memory will increase linearly until my swap is exhausted. Which operation fails appears to be random.
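For illustration, a minimal sketch of the repeated lookup-join pattern described above (the data and column names are invented, not the actual frames from this issue):

import polars as pl

# Hypothetical data: a wide left frame and several small two-column lookup
# frames joined one after another on a string key.
left = pl.DataFrame({
    "key": [f"id_{i % 500}" for i in range(90_000)],
    "value": list(range(90_000)),
})
lookups = {
    name: pl.DataFrame({"key": [f"id_{i}" for i in range(500)], name: list(range(500))})
    for name in ("type", "owner", "rank")
}
for name, right in lookups.items():
    # Each join only appends a single column; per the report, on 0.20.6 one of
    # these joins would intermittently start consuming CPU and memory.
    left = left.join(right, on="key", how="left")

print(left.shape)  # (90000, 5)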

I tried saving the two tables to parquet to reproduce, but no luck; it only happens when I run my script end to end (and not reliably in the same place).

I downgraded to 0.20.5 and the issue hasn't re-occurred.

It's happening on 2 machines (my workstation and in a docker container on AWS)

I'm not using lazy execution; that's on my todo list after I've finished migrating the original code.
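For reference, a minimal sketch (made-up frames, not the script from this issue) of what the same lookup join would look like with lazy execution:

import polars as pl

# Hypothetical frames; nothing is executed until .collect() is called.
left = pl.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pl.DataFrame({"key": ["a", "c"], "y": [10.0, 30.0]})

df = left.lazy().join(right.lazy(), on="key", how="left").collect()
print(df)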

I'm not really sure what else I can do to help.

Expected behavior

Stable memory usage

Installed versions

Replace this line with the output of pl.show_versions(). Leave the backticks in place.
david-waterworth added the bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer) and python (Related to Python Polars) labels on Feb 1, 2024
@jsarbach
Contributor

jsarbach commented Feb 2, 2024

I experience the same, with lazy execution.

@hagsted

hagsted commented Feb 2, 2024

I have also seen an endlessly running process. Polars 0.20.5 is fine. I have tried to make a minimal example, but was not able to reproduce it with something simple.

@coinflip112
Contributor

My colleagues and I experience similar issues. Lazy reads (parquet) with filters and selects seemed to read much more data into memory after the upgrade than before. Some of us were fine after downgrading to 0.20.5; some of us had to go down to 0.20.3.
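For context, a minimal sketch (the file name and columns are invented) of the lazy parquet pattern described above; with predicate and projection pushdown only the selected columns and matching rows should be materialised:

import polars as pl

# Hypothetical file and columns, illustrating the lazy read + filter + select pattern.
lf = (
    pl.scan_parquet("events.parquet")    # lazy scan, nothing is read yet
    .filter(pl.col("country") == "CH")   # predicate pushdown
    .select("user_id", "amount")         # projection pushdown
)
df = lf.collect()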

@hagsted

hagsted commented Feb 2, 2024

I might add that I am not sure whether it is the join or the read_csv functions that are slow.

@antonl

antonl commented Feb 2, 2024

I'm seeing out of memory crashes also. @david-waterworth are you using the conda-forge build of polars?

@david-waterworth
Author

@antonl no I'm using PyPI (via poetry).

@deanm0000
Collaborator

Can you put some numbers and specifics behind this?

How long does it usually take?

How long does it take in the bad cases? Does it get stuck in an endless loop and never stop, or is it just slower than you'd like?

What is the CPU activity like in the good and bad cases?

What is the memory usage in the good and bad cases (for example, by looking at htop or the Windows Task Manager)?

You mention you have nested types: are you talking about just structs, or structs with lists? Lists with structs?

Do you make any mutations to your key variable before the join (if so, how many and of what sort)?

If you cast to categorical before the join, do you still have the issue:

with pl.StringCache():
    left = left.with_columns(pl.col('key').cast(pl.Categorical))
    right = right.with_columns(pl.col('key').cast(pl.Categorical))
df = left.join(right, on="key", how="left")

For reference I can run this standalone simulation with no issue.

import polars as pl
import numpy as np


def gen_long_string(str_len, n_rows):
    rng = np.random.default_rng()
    return rng.integers(low=96, high=122, size=n_rows * str_len, dtype="uint32").view(
        f"U{str_len}"
    )

left_n=90_000
left = pl.DataFrame({
    'key':gen_long_string(20,int(left_n/5)).tolist()*5,
    **{str(x):np.random.normal(0,1,int(left_n/5)*5) for x in range(1,16)}  
})

right_n=500
right = pl.DataFrame({
    'key':left.sample(right_n).get_column('key').to_list(),
    'zz':np.random.normal(0,1,right_n)
})

left.join(right, on="key", how="left")

@david-waterworth
Author

david-waterworth commented Feb 3, 2024

@deanm0000 yes I can also run a similar stand-alone example with the exact same data that causes the issue in my script (i.e. I added a breakpoint and saved the 2 df's as parquet then loaded them in a notebook).

If I run with v0.20.5 it takes < 0.5s and the CPU/memory usage is negligible.

If I run with v0.20.6 it never completes as it crashes my machine. This takes a few minutes: initially all my cores spin up to 100% for maybe 30 seconds or so, then the CPU usage drops to one core, but the memory starts ramping up and eventually it fails.

I have multiple struct fields, list[str] and list[int] and list[struct].

There's several steps as I'm migrating a bunch of manually run notebooks, and I more or less replicated the original steps, so at this point it's not overly optimised or refactored.

First I construct the main dataframe from a list (~90k elements) of dicts:

tasks_df = pl.from_dicts(tasks, schema=schema)

Then I add a rank. I wasn't sure how to do this as a single operation, so I generated the rank over a subset and joined it back (the filter is also over a list[int] column):

masked_rank = (
    tasks_df.filter(pl.col('missing_metadata').is_null())
    .with_columns(rank=pl.col('priority').rank(method='ordinal').over('rule_name', 'primary_equipment'))
    .select("row_id", "rank")
)
tasks_df = tasks_df.join(masked_rank, on='row_id', how='left')

Then there's a series of other joins, which is where the issue usually surfaces.

i.e.

tasks_df = tasks_df.join(types_df, on="template", how="left")

The only mutation of the key variable I think is a rename (which I assume isn't technically a mutation?). But one of the joins that has definitely failed does a group_by on the key variable of the left table before the join (this is supposed to create one row per key in the right table, converting the values to a list[str]):

primary_equipment_types_df = (
    templates_df.join(template_metadata_types_df, on="template_id")
    .group_by("template_name")
    .agg(primary_equipment_types=pl.col("type_code"))
    .rename({"template_name": "template"})
)
tasks_df = tasks_df.join(primary_equipment_types_df, on="template", how="left")

If there's any tracing I can run, I'll create a 0.20.6 branch next week if you need me to run anything.

@deanm0000
Collaborator

Could you try your joins with your strings cast to categorical before joining? That will strengthen the theory that the new string implementation is the cause.

@david-waterworth
Author

david-waterworth commented Feb 3, 2024

@deanm0000

Yeah I think your theory is correct. I reproduced the issue in my project's 0.20.6 branch. Then I ran again with the key cast to pl.Categorical and it ran fine.

To be 100% sure I ran the original code and broke on the actual line where it was crashing, performed the join in the debug window using pl.Categorical and it worked fine, then ran the pl.String version and it crashed.

So it seems fairly likely that the str refactor has introduced this.

deanm0000 added the performance (Performance issues or improvements), regression (Issue introduced by a new release), P-high (Priority: high) and A-dtype (Area: data types in general) labels and removed the needs triage (Awaiting prioritization by a maintainer) label on Feb 3, 2024
github-project-automation bot moved this to Ready in Backlog on Feb 3, 2024
@detrin

detrin commented Feb 3, 2024

We had to pin the production pipeline partially to 0.20.5. We look forward to seeing the fix.

@ritchie46
Member

ritchie46 commented Feb 4, 2024

Can someone please get a reproduction? How should I fix it if I cannot cause it?

@david-waterworth can you send me your script perhaps (and remove the parts that don't influence this behavior)?

I really want to fix this, so help is greatly appreciated.

@detrin do you have something that we can use to reproduce this?

@detrin

detrin commented Feb 4, 2024

I have a relatively free Sunday; I will be glad to prepare a minimal example.

@detrin

detrin commented Feb 4, 2024

@ritchie46 I don't know how to profile it well; from Python I am getting too much variance to confirm the issue. I know for a fact that at this point of the script I was getting an OOM. Here I extracted for you two intermediate tables that have 1000 lines for the sake of an example; in production they have millions of rows.

import polars as pl
from memory_profiler import memory_usage  

program = pl.read_csv("https://eu2.contabostorage.com/62824c32198b4d53a08054da7a8b4df1:polarsissue14201/program_example.csv")
source = pl.read_csv("https://eu2.contabostorage.com/62824c32198b4d53a08054da7a8b4df1:polarsissue14201/source_example.csv")

def main(): 
    program_csfd = program.lazy().join(
        source.lazy(), how="left", left_on="program_title_norm", right_on="title_norm"
    ).collect(streaming=True)

if __name__ == '__main__':
    mem_usage = memory_usage(main)  
    # print(f"polars version: {pl.__version__}")
    print('Maximum memory usage: %s MB' % max(mem_usage))  

@ritchie46
Member

Have you got some more info? E.g. what did you do up to this point? And I likely need more rows; 1000 rows is nothing. (I cannot concatenate that into a large dataset, as that will lead to duplicates which will explode the join output.)

@ritchie46
Member

@hagsted @jsarbach @antonl do any of you have a reproducible example?

@detrin

detrin commented Feb 4, 2024

Have you got some more info? E.g. what did you do up to this point? And I likely need more rows; 1000 rows is nothing. (I cannot concatenate that into a large dataset, as that will lead to duplicates which will explode the join output.)

I can't provide the whole tables. I could perhaps reduce the number of columns and mask the titles so that I would still see the increase in RAM used, but that requires a lot of trial-and-error attempts and I am afraid I don't have that kind of time today or in the following days. However, I might prepare a bigger reproducible example during a workday. It will most likely be during this coming week.

@ritchie46
Member

Thanks @detrin. If I can reproduce, I can fix it. So I will wait until then.

@jsarbach
Contributor

jsarbach commented Feb 4, 2024

@ritchie46 Turns out mine is not related to a join at all. Instead, it seems to come from list.eval.

https://storage.googleapis.com/endless-dialect-336512-ew6/14201.parquet

import polars as pl

df = pl.read_parquet('14201.parquet')
df = (
    df.filter(pl.col('id_numeric').is_not_null() & pl.col('Team').is_not_null())
    .group_by(by=['id_numeric', 'Team'])
    .agg(pl.col('Items'))
)

df.with_columns(pl.col('Items').list.eval(pl.element().drop_nulls()))

The last line causes it to go out of memory with 0.20.6, but not with <0.20.6.

@ritchie46
Member

Thank you @jsarbach. I can do something with this. :)

@hagsted

hagsted commented Feb 4, 2024

Hi. I found some time to look into this and found that read_csv behaves a little differently between 0.20.5 and 0.20.6.

I had a CSV input with every second line empty. In 0.20.5 the empty lines are just dropped, but in 0.20.6 they are included in the data frame as rows of all nulls. This gave me a follow-up problem when I used rle_id() on a column: it gives a group for each row, as the values alternate value, null, value, null and so on, so there are never two consecutive rows with the same value.

I then later try to group_by on my rle_id column and will of course not get any grouping. Finally I add a Box to a plotly figure for each group. In principle it is then my plotting of the figure that hangs the system, as I add a large number of boxes to the figure, making it extremely slow to draw.

I hope it makes some sense.
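For illustration, a minimal sketch (not from the original comment) of the rle_id() behaviour described above: with every second row null, each run has length 1, so every row gets its own id and a later group_by produces one group per row.

import polars as pl

# Interleaved nulls: every value change starts a new run, so the ids are 0..5.
df = pl.DataFrame({"state": [1, None, 1, None, 2, None]})
print(df.with_columns(run=pl.col("state").rle_id()))

# Without the interleaved nulls, consecutive equal values share one run id.
print(pl.DataFrame({"state": [1, 1, 2]}).with_columns(run=pl.col("state").rle_id()))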

A small example to the changed reading:

text = """
    A, B, C,

    1,1,1,

    2,2,2,

    3,3,3,

"""
df = pl.read_csv(io.StringIO(text))
print(df)

Which in 0.20.5 gives:

shape: (3, 4)
┌───────────┬─────┬─────┬──────┐
│         A ┆  B  ┆  C  ┆      │
│ ---       ┆ --- ┆ --- ┆ ---  │
│ i64       ┆ i64 ┆ i64 ┆ str  │
╞═══════════╪═════╪═════╪══════╡
│ 1         ┆ 1   ┆ 1   ┆ null │
│ 2         ┆ 2   ┆ 2   ┆ null │
│ 3         ┆ 3   ┆ 3   ┆ null │
└───────────┴─────┴─────┴──────┘

but in 0.20.6 gives:

shape: (7, 4)
┌───────┬──────┬──────┬──────┐
│     A ┆  B   ┆  C   ┆      │
│ ---   ┆ ---  ┆ ---  ┆ ---  │
│ i64   ┆ i64  ┆ i64  ┆ str  │
╞═══════╪══════╪══════╪══════╡
│ null  ┆ null ┆ null ┆ null │
│ 1     ┆ 1    ┆ 1    ┆ null │
│ null  ┆ null ┆ null ┆ null │
│ 2     ┆ 2    ┆ 2    ┆ null │
│ null  ┆ null ┆ null ┆ null │
│ 3     ┆ 3    ┆ 3    ┆ null │
│ null  ┆ null ┆ null ┆ null │
└───────┴──────┴──────┴──────┘

Best regards Kristian

@ritchie46
Member

@hagsted this is unrelated to the issue, and it is because a bug was fixed: whitespace belongs to the value and is no longer ignored in the value. You can argue that in this case the schema inference is incorrect, as these should all be string columns. Can you open a new issue for that? Then we can keep this issue on topic.
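As an aside, a minimal sketch (not from the thread) of one way to drop rows where every column is null, such as the rows produced from the blank CSV lines above:

import polars as pl

# Hypothetical frame containing all-null rows, similar to the 0.20.6 read_csv output above.
df = pl.DataFrame({"A": [None, 1, None, 2], "B": [None, 10, None, 20]})

# Keep only the rows where at least one column is non-null.
df = df.filter(~pl.all_horizontal(pl.all().is_null()))
print(df)  # only the (1, 10) and (2, 20) rows remain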

@david-waterworth
Author

Thanks @ritchie46, this seems to have fixed the issue on the dataset where I originally observed the problem!

@karlwiese

I think I found a reproducible example: #15615
