
Intermittent out-of-memory performing join after 0.20.6 update #14201

Closed
2 tasks done
david-waterworth opened this issue Feb 1, 2024 · 24 comments · Fixed by #14264
Labels
  • A-dtype (Area: data types in general)
  • accepted (Ready for implementation)
  • bug (Something isn't working)
  • P-high (Priority: high)
  • performance (Performance issues or improvements)
  • python (Related to Python Polars)
  • regression (Issue introduced by a new release)

Comments

@david-waterworth

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

I've not found a reliable way of reproducing.

Log output

No response

Issue description

After upgrading to 0.20.6 I started experiencing high memory/CPU usage when using the join operator. I have a large script with several steps that I'm in the process of migrating from a Jupyter/pandas implementation, so I've not been able to reproduce this reliably. The issue occurs randomly (i.e. on different lines in the script), but the general symptoms are:

The left table has ~90k rows x 15 columns. The right table is very small (i.e. 500 rows, 2 columns) and is a lookup (i.e. match on a key column and return a value). So the code is simply:

df = left.join(right, on="key", how="left")

The "key" column is pl.String, both left and right tables contain nested types.

This type of join is repeated multiple times (basically appending columns by performing lookups) and usually takes <1s to execute. But randomly one of them will start using excessive CPU, and memory will increase linearly until my swap is exhausted. Which operation fails appears to be random.
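For illustration, a minimal sketch of the repeated lookup-join pattern described above (the data and column names are invented, not the actual frames from this issue):

import polars as pl

# Hypothetical data: a wide left frame and several small two-column lookup
# frames joined one after another on a string key.
left = pl.DataFrame({
    "key": [f"id_{i % 500}" for i in range(90_000)],
    "value": list(range(90_000)),
})
lookups = {
    name: pl.DataFrame({"key": [f"id_{i}" for i in range(500)], name: list(range(500))})
    for name in ("type", "owner", "rank")
}
for name, right in lookups.items():
    # Each join only appends a single column; per the report, on 0.20.6 one of
    # these joins would intermittently start consuming CPU and memory.
    left = left.join(right, on="key", how="left")

print(left.shape)  # (90000, 5)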

I tried saving the two tables to parquet to reproduce, but no luck; it only happens when I run my script end to end (and not reliably in the same place).

I downgraded to 0.20.5 and the issue hasn't re-occurred.

It's happening on 2 machines (my workstation and in a docker container on AWS)

I'm not using lazy execution; that's on my todo list after I've finished migrating the original code.
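For reference, a minimal sketch (made-up frames, not the script from this issue) of what the same lookup join would look like with lazy execution:

import polars as pl

# Hypothetical frames; nothing is executed until .collect() is called.
left = pl.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pl.DataFrame({"key": ["a", "c"], "y": [10.0, 30.0]})

df = left.lazy().join(right.lazy(), on="key", how="left").collect()
print(df)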

I'm not really sure what else I can do to help.

Expected behavior

Stable memory usage

Installed versions

Replace this line with the output of pl.show_versions(). Leave the backticks in place.
david-waterworth added the bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer) and python (Related to Python Polars) labels on Feb 1, 2024
@jsarbach
Contributor

jsarbach commented Feb 2, 2024

I experience the same, with lazy execution.

@hagsted

hagsted commented Feb 2, 2024

I have also seen an endlessly running process. Polars 0.20.5 is fine. I have tried to make a minimal example, but was not able to reproduce it with something simple.

@coinflip112
Contributor

My colleagues and I experience similar issues. Lazy reads (parquet) with filters and selects seemed to read much more data into memory after the upgrade than before. Some of us were fine after downgrading to 0.20.5; some of us had to go down to 0.20.3.
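For context, a minimal sketch (the file name and columns are invented) of the lazy parquet pattern described above; with predicate and projection pushdown only the selected columns and matching rows should be materialised:

import polars as pl

# Hypothetical file and columns, illustrating the lazy read + filter + select pattern.
lf = (
    pl.scan_parquet("events.parquet")    # lazy scan, nothing is read yet
    .filter(pl.col("country") == "CH")   # predicate pushdown
    .select("user_id", "amount")         # projection pushdown
)
df = lf.collect()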

@hagsted

hagsted commented Feb 2, 2024

I might add that I am not sure whether it is the join or the read_csv functions that are slow.

@antonl

antonl commented Feb 2, 2024

I'm seeing out of memory crashes also. @david-waterworth are you using the conda-forge build of polars?

@david-waterworth
Author

@antonl no I'm using PyPI (via poetry).

@deanm0000
Collaborator

Can you put some numbers and specifics behind this?

How long does it usually take?

How long does it take in the bad cases? Does it get stuck in an endless loop and never stop, or is it just slower than you'd like?

What is the CPU activity like in the good and bad cases?

What is the memory usage in the good and bad cases (for example, by looking at htop or the Windows Task Manager)?

You mention you have nested types: are you talking about just structs, or structs with lists? Lists with structs?

Do you make any mutations to your key variable before the join (if so, how many and of what sort)?

If you cast to categorical before the join, do you still have the issue:

with pl.StringCache():
    left = left.with_columns(pl.col('key').cast(pl.Categorical))
    right = right.with_columns(pl.col('key').cast(pl.Categorical))
df = left.join(right, on="key", how="left")

For reference I can run this standalone simulation with no issue.

import polars as pl
import numpy as np


def gen_long_string(str_len, n_rows):
    rng = np.random.default_rng()
    return rng.integers(low=96, high=122, size=n_rows * str_len, dtype="uint32").view(
        f"U{str_len}"
    )

left_n=90_000
left = pl.DataFrame({
    'key':gen_long_string(20,int(left_n/5)).tolist()*5,
    **{str(x):np.random.normal(0,1,int(left_n/5)*5) for x in range(1,16)}  
})

right_n=500
right = pl.DataFrame({
    'key':left.sample(right_n).get_column('key').to_list(),
    'zz':np.random.normal(0,1,right_n)
})

left.join(right, on="key", how="left")

@david-waterworth
Author

david-waterworth commented Feb 3, 2024

@deanm0000 yes I can also run a similar stand-alone example with the exact same data that causes the issue in my script (i.e. I added a breakpoint and saved the 2 df's as parquet then loaded them in a notebook).

If I run with v0.20.5 it takes < 0.5s and the CPU/memory usage is negligible.

If I run with v0.20.6 it never completes as it crashes my machine. This takes a few minutes: initially all my cores spin up to 100% for maybe 30 seconds or so, then the CPU usage drops to one core, but the memory starts ramping up and eventually it fails.

I have multiple struct fields, list[str] and list[int] and list[struct].

There's several steps as I'm migrating a bunch of manually run notebooks, and I more or less replicated the original steps, so at this point it's not overly optimised or refactored.

First I construct the main dataframe from a list (~90k elements) of dicts:

tasks_df = pl.from_dicts(tasks, schema=schema)

Then I add a rank. I wasn't sure how to do this as a single operation, so I generated the rank over a subset and joined it back (the filter is also over a list[int] column):

masked_rank = (
    tasks_df.filter(pl.col('missing_metadata').is_null())
    .with_columns(rank=pl.col('priority').rank(method='ordinal').over('rule_name', 'primary_equipment'))
    .select("row_id", "rank")
)
tasks_df = tasks_df.join(masked_rank, on='row_id', how='left')

Then there's a series of other joins, which is where the issue usually surfaces.

i.e.

tasks_df = tasks_df.join(types_df, on="template", how="left")

The only mutation of the key variable I think is a rename (which I assume isn't technically a mutation?). But one of the joins that has definitely failed does a group_by on the key variable of the left table before the join (this is supposed to create one row per key in the right table, converting the values to a list[str]):

primary_equipment_types_df = (
    templates_df.join(template_metadata_types_df, on="template_id")
    .group_by("template_name")
    .agg(primary_equipment_types=pl.col("type_code"))
    .rename({"template_name": "template"})
)
tasks_df = tasks_df.join(primary_equipment_types_df, on="template", how="left")

If there's any tracing I can run, I'll create a 0.20.6 branch next week if you need me to run anything.

@deanm0000
Collaborator

Could you try your joins with your strings cast to categorical before joining? That will strengthen the theory that the new string implementation is the cause.

@david-waterworth
Author

david-waterworth commented Feb 3, 2024

@deanm0000

Yeah I think your theory is correct. I reproduced the issue in my project's 0.20.6 branch. Then I ran again with the key cast to pl.Categorical and it ran fine.

To be 100% sure I ran the original code and broke on the actual line where it was crashing, performed the join in the debug window using pl.Categorical and it worked fine, then ran the pl.String version and it crashed.

So it seems fairly likely that the str refactor has introduced this.

deanm0000 added the performance (Performance issues or improvements), regression (Issue introduced by a new release), P-high (Priority: high) and A-dtype (Area: data types in general) labels and removed the needs triage (Awaiting prioritization by a maintainer) label on Feb 3, 2024
github-project-automation bot moved this to Ready in Backlog on Feb 3, 2024
@detrin

detrin commented Feb 3, 2024

We had to pin the production pipeline partially to 0.20.5. We look forward to seeing the fix.

@ritchie46
Member

ritchie46 commented Feb 4, 2024

Can someone please get a reproduction? How should I fix it if I cannot cause it?

@david-waterworth can you send me your script perhaps (and remove the parts that don't influence this behavior)?

I really want to fix this, so help is greatly appreciated.

@detrin do you have something that we can use to reproduce this?

@detrin

detrin commented Feb 4, 2024

I have a relatively free Sunday; I will be glad to prepare a minimal example.

@detrin

detrin commented Feb 4, 2024

@ritchie46 I don't know how to profile it well; from Python I am getting too much variance to confirm the issue. I know for a fact that at this point of the script I was getting an OOM. Here I extracted for you two intermediate tables that have 1000 lines for the sake of an example; in production they have millions of rows.

import polars as pl
from memory_profiler import memory_usage  

program = pl.read_csv("https://eu2.contabostorage.com/62824c32198b4d53a08054da7a8b4df1:polarsissue14201/program_example.csv")
source = pl.read_csv("https://eu2.contabostorage.com/62824c32198b4d53a08054da7a8b4df1:polarsissue14201/source_example.csv")

def main(): 
    program_csfd = program.lazy().join(
        source.lazy(), how="left", left_on="program_title_norm", right_on="title_norm"
    ).collect(streaming=True)

if __name__ == '__main__':
    mem_usage = memory_usage(main)  
    # print(f"polars version: {pl.__version__}")
    print('Maximum memory usage: %s MB' % max(mem_usage))  

@ritchie46
Member

Have you got some more info? E.g. what did you do up to this point? And I likely need more rows; 1000 rows is nothing. (I cannot concatenate that into a large dataset, as that will lead to duplicates which will explode the join output.)

@ritchie46
Member

@hagsted @jsarbach @antonl do any of you have a reproducible example?

@detrin

detrin commented Feb 4, 2024

Have you got some more info? E.g. what did you do up to this point? And I likely need more rows; 1000 rows is nothing. (I cannot concatenate that into a large dataset, as that will lead to duplicates which will explode the join output.)

I can't provide the whole tables. I could perhaps reduce the number of columns and mask the titles so that I would still see the increase in RAM used, but that requires a lot of trial-and-error attempts and I am afraid I don't have that kind of time today or in the following days. However, I might prepare a bigger reproducible example during a workday. It will most likely be during this coming week.

@ritchie46
Member

Thanks @detrin. If I can reproduce, I can fix it. So I will wait until then.

@jsarbach
Contributor

jsarbach commented Feb 4, 2024

@ritchie46 Turns out mine is not related to a join at all. Instead, it seems to come from list.eval.

https://storage.googleapis.com/endless-dialect-336512-ew6/14201.parquet

import polars as pl

df = pl.read_parquet('14201.parquet')
df = (
    df.filter(pl.col('id_numeric').is_not_null() & pl.col('Team').is_not_null())
    .group_by(by=['id_numeric', 'Team'])
    .agg(pl.col('Items'))
)

df.with_columns(pl.col('Items').list.eval(pl.element().drop_nulls()))

The last line causes it to go out of memory with 0.20.6, but not with <0.20.6.

@ritchie46
Member

Thank you @jsarbach. I can do something with this. :)

@hagsted

hagsted commented Feb 4, 2024

Hi. I found some time to look into this and found that read_csv behaves a little differently between 0.20.5 and 0.20.6.

I had a CSV input with every second line empty. In 0.20.5 the empty lines are just dropped, but in 0.20.6 they are included in the data frame as rows of all nulls. This gave me a follow-up problem when I used rle_id() on a column: it gives a group for each row, as the values alternate value, null, value, null and so on, so there are never two consecutive rows with the same value.

I then later try to group_by on my rle_id column and will of course not get any grouping. Finally I add a Box to a plotly figure for each group. In principle it is then my plotting of the figure that hangs the system, as I add a large number of boxes to the figure, making it extremely slow to draw.

I hope it makes some sense.
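For illustration, a minimal sketch (not from the original comment) of the rle_id() behaviour described above: with every second row null, each run has length 1, so every row gets its own id and a later group_by produces one group per row.

import polars as pl

# Interleaved nulls: every value change starts a new run, so the ids are 0..5.
df = pl.DataFrame({"state": [1, None, 1, None, 2, None]})
print(df.with_columns(run=pl.col("state").rle_id()))

# Without the interleaved nulls, consecutive equal values share one run id.
print(pl.DataFrame({"state": [1, 1, 2]}).with_columns(run=pl.col("state").rle_id()))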

A small example to the changed reading:

text = """
    A, B, C,

    1,1,1,

    2,2,2,

    3,3,3,

"""
df = pl.read_csv(io.StringIO(text))
print(df)

Which in 0.20.5 gives:

shape: (3, 4)
┌───────────┬─────┬─────┬──────┐
│         A ┆  B  ┆  C  ┆      │
│ ---       ┆ --- ┆ --- ┆ ---  │
│ i64       ┆ i64 ┆ i64 ┆ str  │
╞═══════════╪═════╪═════╪══════╡
│ 1         ┆ 1   ┆ 1   ┆ null │
│ 2         ┆ 2   ┆ 2   ┆ null │
│ 3         ┆ 3   ┆ 3   ┆ null │
└───────────┴─────┴─────┴──────┘

but in 0.20.6 gives:

shape: (7, 4)
┌───────┬──────┬──────┬──────┐
│     A ┆  B   ┆  C   ┆      │
│ ---   ┆ ---  ┆ ---  ┆ ---  │
│ i64   ┆ i64  ┆ i64  ┆ str  │
╞═══════╪══════╪══════╪══════╡
│ null  ┆ null ┆ null ┆ null │
│ 1     ┆ 1    ┆ 1    ┆ null │
│ null  ┆ null ┆ null ┆ null │
│ 2     ┆ 2    ┆ 2    ┆ null │
│ null  ┆ null ┆ null ┆ null │
│ 3     ┆ 3    ┆ 3    ┆ null │
│ null  ┆ null ┆ null ┆ null │
└───────┴──────┴──────┴──────┘

Best regards Kristian

@ritchie46
Member

@hagsted this is unrelated to the issue, and it is because a bug was fixed: whitespace belongs to the value and is no longer ignored in the value. You can argue that in this case the schema inference is incorrect, as these should all be string columns. Can you open a new issue for that? Then we can keep this issue on topic.
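As an aside, a minimal sketch (not from the thread) of one way to drop rows where every column is null, such as the rows produced from the blank CSV lines above:

import polars as pl

# Hypothetical frame containing all-null rows, similar to the 0.20.6 read_csv output above.
df = pl.DataFrame({"A": [None, 1, None, 2], "B": [None, 10, None, 20]})

# Keep only the rows where at least one column is non-null.
df = df.filter(~pl.all_horizontal(pl.all().is_null()))
print(df)  # only the (1, 10) and (2, 20) rows remain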

@david-waterworth
Author

Thanks @ritchie46, this seems to have fixed the issue on the dataset where I originally observed the problem!

@karlwiese

I think I found a reproducible example: #15615
