High memory usage reading single column with `read_parquet` #15098

khwilson · 2024-03-16T21:14:46Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Copy the following into example.py

import sys

import polars as pl

def make_dataframe():
    df = pl.DataFrame(
        {
            "a": list(range(int(5e8))),
            "b": list(range(int(5e8), 2 * int(5e8))),
        }
    )
    df.write_csv("test.csv")
    df.write_parquet("test.parquet")


if __name__ == "__main__":
    if sys.argv[1] == "make":
        make_dataframe()
    if sys.argv[1] == "scan":
        pl.scan_parquet("test.parquet").select("a").collect().sum()
    if sys.argv[1] == "read":
        pl.read_parquet("test.parquet", columns=["a"]).sum()

Then run and compare the outputs of:

python3 example.py make

# This works on a Mac. Replace `-l` with `-v` if you have gnu time, e.g., on linux
/usr/bin/time -l python3 example.py scan
/usr/bin/time -l python3 example.py read
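
The same peak-memory figure can also be collected from Python instead of `/usr/bin/time`. This is a minimal sketch, assuming a Unix-like system (the `resource` module is not available on Windows). The helper name `peak_rss_mb` is hypothetical and not part of the original report.

```python
import resource
import subprocess
import sys


def peak_rss_mb(cmd):
    """Run cmd to completion and return the peak resident set size in MB.

    RUSAGE_CHILDREN reports the largest value over all children this
    process has waited for, so use one fresh Python process per measurement.
    """
    subprocess.run(cmd, check=True)
    maxrss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return maxrss / divisor


if __name__ == "__main__":
    # Example: python3 measure.py python3 example.py read
    if len(sys.argv) > 1:
        print(f"peak RSS: {peak_rss_mb(sys.argv[1:]):.2f} MB")
```

Running each case in its own invocation keeps one measurement from masking another, since the reported value is a running maximum.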

Log output

No response

Issue description

Similar to #8925, the behaviour of scan_parquet/csv and read_parquet/csv for reading a single column is surprising. In particular, when reading a single column from a parquet file with 1/2 billion rows, using read_parquet(filename, columns=[col_name]) takes nearly 4x the memory usage and 2x the time of calling scan_parquet(filename).select(col_name).collect().

On the other hand, read_csv (with rechunk=False) takes about half the memory and around 5/6 the time of scan_csv().collect().

Note that some of this may be due to the rechunk option, but setting rechunk=False in read_parquet still leads to higher memory and time usage than scan_parquet.collect().

Detailed table below generated by this gist on an M2 mac running macOS 14.3.1.

function	peak memory (MB)	real time (s)
`read_parquet(rechunk=True).sum`	3116.73	0.66
`read_csv(rechunk=True).sum`	1553.45	0.73
`read_parquet(rechunk=False).sum`	1589.93	0.48
`read_csv(rechunk=False).sum`	791.04	0.61
`scan_parquet.collect.sum`	819.81	0.26
`scan_csv.collect.sum`	1553.58	0.69
`scan_parquet.sum.collect`	820.08	0.26
`scan_csv.sum.collect`	1553.73	0.68
`scan_parquet.sum.collect(streaming)`	821.63	0.30
`scan_csv.sum.collect(streaming)`	943.72	0.65

Here the function names correspond to:

.collect.sum: func(filename).select("a").collect().sum()
.sum.collect: func(filename).select("a").sum().collect()
.sum.collect(streaming): func(filename).select("a").sum().collect(streaming=True)

Expected behavior

I would expect reading a single column of parquet with read_parquet to take less time but perhaps more memory than scan_parquet().collect() and similarly for read_csv and scan_csv.

Installed versions

--------Version info---------
Polars:               0.20.15
Index type:           UInt32
Platform:             macOS-14.3.1-arm64-arm-64bit
Python:               3.11.4 (main, Jul 30 2023, 21:55:46) [Clang 14.0.3 (clang-1403.0.22.14.1)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
numpy:                <not installed>
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>


itamarst · 2024-03-21T20:14:37Z

I can reproduce this locally, so I'll see if I can figure out what's going on.

itamarst · 2024-03-21T20:32:45Z

With rechunk=False, what I noticed is that the main factor is the existence of column b in the Parquet file. If the parquet file only has a single column a, scan and read take the same amount of memory.

With 3 columns, read of a single column a is 3× the memory usage of scan.

So at first glance, this suggests columns that aren't named are still somehow being read by read_parquet.

itamarst · 2024-03-21T20:51:09Z

The implementation of read_parquet() is essentially scan_parquet().select(columns).collect(no_optimization=True). And the no_optimization=True appears to be the source of the high memory usage. So either that can be just removed, or replaced with disabling most-but-not-the-important-one-for-this optimizations.

khwilson · 2024-03-22T13:28:11Z

Thank you for figuring this out!


khwilson added the bug, needs triage, and python labels Mar 16, 2024

itamarst mentioned this issue Mar 22, 2024

fix(python): Avoid loading all columns in read_parquet when columns parameter is specified #15229

Merged

stinodego closed this as completed in #15229 Mar 22, 2024

itamarst mentioned this issue Mar 22, 2024

Add Python test infrastructure for testing memory usage limits #15231

Closed

stinodego removed the needs triage label Mar 22, 2024

ritchie46 pushed a commit that referenced this issue Mar 28, 2024

test(python): Memory usage test infrastructure, plus a test for #15098 (#15285)

9c46183

Co-authored-by: Itamar Turner-Trauring, Stijn de Gooijer

itamarst mentioned this issue Mar 28, 2024

Minimal tests for memory usage contraints in partial-column read APIs #15371

Open

5 tasks

itamarst mentioned this issue Jul 20, 2024

Memory usage assertions in tests scientific-python/faster-scientific-python-ideas#3

Open
