zstd compression in write_ipc is incompatible with standalone zstd? #5000

Closed
indigoviolet opened this issue Sep 27, 2022 · 10 comments
Labels: invalid (A bug report that is not actually a bug), python (Related to Python Polars)

Comments

@indigoviolet

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of polars.

Issue Description

It seems that writing an IPC file with compression="zstd" produces a file that cannot be decompressed by the standalone zstd program, and an IPC file compressed by the standalone zstd program cannot be read by read_ipc.

Reproducible Example

import polars as pl
df = pl.DataFrame({"a": [1,2]})
df.write_ipc("test1.arrow.zst", compression="zstd")
df.write_ipc("test2.arrow")

In [24]:
! source ../../.env && zstd --version && zstd -d test1.arrow.zst
! source ../../.env && zstd test2.arrow
*** zstd command line interface 64-bits v1.5.2, by Yann Collet ***
zstd: test1.arrow.zst: unsupported format <---------------------------------------- ERROR!!!
test2.arrow          : 39.20%   (   528   B =>    207   B, test2.arrow.zst)


In [23]:
pl.read_ipc("test1.arrow.zstd").frame_equal(pl.read_ipc("test2.arrow"))
could not mmap compressed IPC file, defaulting to normal read
Out [23]:
True

In [25]:
pl.read_ipc("test2.arrow.zst")

Truncated Traceback (Use C-c C-$ to view full TB):
File ~/dev/instant-science/t5/.venv/lib/python3.8/site-packages/polars/internals/dataframe/frame.py:788, in DataFrame._read_ipc(cls, file, columns, n_rows, row_count_name, row_count_offset, rechunk, memory_map)
    786 projection, columns = handle_projection_columns(columns)
    787 self = cls.__new__(cls)
--> 788 self._df = PyDataFrame.read_ipc(
    789     file,
    790     columns,
    791     projection,
    792     n_rows,
    793     _prepare_row_count_args(row_count_name, row_count_offset),
    794     memory_map=memory_map,
    795 )
    796 return self

ArrowErrorException: OutOfSpec("InvalidHeader")

Expected Behavior

I would have expected that these were interchangeable.
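
For reference, a minimal sketch of bridging the two manually (assuming the zstandard Python package is available, which polars does not require): decompress the externally compressed file as a whole and hand the raw Arrow IPC bytes to polars.

import io

import polars as pl
import zstandard  # assumed extra dependency, not pulled in by polars

# test2.arrow.zst was produced by running the standalone zstd tool on an
# uncompressed IPC file, so the whole file must be decompressed before
# polars can parse the Arrow container inside.
with open("test2.arrow.zst", "rb") as f:
    ipc_bytes = zstandard.ZstdDecompressor().stream_reader(f).read()

df = pl.read_ipc(io.BytesIO(ipc_bytes))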

Installed Versions

---Version info---
Polars: 0.14.9
Index type: UInt32
Platform: Linux-4.19.0-21-cloud-amd64-x86_64-with-glibc2.10
Python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) [GCC 10.3.0]
---Optional dependencies---
pyarrow: 7.0.1
pandas: 1.5.0
numpy: 1.23.3
fsspec: 2022.8.2
connectorx:
xlsx2csv:
pytz: 2022.2.1
@indigoviolet added the bug (Something isn't working) and python (Related to Python Polars) labels on Sep 27, 2022
@indigoviolet (Author)

also woo! #5000

@ritchie46 (Member)

> also woo! #5000

LOL. Congrats! ;)

@ritchie46 (Member)

I don't believe the compression is done on the final byte blob, but more on sub-blobs.

By that I mean that the IPC-file probably still has metadata/ipc file structure that is not compressed, but does have certain data chunks that are compressed.

@jorgecarleitao is this correct?

@indigoviolet (Author)

> also woo! #5000
>
> LOL. Congrats! ;)

Well, congrats to you :)

@ghuls (Collaborator) commented Sep 27, 2022

> I don't believe the compression is done on the final byte blob, but more on sub-blobs.
>
> By that I mean that the IPC-file probably still has metadata/ipc file structure that is not compressed, but does have certain data chunks that are compressed.
>
> @jorgecarleitao is this correct?

Yeah, that is correct as far as I know. Only columns are compressed, not the file as a whole.
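
A quick sketch that makes this visible (the file name is just for illustration): the file produced by write_ipc with compression="zstd" still starts with the Arrow file magic bytes rather than a zstd frame header, which is why the standalone zstd tool reports an unsupported format.

import polars as pl

# The zstd compression is applied to the column buffers inside the IPC
# file; the outer container is still a plain Arrow file.
df = pl.DataFrame({"a": [1, 2]})
df.write_ipc("compressed_inside.arrow", compression="zstd")

with open("compressed_inside.arrow", "rb") as f:
    head = f.read(6)

print(head)                             # b'ARROW1' -> Arrow file magic
print(head[:4] == b"\x28\xb5\x2f\xfd")  # False -> no zstd frame magic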

@ghuls added the invalid (A bug report that is not actually a bug) label and removed the bug (Something isn't working) label on Sep 27, 2022
@indigoviolet (Author)

It seems worth documenting this difference. I guess there is a performance benefit, in that only a subset of columns needs to be decompressed while reading?

@jorgecarleitao (Collaborator)

What @ghuls wrote. The rationale is that it allows:

  • projection pushdown without having to decompress the whole chunk (see the sketch below)
  • consuming the file in chunks without having to decompress the whole thing
  • reading the file metadata without having to decompress the whole file
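
A minimal sketch of the first point, using a hypothetical two-column file: only the buffers of the requested column need to be read and decompressed.

import polars as pl

# Hypothetical example file with zstd-compressed column buffers inside.
pl.DataFrame({"a": [1, 2], "b": ["x", "y"]}).write_ipc(
    "example.arrow", compression="zstd"
)

# Projection pushdown: only column "a" is materialized; the buffers for
# "b" are left alone.
subset = pl.read_ipc("example.arrow", columns=["a"])
print(subset)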

Congrats everyone for the 5k!

@arturdaraujo

I'm having the same problem:

df = pl.read_ipc(r"..\NASDAQ\AACG.feather")
could not mmap compressed IPC file, defaulting to normal read

@kylebarron (Contributor)

@arturdaraujo That should be just a warning, not an error, and I think it is unrelated to this issue.
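
If the notice itself is unwanted, one possible sketch (assuming memory_map=False makes polars skip the mmap attempt, as the read path in the traceback above suggests):

import polars as pl

# Compressed IPC files cannot be memory-mapped anyway, so opting out of
# mmap up front should avoid the "could not mmap" notice.
df = pl.read_ipc(r"..\NASDAQ\AACG.feather", memory_map=False)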

@stinodego (Contributor)

Closing this as it's not a bug. A feature request to change the functionality is welcome (I believe it already exists in the form of #9283).

@stinodego closed this as not planned on Jul 14, 2023