pyarrow DictionaryArray as partition column for write_deltalake fails #2969
Comments
Hi there, has this been resolved? |
This is caused by delta-rs/crates/core/src/writer/record_batch.rs, lines 455 to 457 in c623644, which comes from delta-kernel-rs. Please create an upstream issue there: https://github.com/delta-io/delta-kernel-rs. |
Not entirely sure anymore and could not find an explicit mention in the protocol at a quick glance, but I do believe that complex types are not supported for partition values. The only "hint" I could find though is that the documentation for partition value serialization omits these complex types. |
@roeap then we could add a simple check and raise if those columns are complex types |
Absolutely. I do believe we should try to understand what is happening today, though. From the report it seems that this currently works only some of the time, even though both approaches should, at least in theory, represent the same data. Probably we do not match on dict-encoded arrays somewhere... It would be unfortunate if many people already use this in the wild, i.e. it does work somehow. In that case, maybe we emit a warning for now and break in 1.0? |
I believe complex types have never been supported even in old hive style tables. Complex types don't have directly discernable equality and ordering. Is there a use case @jorritsandbrink you are trying to solve here that a complex type partition column was necessary? |
@hntd187 We have a load identifier (string) that is unique for each pipeline run. We store it in a column alongside the data. Records loaded in the same run all have the same value in the load identifier column. Using dictionary encoding for this column reduces data size a lot. Our queries filter on the load identifier and we like to partition the column for data skipping. |
So partition values are not physically stored in parquet files; they are normally kept in the Delta log and projected into the data on a per-partition basis. Instead of using dictionary encoding, I would try a standard string column and partition on that. If you create a table without the partitioning, you should notice the strings are kept in the physical parquet files, so by just using normal string partitioning you get more or less the same benefits. The problem, I think, as Robert mentioned above, is that partition values have to be string-serializable, and I do not think a dictionary array has an obvious way of being serialized to a string, but I might be wrong here. |
Right! Completely overlooked that.
We are indeed using a regular string column instead, and it's good knowing that it probably won't negatively impact performance. In that case, there is no clear use case for having a |
Environment
Delta-rs version: 0.21.0
Binding: python
Environment: local, WSL2, Ubuntu 24.04.1 LTS
Bug
What happened:
_internal.DeltaError: Generic DeltaTable error: Missing partition column: failed to parse
when using pyarrow DictionaryArray as partition column for write_deltalake.
What you expected to happen:
Successful write.
How to reproduce it:
More details: