Which part is this question about

I am using the parquet crate through delta-rs and trying to understand the disconnect between Delta's interpretation of timestamp and Parquet's. For example, Delta considers timestamps to be microseconds since the Unix epoch.

Describe your question

The Parquet format docs define a dedicated timestamp logical type, which I don't believe Delta is using. The Parquet files written by Delta (the Spark implementation) write out the INT96 physical type.
The parquet-tools CLI shows the column type from a .parquet file as:

When I modify the read_parquet.rs example, the schema of the RecordBatch coming from an example file with the above column is:
I am assuming that the code doing this conversion of the INT96 column to a timestamp is in consume_batch within primitive_array.rs, but I'm not entirely sure.
I'm hoping for some help figuring out where the disconnect might be between how Delta Lake thinks "timestamp" should look (microseconds) and how the Parquet Rust reader coerces that INT96 to nanoseconds.
I'm trying to figure out
Additional context
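For reference, Spark's INT96 timestamps pack a Julian day number together with nanoseconds within that day, which is why the reader naturally produces nanosecond precision. A minimal std-only sketch of that decode, assuming the little-endian layout (bytes 0..8 are nanoseconds of day, bytes 8..12 the Julian day); the names here are illustrative, not the parquet crate's:

```rust
// Assumption: Spark's INT96 layout — bytes 0..8 hold little-endian
// nanoseconds within the day, bytes 8..12 the little-endian Julian day.
const JULIAN_DAY_OF_UNIX_EPOCH: i64 = 2_440_588;
const NANOS_PER_DAY: i64 = 86_400_000_000_000;

/// Decode a raw 12-byte INT96 value into nanoseconds since the Unix epoch.
fn int96_to_unix_nanos(raw: [u8; 12]) -> i64 {
    let nanos_of_day = u64::from_le_bytes(raw[0..8].try_into().unwrap()) as i64;
    let julian_day = u32::from_le_bytes(raw[8..12].try_into().unwrap()) as i64;
    (julian_day - JULIAN_DAY_OF_UNIX_EPOCH) * NANOS_PER_DAY + nanos_of_day
}

fn main() {
    // 1970-01-02T00:00:00.000000001Z: Julian day 2_440_589, one nanosecond in.
    let mut raw = [0u8; 12];
    raw[0..8].copy_from_slice(&1u64.to_le_bytes());
    raw[8..12].copy_from_slice(&2_440_589u32.to_le_bytes());
    println!("{}", int96_to_unix_nanos(raw)); // 86400000000001
}
```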
The parquet reader is returning nanoseconds because that is the precision present in the encoding. I'm not familiar with deltalake's timestamp handling, but it may be that they assume all timestamps are microseconds. As this is not actually true, delta-rs should probably add coercion logic to convert where appropriate.
FWIW the INT96 encoding has been deprecated for almost a decade; it is slightly ridiculous that Spark is still using it.
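The coercion suggested above boils down to rescaling the i64 values. A minimal sketch, assuming nanosecond inputs and truncating sub-microsecond precision; a real fix in delta-rs would presumably go through arrow's cast kernels rather than a hand-rolled function like this:

```rust
/// Rescale nanoseconds-since-epoch to microseconds-since-epoch.
/// div_euclid rounds toward negative infinity, so pre-epoch values
/// stay monotonic instead of rounding toward zero.
fn nanos_to_micros(ns: i64) -> i64 {
    ns.div_euclid(1_000)
}

fn main() {
    let nanos = [86_400_000_000_001_i64, 1_500, -1];
    let micros: Vec<i64> = nanos.iter().map(|&ns| nanos_to_micros(ns)).collect();
    println!("{:?}", micros); // [86400000000, 1, -1]
}
```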