unable to write parquet file with UTC timestamp #1932
Comments
Here are some things I've tried (none of them make any difference):
Could you expand a bit on what the expected behaviour is? I honestly cannot find any comprehensive document on how this is supposed to be handled. It's one of the many data model mismatches between arrow and parquet where it isn't very clearly defined what is "correct" - #1666. Ultimately Parquet does not have a native mechanism to encode timezone information in its schema, instead opting for something slightly different - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp. The arrow schema is embedded in the parquet file, but as documented in #1663 this cannot be treated as authoritative. What I can say is the following:
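To illustrate the mismatch described above, here is a minimal std-only sketch of how an Arrow timestamp type might be reduced to a Parquet TIMESTAMP annotation. Every type and the function `to_parquet` are hypothetical stand-ins written for this note, not the `arrow` or `parquet` crate APIs, and the mapping rule (any non-empty timezone becomes `isAdjustedToUTC = true`, seconds get no annotation) is an assumption rather than a statement of what arrow-rs does. The point is only that the Parquet annotation carries a boolean flag and a unit, with no room for a timezone name.

```rust
/// Hypothetical stand-in for Arrow's time units (not the `arrow` crate's type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ArrowUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Hypothetical stand-in for the units of Parquet's TIMESTAMP logical type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ParquetUnit {
    Millis,
    Micros,
    Nanos,
}

/// Hypothetical model of a Parquet TIMESTAMP annotation: a flag and a unit,
/// with no field that could hold a timezone name.
#[derive(Debug, PartialEq)]
struct ParquetTimestamp {
    is_adjusted_to_utc: bool,
    unit: ParquetUnit,
}

/// Assumed mapping: a non-empty Arrow timezone sets the UTC flag; a missing or
/// empty one leaves it unset. Seconds have no Parquet TIMESTAMP unit, so this
/// sketch returns `None` for them.
fn to_parquet(unit: ArrowUnit, tz: Option<&str>) -> Option<ParquetTimestamp> {
    let unit = match unit {
        ArrowUnit::Second => return None,
        ArrowUnit::Millisecond => ParquetUnit::Millis,
        ArrowUnit::Microsecond => ParquetUnit::Micros,
        ArrowUnit::Nanosecond => ParquetUnit::Nanos,
    };
    Some(ParquetTimestamp {
        is_adjusted_to_utc: tz.map_or(false, |z| !z.is_empty()),
        unit,
    })
}

fn main() {
    let utc = to_parquet(ArrowUnit::Millisecond, Some("UTC"));
    let paris = to_parquet(ArrowUnit::Millisecond, Some("Europe/Paris"));
    let naive = to_parquet(ArrowUnit::Millisecond, None);

    println!("UTC:          {:?}", utc);
    println!("Europe/Paris: {:?}", paris);
    println!("no timezone:  {:?}", naive);

    // Under the assumed mapping, two different zones collapse to the same
    // annotation, while the naive timestamp stays distinguishable.
    assert_eq!(utc, paris);
    assert_ne!(utc, naive);
}
```

Under this assumed mapping a reader can tell "UTC-adjusted" from "naive" using the flag alone, but cannot recover which zone the writer named without consulting other metadata such as the embedded arrow schema.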
Sure! For me, expected behavior is that the following assertion passes:

    import pandas as pd
    assert str(pd.read_parquet("/tmp/q.parquet").dtypes.metric_date) == 'datetime64[ns, UTC]'
    # and not 'datetime64[ns]'

Thank you so much for explaining that, given the nature of the specification, this might not be feasible; I was going crazy, in part because this used to work (I have a Python unit test that invokes Rust code and reads parquet files generated by Rust). Up to version 14 of parquet+arrow, this worked fine. But as of version 15, the behavior changed.
This slightly simplified example shows different behavior depending on the arrow/parquet version it is built against:

    use std::sync::Arc;

    use arrow::{
        array::{StringArray, TimestampMillisecondArray},
        datatypes::{DataType, Field, Schema, TimeUnit},
        record_batch::RecordBatch,
    };
    use parquet::arrow::arrow_writer::ArrowWriter;

    fn main() {
        // Timezone attached to both the schema field and the array.
        let tz = Some("UTC".to_owned());
        let fields = vec![
            Field::new(
                "metric_date",
                DataType::Timestamp(TimeUnit::Millisecond, tz.clone()),
                false,
            ),
            Field::new("my_id", DataType::Utf8, false),
        ];
        let schema = Arc::new(Schema::new(fields));

        // Two rows: a string id column and a UTC timestamp column.
        let my_ids = Arc::new(StringArray::from(vec!["hi", "there"]));
        let dates = Arc::new(TimestampMillisecondArray::from_vec(
            vec![1234532523, 1234124],
            tz,
        ));
        let batch = RecordBatch::try_new(schema.clone(), vec![dates, my_ids]).unwrap();

        // Write the batch to /tmp/q.parquet with default writer properties.
        let f = std::fs::File::create("/tmp/q.parquet").unwrap();
        let mut writer = ArrowWriter::try_new(f, schema, None).unwrap();
        writer.write(&batch).unwrap();
        writer.close().unwrap();
        println!("Hello, world!");
    }

Given the unfortunate state of the specification, I understand that the changes in version 15 might be better in many ways and fix all manner of issues, but in this regard, they constitute a regression.
@tustvold Thank you so much for fixing this so quickly! I really appreciate it! We're using rust+parquet+python+serverless for geospatial computing at work and arrow-rs' work has been incredibly helpful!
Describe the bug
I cannot figure out how to write a parquet file with a timestamp column that gets encoded as UTC. All my efforts produce files with naive timestamps and no UTC metadata.
To Reproduce
Consider this program: it writes a tiny parquet file to /tmp/q.parquet. But using both pqrs and pandas/pyarrow on the resulting file shows that there is no timezone present -- the metric_date column is a naive timestamp.
Additional context
Tested using arrow="16.0.0" and parquet="16.0.0".