Can not create a TimestampNanosecondArray that has a specified timezone #597
Comments
@jorgecarleitao / @Dandandan / @GregBowyer / @velvia can you think of something I have missed here? Is it a reasonable proposal to switch to using |
C++ uses a string but I'm not familiar with rust implementation at all. |
@alamb there is actually a way to create primitive timezone arrays in other timezones. I have some code which I'll be PR'ing against both Arrow and DataFusion which does this, but includes other things we'll need. However, something like this:
You can use from_vec as well and it works too. I found other ways of creating timezone-based arrays as well. You don't need to create a type which has non-None timezone (yes I'm aware of that limitation and was going to point it out too). The key is that the underlying ArrayData has the correct, actual timezone, and this is what is returned by the data_type() calls and checked dynamically. There is an inconsistency there though, which I agree with, and solving that needs more discussion. Regarding switching to
|
@alamb PRs/issues which are related to this and other timezone issues and I plan to submit soon-ish:
|
Offsets are much more practical indeed but don't cover things like daylight savings changes and human readable timezones (e.g. EST, PST, CET). But I don't know what your scope is and offsets might be enough. Chrono-tz seems to implement a timezone database in case you decide to look into that direction. |
The scope is an internal, compact representation to 1) facilitate easy and fast computation, 2) minimize memory and storage costs, 3) minimize network serialization overhead in Ballista. It should be easily convertible to a string representation. It is true that numeric offsets would lose some information; perhaps there are better representations out there, like an enum from chrono or something. At the same time, this is also a deep rabbit hole and trying to be completely accurate might also not be practical... |
@velvia I suggest we ensure the Rust implementation can interoperate with the timezone / time definition (e.g. use a string as defined here: https://github.com/apache/arrow-rs/blob/master/format/Schema.fbs#L226-L246). That string representation also allows for numeric offsets like If
enum Timezone {
    /// UTC offset, in minutes (?tbd units)
    Offset(i32),
    /// name with the meaning from the arrow definition
    Name(&'static str),
}
That is quite cool -- I did not realize that 👍 It is still somewhat strange because the
That is why I proposed using |
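For illustration, here is a minimal sketch of the enum floated above, made compilable. The variant names, the choice of minutes as the offset unit (the comment above leaves the unit open), and the `Display` impl that renders an offset as `+HH:MM` are assumptions for this sketch and not an arrow-rs API.

```rust
use std::fmt;

/// Hypothetical compact timezone representation (not part of arrow-rs).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Timezone {
    /// UTC offset in minutes (assumed unit; the proposal leaves it open).
    Offset(i32),
    /// Timezone name with the meaning given by the Arrow format definition.
    Name(&'static str),
}

impl fmt::Display for Timezone {
    /// Render back to a string so the value can still cross a string-based API.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Timezone::Offset(minutes) => {
                let sign = if *minutes < 0 { '-' } else { '+' };
                let abs = minutes.unsigned_abs();
                write!(f, "{}{:02}:{:02}", sign, abs / 60, abs % 60)
            }
            Timezone::Name(name) => f.write_str(name),
        }
    }
}
```

The enum is `Copy`, so passing it around costs no allocation, while `to_string()` recovers a textual form when one is needed.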
@alamb I agree that enum would cover all existing use cases. Still I'm curious, isn't there any restriction on what that string can be? In reality in order to perform any timezone adjustments, that string should really conform to one of the standard timezone abbreviations or fully qualified timezone names, or offset - otherwise it cannot be useful. Ideally it would be like an enum for each of the possible timezones in the IANA database, for which If we leave it as a string, there may be timezones which cannot be acted on.... I guess this is likely to be the case. Having the offset though (or a timezone enum) offers the possibility of faster translation. Agreed |
I do not see what is the concern with The root cause here seems to be the trait of the generic of the Arrow2 solves this by decoupling the generic on the PrimitiveArray from the (logical) I would recommend bridging this gap by refactoring the code to not have the AFAI understand Having a generic for every possible timezone significantly bloats the binary, as it requires all generics to be re-compiled for every declared timezone variation, and there are a lot of them. The typical problem will be in Generics are used to support different physical in-memory representations of a struct, not to declare semantics. Semantics based on generics is a recipe for large binaries and/or a large number of matches in downcasting trait objects. |
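Good points @jorgecarleitao and @velvia -- it sounds like my challenge / problem with TimestampNanosecondArray will be solved when we bring in the ideas of arrow2. If it becomes a problem I can look into removing DATA_TYPE as Jorge suggests. One thing I have noticed that might warrant more thought about something other than String for storing timezones is that the DataType struct is copied around a lot in arrow code. Maybe something more like Arc<str> would be appropriate if we ever want to change the type. |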
I would definitely agree that we don’t want a specific type per timezone and that the direction of arrow2 is the right one.
@jorgecarleitao the concern with using Strings is twofold:
1) Parsing cost. It means any timezone manipulation first requires parsing the string, and that has to be done for every single Array since each one owns its own String and could be different.
2) Storage and serialization costs.
Using Arc<> and friends would help at least number 2, but not number 1.
For fast data processing, a solution like the enum that Andrew proposed, or an enum with different timezones, for internal storage would solve both problems. One can remain compatible with the spec by still taking in a string in APIs, but allow storage and processing to be optimized.
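As a rough sketch of the "take a string in the API, store something cheaper internally" idea, the code below parses an offset string once at construction time and keeps the result next to the original text. `ParsedTimezone` and `parse_offset_minutes` are hypothetical names, and only the `+HH:MM` / `-HH:MM` form is handled; named zones are left unparsed here.

```rust
/// Parse exactly two ASCII digits into a number.
fn two_digits(s: &str) -> Option<i32> {
    if s.len() == 2 && s.bytes().all(|b| b.is_ascii_digit()) {
        s.parse().ok()
    } else {
        None
    }
}

/// Parse "+HH:MM" or "-HH:MM" into a signed number of minutes.
/// Returns None for anything else, including named timezones.
pub fn parse_offset_minutes(tz: &str) -> Option<i32> {
    let bytes = tz.as_bytes();
    if bytes.len() != 6 || bytes[3] != b':' {
        return None;
    }
    let sign = match bytes[0] {
        b'+' => 1,
        b'-' => -1,
        _ => return None,
    };
    let hours = two_digits(tz.get(1..3)?)?;
    let minutes = two_digits(tz.get(4..6)?)?;
    if hours > 23 || minutes > 59 {
        return None;
    }
    Some(sign * (hours * 60 + minutes))
}

/// Hypothetical wrapper: the string is parsed once, when the value is built.
pub struct ParsedTimezone {
    text: String,
    offset_minutes: Option<i32>,
}

impl ParsedTimezone {
    pub fn new(text: &str) -> Self {
        Self {
            text: text.to_string(),
            offset_minutes: parse_offset_minutes(text),
        }
    }

    /// Cached offset; no parsing happens on this path.
    pub fn offset_minutes(&self) -> Option<i32> {
        self.offset_minutes
    }

    /// Original string, for spec-compatible serialization.
    pub fn as_str(&self) -> &str {
        &self.text
    }
}
```

This addresses the parsing cost (point 1) only; the struct still owns a `String`, so the storage cost (point 2) is unchanged in this sketch.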
… On Jul 24, 2021, at 3:56 AM, Andrew Lamb ***@***.***> wrote:
Good points @jorgecarleitao and @velvia -- it sounds like my challenge / problem with TimestampNanosecondArray will be solved when we bring in the ideas of arrow2. If it becomes a problem I can look into removing DATA_TYPE as Jorge suggests.
One thing I have noticed that might warrant more thought about something other than String for storing timezones is that the DataType struct is copied around a lot in arrow code. Maybe something more like Arc<str> would be appropriate if we ever want to change the type.
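To make the `Arc<str>` remark concrete, here is a small sketch using a simplified stand-in type (`TimestampType` is hypothetical, not the arrow-rs `DataType`). Cloning a value that holds an `Arc<str>` bumps a reference count instead of copying the timezone bytes, which is what matters if the type is copied around a lot.

```rust
use std::sync::Arc;

/// Simplified stand-in for a data type that carries a timezone.
#[derive(Debug, Clone, PartialEq)]
pub struct TimestampType {
    pub timezone: Option<Arc<str>>,
}

/// True when both values point at the same timezone allocation.
pub fn shares_allocation(a: &TimestampType, b: &TimestampType) -> bool {
    match (&a.timezone, &b.timezone) {
        (Some(x), Some(y)) => Arc::ptr_eq(x, y),
        _ => false,
    }
}
```

With `Option<String>` in the same position, each clone would allocate and copy the string instead.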
|
I believe https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.with_timezone provides the necessary API. This is also consistent with how we now handle decimals, where the DATA_TYPE is only the default value, and can be overridden with a later call to https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.with_precision_and_scale |
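The "default value that can be overridden later" pattern can be sketched without the arrow crate as follows. All types here are simplified stand-ins written for this example; only the method name `with_timezone` mirrors the linked API, and the real signatures may differ.

```rust
use std::sync::Arc;

/// Simplified logical type (stand-in, not the arrow-rs `DataType`).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    TimestampNanosecond(Option<Arc<str>>),
}

pub trait PrimitiveType {
    type Native: Copy;
    /// Default logical type only; the array stores its own copy.
    const DATA_TYPE: DataType;
}

pub struct TimestampNanosecondType;

impl PrimitiveType for TimestampNanosecondType {
    type Native = i64;
    const DATA_TYPE: DataType = DataType::TimestampNanosecond(None);
}

pub struct PrimitiveArray<T: PrimitiveType> {
    values: Vec<T::Native>,
    /// Runtime value, so it can hold a timezone built at runtime.
    data_type: DataType,
}

impl<T: PrimitiveType> PrimitiveArray<T> {
    pub fn from_values(values: Vec<T::Native>) -> Self {
        Self { values, data_type: T::DATA_TYPE }
    }

    pub fn data_type(&self) -> &DataType {
        &self.data_type
    }

    pub fn values(&self) -> &[T::Native] {
        &self.values
    }
}

impl PrimitiveArray<TimestampNanosecondType> {
    /// Replace the default (timezone-less) type with one carrying `tz`.
    pub fn with_timezone(mut self, tz: impl Into<Arc<str>>) -> Self {
        self.data_type = DataType::TimestampNanosecond(Some(tz.into()));
        self
    }
}
```

Both the default and the timezone-carrying array have the same Rust type here, so no extra generic instantiation is needed per timezone.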
Describe the bug
I can not create a PrimitiveArray that has a DataType::Timestamp with a timezone other than None
To Reproduce
I am trying to pretty print an array that has timestamps using UTC time.
So instead of
1970-01-01 00:00:00.000000100
I want to print something more like the following (note the Z, in RFC3339):
1970-01-01T00:00:00.000000100Z
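For reference, the target output format can be produced directly from the raw nanosecond value. The sketch below is standalone and uses only the standard library; `format_rfc3339_utc` and `civil_from_days` are names chosen for this example (the date conversion follows Howard Hinnant's well-known days-to-civil algorithm), and it is not how arrow's pretty printer works.

```rust
/// Convert days since 1970-01-01 to (year, month, day) in the
/// proleptic Gregorian calendar.
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468; // shift epoch to 0000-03-01
    let era = z.div_euclid(146_097); // 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, [0, 146096]
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year, March-based
    let mp = (5 * doy + 2) / 153; // month index, March = 0
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let month = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}

/// Format nanoseconds since the Unix epoch as an RFC 3339 UTC timestamp.
pub fn format_rfc3339_utc(nanos_since_epoch: i64) -> String {
    const NANOS_PER_SEC: i64 = 1_000_000_000;
    // Euclidean division keeps the sub-second part non-negative
    // for timestamps before 1970.
    let secs = nanos_since_epoch.div_euclid(NANOS_PER_SEC);
    let nanos = nanos_since_epoch.rem_euclid(NANOS_PER_SEC);
    let days = secs.div_euclid(86_400);
    let sod = secs.rem_euclid(86_400); // second of day
    let (year, month, day) = civil_from_days(days);
    format!(
        "{:04}-{:02}-{:02}T{:02}:{:02}:{:02}.{:09}Z",
        year,
        month,
        day,
        sod / 3_600,
        sod % 3_600 / 60,
        sod % 60,
        nanos
    )
}
```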
To do so I figured I would "simply" create a new array that had a Timestamp type with the timezone set to UTC
TimestampNanosecondArray is defined like this:
And TimestampNanosecondType is (effectively) something like
So I figured I would make a PrimitiveArray<SOME_TYPE_THAT_HAD_MY_TIMEZONE_SPECIFIED>
Uhoh! The compiler doesn't like that because it needs a const String, which you can't have....
So I conclude it is basically impossible to create an array with a timezone other than None
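The obstacle described above can be reproduced with simplified stand-in types. The definitions below are assumptions written for this sketch (they mimic, but are not copied from, the arrow-rs source); the point is only that an associated `const` holding an `Option<String>` can be `None` but cannot be a non-empty string on stable Rust.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum TimeUnit {
    Nanosecond,
}

/// Stand-in for a data type whose timezone is an owned String.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Timestamp(TimeUnit, Option<String>),
}

/// Stand-in for the trait that ties a Rust type to a logical type.
pub trait ArrowPrimitiveType {
    type Native;
    const DATA_TYPE: DataType;
}

pub struct TimestampNanosecondType;

impl ArrowPrimitiveType for TimestampNanosecondType {
    type Native = i64;
    // Fine: `None` needs no allocation, so it is a valid constant.
    const DATA_TYPE: DataType = DataType::Timestamp(TimeUnit::Nanosecond, None);
}

// A hypothetical per-timezone type is rejected by the compiler, because a
// non-empty String cannot be built in a const context:
//
// pub struct TimestampNanosecondUtcType;
// impl ArrowPrimitiveType for TimestampNanosecondUtcType {
//     type Native = i64;
//     const DATA_TYPE: DataType =
//         DataType::Timestamp(TimeUnit::Nanosecond, Some(String::from("UTC")));
// }

/// Generic code only ever sees the compile-time constant.
pub fn logical_type<T: ArrowPrimitiveType>() -> DataType {
    T::DATA_TYPE
}
```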
Expected behavior
I expect to be able to create an array using a timezone
Proposal
Change DataType::Timestamp from
to
Additional context
FWIW I also tried using lazy_static but it doesn't provide a const String (only a static String).
I view this as a potential first step towards handling timestamps properly in arrow and datafusion: apache/datafusion#686