-
Notifications
You must be signed in to change notification settings - Fork 483
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ORC-1053: Fix time zone offset precision when convert tool converts LocalDateTime
to Timestamp
is not consistent with the internal default precision of ORC
#967
Conversation
…onverts LocalDateTime to timestamp is not consistent with the internal default precision of ORC
cc @dongjoon-hyun @williamhyun : ) |
Thank you so much, @guiyanakuang ! |
@@ -372,4 +370,15 @@ Converter buildConverter(IntWritable startOffset, TypeDescription schema) { | |||
throw new IllegalArgumentException("Unhandled type " + schema); | |||
} | |||
} | |||
|
|||
public static void main(String[] args) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you remove this please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very sorry for committing some validation code
} | ||
|
||
@Test | ||
public void testConvertCustomTimestampFromCsv() throws IOException, ParseException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding this.
for (int row = 0; row < batch.size; ++row) { | ||
Timestamp timestamp = Timestamp.valueOf(timeValues[row]); | ||
assertEquals(timestamp.getTime(), tcv.time[row]); | ||
assertEquals(timestamp.getNanos(), tcv.nanos[row]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to pass without your patch. Could you verify this again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I verified before submitting that my local default timezone, and the America/New_York timezone on which jira uploaded the file, would not pass the test based on past code.
The reason for this is the inconsistent interpretation of the timestamp offfset accuracy when writing and reading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Timestamp timestamp1 = Timestamp.from(localDateTime.toInstant());
Timestamp timestamp2 = Timestamp.valueOf(localDateTime);
Both methods use the local time zone to calculate the timestamp, but the offset precision of the time zone of the previous method is seconds (for New York is 17762 s), and the precision of the latter method is hours (for New York is 18000 s, which is 5 hours)
And the readings are interpreted by the latter precision (hour)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran this yesterday and today. It passed on my laptop without your patch.
$ git diff main --stat
java/tools/src/test/org/apache/orc/tools/convert/TestConvert.java | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 89 insertions(+)
$ mvn package -pl tools -Dtest=org.apache.orc.tools.convert.TestConvert
...
[INFO] -------------------------------------------------------
[INFO] T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.orc.tools.convert.TestConvert
Processing test.csv
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.337 s - in org.apache.orc.tools.convert.TestConvert
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difference between you and me seems to be the timezones. You are in Asia timezone and I'm in America/Los Angeles timezone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make the test case to be independent from the tester's environment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I should specify the time zone to test, but there are many methods inside ORC to get the local default time zone, so it's not good to cover it uniformly at the moment, I need to think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun I added two hooks using annotations to ensure that the default time zone during testing this class is New York, which feels good. Also I removed a redundant statement 😅.
You can try again when you have time, I'm sure the old code will not pass the test.
Also, cc @wgtmac since C++ tool seems to have the same issue according to the JIRA. |
I have also debugged some of the C++ code. It looks like orc/tools/src/CSVFileImport.cc Line 257 in 334bf1f
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you, @guiyanakuang .
For C++ case, could you file a new JIRA for that?
ORC-1053 is enough for Java-only patch
…LocalDateTime` to `Timestamp` is not consistent with the internal default precision of ORC (#967) ### What changes were proposed in this pull request? ```java // use tool compute offset : 17762 int toolOffset = ((LocalDateTime) temporalAccessor).atZone(TimeZone.getTimeZone("America/New_York").toZoneId()).getOffset().getTotalSeconds(); // in orc internal compute offset: 18000 int orcInternalOffset = TimeZone.getTimeZone("America/New_York").getRawOffset() / 1000 ``` This pr is designed to modify the implementation of the LocalDateTime to Timestamp conversion so that the time zone accuracy is consistent with the ORC internal accuracy during the conversion ### Why are the changes needed? Avoid inconsistencies between converted data and expected data from convert tools. ### How was this patch tested? Add issue-specific unit test (cherry picked from commit e787b8b) Signed-off-by: Dongjoon Hyun <[email protected]>
Merged to main/1.7. |
Okay, I created the ORC-1055 |
Thank you! |
Pre-1970 timestamp is a notorious issue. Not sure if these test cases will help: |
@wgtmac , Unfortunately I think this is a new issue. int main(int,char* argv[]){
struct tm timeStruct;
std::string name = "0001-01-01 00:00:00.000";
char *left=strptime(name.c_str(), "%Y-%m-%d %H:%M:%S", &timeStruct);
time_t _time = timegm(&timeStruct);
char *tail;
double d = strtod(left, &tail);
long na;
if (tail != left) {
na = static_cast<long>(d * 1000000000.0);
} else {
na = 0;
}
std::cout << _time << " " << na << std::endl;
} On my mac machine the output is "-1 0"
It looks like the first test verifies that we can write time as a negative number. Of course I verified the |
Thanks for the input. I will take a look later this week. |
Thank you so much, @guiyanakuang and @wgtmac . |
What changes were proposed in this pull request?
This pr is designed to modify the implementation of the LocalDateTime to Timestamp conversion so that the time zone accuracy is consistent with the ORC internal accuracy during the conversion
Why are the changes needed?
Avoid inconsistencies between converted data and expected data from convert tools.
How was this patch tested?
Add issue-specific unit test