Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC-1053: Fix time zone offset precision when convert tool converts LocalDateTime to Timestamp is not consistent with the internal default precision of ORC #967

Merged
merged 6 commits into from
Dec 3, 2021

Conversation

guiyanakuang
Copy link
Member

@guiyanakuang guiyanakuang commented Dec 1, 2021

What changes were proposed in this pull request?

// use tool compute offset : 17762
int toolOffset = ((LocalDateTime) temporalAccessor).atZone(TimeZone.getTimeZone("America/New_York").toZoneId()).getOffset().getTotalSeconds();

// in orc internal compute offset: 18000
int orcInternalOffset = TimeZone.getTimeZone("America/New_York").getRawOffset() / 1000

This pr is designed to modify the implementation of the LocalDateTime to Timestamp conversion so that the time zone accuracy is consistent with the ORC internal accuracy during the conversion

Why are the changes needed?

Avoid inconsistencies between converted data and expected data from convert tools.

How was this patch tested?

Add issue-specific unit test

…onverts LocalDateTime to timestamp is not consistent with the internal default precision of ORC
@github-actions github-actions bot added the JAVA label Dec 1, 2021
@guiyanakuang
Copy link
Member Author

cc @dongjoon-hyun @williamhyun : )

@dongjoon-hyun
Copy link
Member

Thank you so much, @guiyanakuang !

@@ -372,4 +370,15 @@ Converter buildConverter(IntWritable startOffset, TypeDescription schema) {
throw new IllegalArgumentException("Unhandled type " + schema);
}
}

public static void main(String[] args) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you remove this please?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very sorry for committing some validation code

}

@Test
public void testConvertCustomTimestampFromCsv() throws IOException, ParseException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding this.

for (int row = 0; row < batch.size; ++row) {
Timestamp timestamp = Timestamp.valueOf(timeValues[row]);
assertEquals(timestamp.getTime(), tcv.time[row]);
assertEquals(timestamp.getNanos(), tcv.nanos[row]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to pass without your patch. Could you verify this again?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified before submitting that my local default timezone, and the America/New_York timezone on which jira uploaded the file, would not pass the test based on past code.

The reason for this is the inconsistent interpretation of the timestamp offfset accuracy when writing and reading.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    Timestamp timestamp1 = Timestamp.from(localDateTime.toInstant());
    Timestamp timestamp2 = Timestamp.valueOf(localDateTime);

Both methods use the local time zone to calculate the timestamp, but the offset precision of the time zone of the previous method is seconds (for New York is 17762 s), and the precision of the latter method is hours (for New York is 18000 s, which is 5 hours)
And the readings are interpreted by the latter precision (hour)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran this yesterday and today. It passed on my laptop without your patch.

$ git diff main --stat
 java/tools/src/test/org/apache/orc/tools/convert/TestConvert.java | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 89 insertions(+)

$ mvn package -pl tools -Dtest=org.apache.orc.tools.convert.TestConvert
...
[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.orc.tools.convert.TestConvert
Processing test.csv
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.337 s - in org.apache.orc.tools.convert.TestConvert
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference between you and me seems to be the timezones. You are in Asia timezone and I'm in America/Los Angeles timezone.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make the test case to be independent from the tester's environment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I should specify the time zone to test, but there are many methods inside ORC to get the local default time zone, so it's not good to cover it uniformly at the moment, I need to think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongjoon-hyun I added two hooks using annotations to ensure that the default time zone during testing this class is New York, which feels good. Also I removed a redundant statement 😅.
You can try again when you have time, I'm sure the old code will not pass the test.

@dongjoon-hyun
Copy link
Member

Also, cc @wgtmac since C++ tool seems to have the same issue according to the JIRA.

@guiyanakuang
Copy link
Member Author

Also, cc @wgtmac since C++ tool seems to have the same issue according to the JIRA.

I have also debugged some of the C++ code. It looks like strptime can't convert values before 1970 on mac os.

char *left=strptime(col.c_str(), "%Y-%m-%d %H:%M:%S", &timeStruct);

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @guiyanakuang .
For C++ case, could you file a new JIRA for that?
ORC-1053 is enough for Java-only patch

@dongjoon-hyun dongjoon-hyun merged commit e787b8b into apache:main Dec 3, 2021
dongjoon-hyun pushed a commit that referenced this pull request Dec 3, 2021
…LocalDateTime` to `Timestamp` is not consistent with the internal default precision of ORC (#967)

### What changes were proposed in this pull request?

```java
// use tool compute offset : 17762
int toolOffset = ((LocalDateTime) temporalAccessor).atZone(TimeZone.getTimeZone("America/New_York").toZoneId()).getOffset().getTotalSeconds();

// in orc internal compute offset: 18000
int orcInternalOffset = TimeZone.getTimeZone("America/New_York").getRawOffset() / 1000
```

This pr is designed to modify the implementation of the LocalDateTime to Timestamp conversion so that the time zone accuracy is consistent with the ORC internal accuracy during the conversion

### Why are the changes needed?
Avoid inconsistencies between converted data and expected data from convert tools.

### How was this patch tested?
Add issue-specific unit test

(cherry picked from commit e787b8b)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Copy link
Member

Merged to main/1.7.

@dongjoon-hyun dongjoon-hyun added this to the 1.7.2 milestone Dec 3, 2021
@guiyanakuang
Copy link
Member Author

+1, LGTM. Thank you, @guiyanakuang . For C++ case, could you file a new JIRA for that? ORC-1053 is enough for Java-only patch

Okay, I created the ORC-1055

@dongjoon-hyun
Copy link
Member

Thank you!

@wgtmac
Copy link
Member

wgtmac commented Dec 3, 2021

Also, cc @wgtmac since C++ tool seems to have the same issue according to the JIRA.

I have also debugged some of the C++ code. It looks like strptime can't convert values before 1970 on mac os.

char *left=strptime(col.c_str(), "%Y-%m-%d %H:%M:%S", &timeStruct);

Pre-1970 timestamp is a notorious issue. Not sure if these test cases will help:
https://github.com/apache/orc/blob/main/c%2B%2B/test/TestWriter.cc#L621
https://github.com/apache/orc/blob/main/c%2B%2B/test/TestWriter.cc#L741

@guiyanakuang
Copy link
Member Author

@wgtmac , Unfortunately I think this is a new issue.

int main(int,char* argv[]){
  struct tm timeStruct;
  std::string name = "0001-01-01 00:00:00.000";
  char *left=strptime(name.c_str(), "%Y-%m-%d %H:%M:%S", &timeStruct);
  time_t _time = timegm(&timeStruct);
  char *tail;
  double d = strtod(left, &tail);
  long na;
  if (tail != left) {
    na = static_cast<long>(d * 1000000000.0);
  } else {
    na = 0;
  }
  std::cout << _time << " " << na << std::endl;
}

On my mac machine the output is "-1 0"
Using the online C++ runtime environment claimed on Linux, the output is "-62135596800 0"
Time format strings that appear to be older than 1970 get -1 on the mac.

https://github.com/apache/orc/blob/main/c%2B%2B/test/TestWriter.cc#L621
https://github.com/apache/orc/blob/main/c%2B%2B/test/TestWriter.cc#L741

It looks like the first test verifies that we can write time as a negative number.
The second one holds our custom time zone. So I think this is a new issue.

Of course I verified the converted_by_cpp.orc uploaded by the user in ORC-1055, and it is clear that it is not converted on the mac, so there should be another issue that is not yet clear

@wgtmac
Copy link
Member

wgtmac commented Dec 6, 2021

@wgtmac , Unfortunately I think this is a new issue.

int main(int,char* argv[]){
  struct tm timeStruct;
  std::string name = "0001-01-01 00:00:00.000";
  char *left=strptime(name.c_str(), "%Y-%m-%d %H:%M:%S", &timeStruct);
  time_t _time = timegm(&timeStruct);
  char *tail;
  double d = strtod(left, &tail);
  long na;
  if (tail != left) {
    na = static_cast<long>(d * 1000000000.0);
  } else {
    na = 0;
  }
  std::cout << _time << " " << na << std::endl;
}

On my mac machine the output is "-1 0" Using the online C++ runtime environment claimed on Linux, the output is "-62135596800 0" Time format strings that appear to be older than 1970 get -1 on the mac.

https://github.com/apache/orc/blob/main/c%2B%2B/test/TestWriter.cc#L621
https://github.com/apache/orc/blob/main/c%2B%2B/test/TestWriter.cc#L741

It looks like the first test verifies that we can write time as a negative number. The second one holds our custom time zone. So I think this is a new issue.

Of course I verified the converted_by_cpp.orc uploaded by the user in ORC-1055, and it is clear that it is not converted on the mac, so there should be another issue that is not yet clear

Thanks for the input. I will take a look later this week.

@dongjoon-hyun
Copy link
Member

Thank you so much, @guiyanakuang and @wgtmac .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants