Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23272][SQL] add calendar interval type support to ColumnVector #20438

Closed
wants to merge 2 commits into from

Conversation

cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Jan 30, 2018

What changes were proposed in this pull request?

ColumnVector is aimed to support all the data types, but CalendarIntervalType is missing. Actually we do support interval type for inner fields, e.g. ColumnarRow, ColumnarArray both support interval type. It's weird if we don't support interval type at the top level.

This PR adds the interval type support.

This PR also makes ColumnVector.getChild protect. We need it public because MutableColumnaRow.getInterval needs it. Now the interval implementation is in ColumnVector.getInterval.

How was this patch tested?

a new test.

@cloud-fan
Copy link
Contributor Author

@@ -195,6 +196,7 @@
* struct field.
*/
public final ColumnarRow getStruct(int rowId) {
if (isNullAt(rowId)) return null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

int month = data.getChild(0).getInt(offset + ordinal);
long microseconds = data.getChild(1).getLong(offset + ordinal);
return new CalendarInterval(month, microseconds);
return data.getInterval(offset + ordinal);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should insert if (data.isNullAt(offset + ordinal)) return null; to be consistent with other ColumnarXxxs?
Or I guess we can remove these null-checks from all other ColumnarXxxs and leave it to ColumnVector.getInterval()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I'm going to require ColumnVector.getXXX to return null if the value is null, but I'll do it in another PR, to update all the documents and define the behavior of batched getXXX methods.

@@ -87,7 +87,7 @@ public static CalendarInterval fromString(String s) {
}
}

public static long toLongWithRange(String fieldName,
private static long toLongWithRange(String fieldName,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?! It's much harder (if at all possible) to test private methods (been bitten few times this week and remember the pain).

*
* In Spark, calendar interval type value is basically an integer value representing the number of
* months in this interval, and a long value representing the number of microseconds in this
* interval. A interval type vector is same as a struct type vector with 2 fields: `months` and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"An interval" + "is the same as"

* `microseconds`.
*
* To support interval type, implementations must implement {@link #getChild(int)} and define 2
* child vectors: the first child vector is a int type vector, containing all the month values of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is an int type vector

*/
public final CalendarInterval getInterval(int rowId) {
if (isNullAt(rowId)) return null;
final int months = getChild(0).getInt(rowId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of final keyword here?

Copy link
Contributor Author

@cloud-fan cloud-fan Jan 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's from the previous code, probably it tries to make the compiler happy and run the code faster.

final long microseconds = getChild(1).getLong(rowId);
return new CalendarInterval(months, microseconds);
}

/**
* Returns the ordinal's child column vector.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@return [[ColumnVector]] at the ordinal

* interval. An interval type vector is the same as a struct type vector with 2 fields: `months`
* and `microseconds`.
*
* To support interval type, implementations must implement {@link #getChild(int)} and define 2
Copy link
Member

@kiszk kiszk Jan 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: interval type -> calendar interval type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little annoying to type calendar interval type all the time...

@SparkQA
Copy link

SparkQA commented Jan 30, 2018

Test build #86819 has finished for PR 20438 at commit 4e53891.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 30, 2018

Test build #86825 has finished for PR 20438 at commit 2f23a1d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*/
public abstract ColumnVector getChild(int ordinal);
protected abstract ColumnVector getChild(int ordinal);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. Now, it became protected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since ColumnVector is public, could you add some description in PR description for this kind of visibility change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

@ueshin
Copy link
Member

ueshin commented Jan 31, 2018

LGTM.

@@ -146,9 +146,7 @@ public UTF8String getUTF8String(int ordinal) {
@Override
public CalendarInterval getInterval(int ordinal) {
if (columns[ordinal].isNullAt(rowId)) return null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need this null check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see #20438 (comment) . In this PR I just fixed the returning null issue for getStruct and getInterval, because they are non-abstract. There should be a follow up to clearly document that ColumnVector.getBinary/getUTF8String/... should return null if this slot is null. Then we can remove these null checks here. I appreciate it if you have time to take this :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me try to prepare a PR tonight.

@@ -139,9 +139,7 @@ public UTF8String getUTF8String(int ordinal) {
@Override
public CalendarInterval getInterval(int ordinal) {
if (data.getChild(ordinal).isNullAt(rowId)) return null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this null check now?

@cloud-fan
Copy link
Contributor Author

thanks, merging to master/2.3!

asfgit pushed a commit that referenced this pull request Jan 31, 2018
## What changes were proposed in this pull request?

`ColumnVector` is aimed to support all the data types, but `CalendarIntervalType` is missing. Actually we do support interval type for inner fields, e.g. `ColumnarRow`, `ColumnarArray` both support interval type. It's weird if we don't support interval type at the top level.

This PR adds the interval type support.

This PR also makes `ColumnVector.getChild` protect. We need it public because `MutableColumnaRow.getInterval` needs it. Now the interval implementation is in `ColumnVector.getInterval`.

## How was this patch tested?

a new test.

Author: Wenchen Fan <[email protected]>

Closes #20438 from cloud-fan/interval.

(cherry picked from commit 695f714)
Signed-off-by: Wenchen Fan <[email protected]>
@asfgit asfgit closed this in 695f714 Jan 31, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants