[SPARK-23272][SQL] add calendar interval type support to ColumnVector #20438

Closed. Wants to merge 2 commits.
@@ -146,9 +146,7 @@ public byte[] getBinary(int ordinal) {
   @Override
   public CalendarInterval getInterval(int ordinal) {
     if (columns[ordinal].isNullAt(rowId)) return null;

[Member] Do we still need this null check?

[Contributor, author] See #20438 (comment). In this PR I only fixed the returning-null issue for getStruct and getInterval, because they are non-abstract. There should be a follow-up to clearly document that ColumnVector.getBinary/getUTF8String/... should return null if the slot is null; then we can remove these null checks. I'd appreciate it if you have time to take this on :)

[Member] Let me try to prepare a PR tonight.

-    final int months = columns[ordinal].getChild(0).getInt(rowId);
-    final long microseconds = columns[ordinal].getChild(1).getLong(rowId);
-    return new CalendarInterval(months, microseconds);
+    return columns[ordinal].getInterval(rowId);
   }

@Override
@@ -28,7 +28,7 @@
 import org.apache.spark.unsafe.types.UTF8String;

 /**
- * A column vector backed by Apache Arrow. Currently time interval type and map type are not
+ * A column vector backed by Apache Arrow. Currently calendar interval type and map type are not
  * supported.
  */
 @InterfaceStability.Evolving
@@ -20,6 +20,7 @@
 import org.apache.spark.sql.catalyst.util.MapData;
 import org.apache.spark.sql.types.DataType;
 import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.unsafe.types.CalendarInterval;
 import org.apache.spark.unsafe.types.UTF8String;

/**
@@ -195,6 +196,7 @@ public double[] getDoubles(int rowId, int count) {
* struct field.
*/
   public final ColumnarRow getStruct(int rowId) {
+    if (isNullAt(rowId)) return null;

[Member] Good catch!

     return new ColumnarRow(this, rowId);
   }

@@ -236,9 +238,29 @@ public MapData getMap(int ordinal) {
public abstract byte[] getBinary(int rowId);

   /**
-   * Returns the ordinal's child column vector.
+   * Returns the calendar interval type value for rowId.
+   *
+   * In Spark, a calendar interval type value is basically an integer value representing the number
+   * of months in this interval, and a long value representing the number of microseconds in this
+   * interval. An interval type vector is the same as a struct type vector with 2 fields: `months`
+   * and `microseconds`.
+   *
+   * To support interval type, implementations must implement {@link #getChild(int)} and define 2

[kiszk, Member, Jan 30, 2018] nit: interval type -> calendar interval type

[Contributor, author] It's a little annoying to type "calendar interval type" all the time...

+   * child vectors: the first child vector is an int type vector, containing all the month values of
+   * all the interval values in this vector. The second child vector is a long type vector,
+   * containing all the microsecond values of all the interval values in this vector.
+   */
+  public final CalendarInterval getInterval(int rowId) {
+    if (isNullAt(rowId)) return null;
+    final int months = getChild(0).getInt(rowId);

[Contributor] What's the purpose of the final keyword here?

[cloud-fan, Contributor, author, Jan 30, 2018] It's from the previous code; probably it tries to make the compiler happy and run the code faster.

+    final long microseconds = getChild(1).getLong(rowId);
+    return new CalendarInterval(months, microseconds);
+  }
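The struct-of-two-children layout described in the Javadoc above can be sketched outside Spark as a pair of primitive child arrays plus a null mask. Everything below (class and field names) is a hypothetical, minimal illustration, not Spark's actual ColumnVector API:

```java
// Hypothetical sketch: an interval column modeled as two "child" primitive
// arrays (months, microseconds) plus a null mask, mirroring the contract
// that getInterval returns null for a null slot and otherwise combines the
// two child values into one interval. Not Spark code.
public class SimpleIntervalVector {
    /** Stand-in for Spark's CalendarInterval (months + microseconds). */
    public static final class Interval {
        public final int months;
        public final long microseconds;
        public Interval(int months, long microseconds) {
            this.months = months;
            this.microseconds = microseconds;
        }
    }

    private final boolean[] isNull;     // null mask, one flag per row
    private final int[] months;         // child vector 0: month values
    private final long[] microseconds;  // child vector 1: microsecond values

    public SimpleIntervalVector(int capacity) {
        isNull = new boolean[capacity];
        months = new int[capacity];
        microseconds = new long[capacity];
    }

    public void put(int rowId, int m, long us) {
        months[rowId] = m;
        microseconds[rowId] = us;
    }

    public void putNull(int rowId) {
        isNull[rowId] = true;
    }

    // Null slot yields null; otherwise read both children at the same rowId.
    public Interval getInterval(int rowId) {
        if (isNull[rowId]) return null;
        return new Interval(months[rowId], microseconds[rowId]);
    }

    public static void main(String[] args) {
        SimpleIntervalVector v = new SimpleIntervalVector(3);
        v.put(0, 1, 100L);
        v.putNull(1);
        v.put(2, 20, 0L);
        Interval i0 = v.getInterval(0);
        if (i0.months != 1 || i0.microseconds != 100L) throw new AssertionError();
        if (v.getInterval(1) != null) throw new AssertionError();
        if (v.getInterval(2).months != 20) throw new AssertionError();
    }
}
```

This also shows why the PR can centralize the null check: once getInterval itself handles the null slot, callers no longer need their own isNullAt guard.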

   /**
    * @return child [[ColumnVector]] at the given ordinal.
    */
-  public abstract ColumnVector getChild(int ordinal);
+  protected abstract ColumnVector getChild(int ordinal);

[Member] Oh, I see. Now it became protected.

[Member] Since ColumnVector is public, could you add some description in the PR description for this kind of visibility change?

[Contributor, author] added

/**
* Data type for this column.
@@ -135,9 +135,7 @@ public byte[] getBinary(int ordinal) {

   @Override
   public CalendarInterval getInterval(int ordinal) {
-    int month = data.getChild(0).getInt(offset + ordinal);
-    long microseconds = data.getChild(1).getLong(offset + ordinal);
-    return new CalendarInterval(month, microseconds);
+    return data.getInterval(offset + ordinal);

[Member] We should insert `if (data.isNullAt(offset + ordinal)) return null;` to be consistent with the other ColumnarXxxs? Or I guess we can remove these null checks from all the other ColumnarXxxs and leave it to ColumnVector.getInterval()?

[Contributor, author] Good catch! I'm going to require ColumnVector.getXXX to return null if the value is null, but I'll do it in another PR, to update all the documents and define the behavior of the batched getXXX methods.

   }
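The "offset + ordinal" pattern above, where an array view translates its local ordinal into a row of the backing column, can be illustrated with a self-contained sketch (names are hypothetical, not Spark code):

```java
// Hypothetical sketch of the ColumnarArray-style delegation: a view over a
// backing column does not copy data; element i of a view starting at
// `offset` is simply row (offset + i) of the backing column.
public class OffsetViewDemo {
    // Backing column values, shared by all views.
    static final long[] DATA = {10L, 20L, 30L, 40L, 50L};

    // Read element `ordinal` of a view that starts at row `offset`.
    static long getFromView(int offset, int ordinal) {
        return DATA[offset + ordinal];
    }

    public static void main(String[] args) {
        // A view starting at row 2 with length 3 sees {30, 40, 50}.
        if (getFromView(2, 0) != 30L) throw new AssertionError();
        if (getFromView(2, 2) != 50L) throw new AssertionError();
    }
}
```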

@Override
@@ -139,9 +139,7 @@ public byte[] getBinary(int ordinal) {
   @Override
   public CalendarInterval getInterval(int ordinal) {
     if (data.getChild(ordinal).isNullAt(rowId)) return null;

[Member] Can we remove this null check now?

-    final int months = data.getChild(ordinal).getChild(0).getInt(rowId);
-    final long microseconds = data.getChild(ordinal).getChild(1).getLong(rowId);
-    return new CalendarInterval(months, microseconds);
+    return data.getChild(ordinal).getInterval(rowId);
   }

@Override
@@ -620,6 +620,39 @@ class ColumnarBatchSuite extends SparkFunSuite {
     assert(column.arrayData().elementsAppended == 0)
   }

+  testVector("CalendarInterval APIs", 4, CalendarIntervalType) {
+    column =>
+      val reference = mutable.ArrayBuffer.empty[CalendarInterval]
+
+      val months = column.getChild(0)
+      val microseconds = column.getChild(1)
+      assert(months.dataType() == IntegerType)
+      assert(microseconds.dataType() == LongType)
+
+      months.putInt(0, 1)
+      microseconds.putLong(0, 100)
+      reference += new CalendarInterval(1, 100)
+
+      months.putInt(1, 0)
+      microseconds.putLong(1, 2000)
+      reference += new CalendarInterval(0, 2000)
+
+      column.putNull(2)
+      reference += null
+
+      months.putInt(3, 20)
+      microseconds.putLong(3, 0)
+      reference += new CalendarInterval(20, 0)
+
+      reference.zipWithIndex.foreach { case (v, i) =>
+        val errMsg = "VectorType=" + column.getClass.getSimpleName
+        assert(v == column.getInterval(i), errMsg)
+        if (v == null) assert(column.isNullAt(i), errMsg)
+      }
+
+      column.close()
+  }
+
   testVector("Int Array", 10, new ArrayType(IntegerType, true)) {
     column =>

@@ -739,14 +772,20 @@

       c1.putInt(0, 123)
       c2.putDouble(0, 3.45)
-      c1.putInt(1, 456)
-      c2.putDouble(1, 5.67)
+
+      column.putNull(1)
+
+      c1.putInt(2, 456)
+      c2.putDouble(2, 5.67)

       val s = column.getStruct(0)
       assert(s.getInt(0) == 123)
       assert(s.getDouble(1) == 3.45)

-      val s2 = column.getStruct(1)
+      assert(column.isNullAt(1))
+      assert(column.getStruct(1) == null)
+
+      val s2 = column.getStruct(2)
       assert(s2.getInt(0) == 456)
       assert(s2.getDouble(1) == 5.67)
   }