-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-4985][SQL Parquet] Parquet date support #3855
Conversation
Jenkins, this is ok to test |
Test build #24979 has finished for PR 3855 at commit
|
cc @marmbrus |
d29224e
to
8851d1a
Compare
Test build #25509 has finished for PR 3855 at commit
|
8851d1a
to
929f294
Compare
Test build #25510 has finished for PR 3855 at commit
|
@@ -55,7 +55,7 @@ class ParquetSchemaSuite extends FunSuite with ParquetTest { | |||
|} | |||
""".stripMargin) | |||
|
|||
testSchema[(Byte, Short, Int, Long)]( | |||
testSchema[(Byte, Short, Int, Long, java.sql.Date)]( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Usually I'd prefer import java.sql.Date
and just use Date
here.
Seems like this is subsumed by #3822, and thus we can close this issue. |
@@ -207,6 +207,7 @@ private[parquet] class RowWriteSupport extends WriteSupport[Row] with Logging { | |||
case DoubleType => writer.addDouble(value.asInstanceOf[Double]) | |||
case FloatType => writer.addFloat(value.asInstanceOf[Float]) | |||
case BooleanType => writer.addBoolean(value.asInstanceOf[Boolean]) | |||
case DateType => writer.addInteger(value.asInstanceOf[java.sql.Date].getTime.toInt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't conform to the Parquet specification for date and produces invalid data.
When using the DATE
annotation, the value must be the number of days from the Unix epoch, 1 January 1970. java.sql.Date
and java.util.Date
are backed by a long timestamp, the number of milliseconds from the Unix epoch (which is a Parquet TIMESTAMP_MILLIS
) and casting that value to an integer makes it impossible to recover the real date.
I recommend using TIMESTAMP_MILLIS
instead of date here (you won't need the toInt
part). That seems to be what you want, if you're interested in using java.sql.Date
. The reason why there is a name mismatch is that the Parquet types mirror SQL types more closely than Java objects.
I just looked at #3822 and it looks correct, so you can ignore my review comment above. In the future, please feel free to ping me for reviews when you're using logical types like Date, Timestamp, Decimal, etc. in either Parquet or Avro. I've been working on the specs in those communities and I'm happy to make sure the implementations look correct. |
@liancheng, I'll take a look as soon as I can. I'm a little swamped this week though, so I can't guarantee it'll be quick. Sorry! |
No description provided.