[SPARK-5097][SQL] DataFrame #4173
Test build #26008 has started for PR 4173 at commit
Test build #26008 has finished for PR 4173 at commit
Test FAILed.
Test build #26010 has started for PR 4173 at commit
Test build #26010 has finished for PR 4173 at commit
Test FAILed.
Test build #26011 has started for PR 4173 at commit
Test build #26011 has finished for PR 4173 at commit
Test FAILed.
Test build #26033 has started for PR 4173 at commit
Test FAILed.
Test build #26034 has started for PR 4173 at commit
Test build #26033 has finished for PR 4173 at commit
Test FAILed.
Test build #26034 has finished for PR 4173 at commit
Test FAILed.
    Join(logicalPlan, right.logicalPlan, Inner, Some(joinExprs.expr))
  }

  override def join(right: DataFrame, joinType: String, joinExprs: Column): DataFrame = {
This would be easier to support in Python/R if joinType were the last parameter.
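A minimal sketch of the suggested ordering, assuming a single method with a defaulted trailing `joinType` rather than the two overloads in this diff. `Frame` and `Column` here are hypothetical stand-ins defined only for illustration, not the classes in this PR, and the string "plan" is just a way to show which arguments were passed.

```scala
// Hypothetical stand-ins for illustration only; not the Spark classes in this PR.
object JoinSignatureSketch {
  final case class Column(expr: String)

  final case class Frame(plan: String) {
    // joinType comes last and has a default, so callers can omit it.
    // Python requires defaulted parameters to follow required ones,
    // so this ordering maps directly onto a Python wrapper.
    def join(right: Frame, joinExprs: Column, joinType: String = "inner"): Frame =
      Frame(s"Join($plan, ${right.plan}, $joinType, ${joinExprs.expr})")
  }
}
```

With this shape, `people.join(dept, cond)` and `people.join(dept, cond, "left_outer")` are the same method, and the optional argument is always the trailing one.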
Test build #26102 has started for PR 4173 at commit
Test build #26102 has finished for PR 4173 at commit
Test FAILed.
Test build #26104 has started for PR 4173 at commit
   * // The following are equivalent:
   * peopleDf.filter($"age" > 15)
   * peopleDf.where($"age" > 15)
   * peopleDf($"age > 15)
missing closing quote?
Test FAILed.
Test build #26156 has finished for PR 4173 at commit
Test FAILed.
Test build #26157 has finished for PR 4173 at commit
Test FAILed.
  def toDF: DataFrame = this

  /** Return the schema of this [[DataFrame]]. */
  override def schema: StructType = queryExecution.analyzed.schema
Can we add a new higher-level type for schema as well? It is painful as a user to dig into StructType etc. Similarly, while applying a schema to an RDD it would be good to have a higher-level type / constructor.
there is dtypes here, no?
@shivaram Are you asking for something like RowType extends StructType?
Yeah, dtypes is close to what I was talking about, and it is probably sufficient for getting the schema out. However, while applying a schema to an RDD one still needs to construct a StructType etc. It would be great to have a lightweight way of saying something like DataFrame(rdd, colNames=c("age", "name"), colTypes=c("int", "character"))
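One way such a lightweight constructor could be layered on top is a helper that builds the schema from parallel lists of names and type strings. The sketch below is only an illustration under stated assumptions: `schemaOf` is a hypothetical name, the `StructType`, `StructField`, and data type definitions are simplified local stand-ins rather than the Spark SQL classes, and the type-name table (`"int"`, `"character"`, `"double"`) is assumed from the R-style example above.

```scala
// Simplified local stand-ins; the real Spark SQL types carry more information.
object SchemaSketch {
  sealed trait DataType
  case object IntegerType extends DataType
  case object StringType extends DataType
  case object DoubleType extends DataType

  final case class StructField(name: String, dataType: DataType, nullable: Boolean = true)
  final case class StructType(fields: Seq[StructField])

  // Assumed mapping from R-style type names to data types.
  private val typesByName: Map[String, DataType] = Map(
    "int" -> IntegerType,
    "character" -> StringType,
    "double" -> DoubleType
  )

  // Hypothetical helper: build a schema from column names and type names.
  def schemaOf(colNames: Seq[String], colTypes: Seq[String]): StructType = {
    require(colNames.length == colTypes.length,
      "colNames and colTypes must have the same length")
    StructType(colNames.zip(colTypes).map { case (name, typeName) =>
      val dataType = typesByName.getOrElse(
        typeName, throw new IllegalArgumentException(s"Unknown column type: $typeName"))
      StructField(name, dataType)
    })
  }
}
```

A call such as `SchemaSketch.schemaOf(Seq("age", "name"), Seq("int", "character"))` then yields a two-field schema without the caller touching `StructField` directly.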
fix collect with UDT and tests
Test build #26175 has started for PR 4173 at commit
Test build #26175 has finished for PR 4173 at commit
Test FAILed.

  override def getItem(ordinal: Column): Column = GetItem(expr, ordinal.expr)

  override def getField(fieldName: String): Column = GetField(expr, fieldName)
We might consider using apply instead of, or in addition to, this.
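A rough sketch of what the "in addition to" variant might look like, with `apply` simply delegating to `getField`. The `Expr`, `Attr`, `GetField`, and `Column` definitions are hypothetical simplified stand-ins, not the classes in this diff.

```scala
// Simplified stand-ins for illustration; not the classes in this PR.
object FieldAccessSketch {
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class GetField(child: Expr, fieldName: String) extends Expr

  final case class Column(expr: Expr) {
    def getField(fieldName: String): Column = Column(GetField(expr, fieldName))

    // apply as an alias, so col("city") reads like field access.
    def apply(fieldName: String): Column = getField(fieldName)
  }
}
```

With this, `Column(Attr("address"))("city")` and `Column(Attr("address")).getField("city")` build the same expression.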
Test build #26187 has started for PR 4173 at commit
   *
   * @param right Right side of the join.
   * @param joinExprs Join expression.
   * @param joinType One of: `inner`, `outer`, `left_outer`, `right_outer`, `semijoin`.
Doesn't a semi join also have to specify left/right?
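To make the question concrete, here is a small sketch of how the documented strings could be parsed into join types with the side of the semi join made explicit. Everything here is an assumption for illustration: the `JoinType` objects are local stand-ins, `parse` is a hypothetical helper, `left_semi` is an assumed spelling, and treating `semijoin` as a left semi join is one possible reading rather than something this diff confirms.

```scala
import java.util.Locale

// Local stand-ins for illustration; not the join types used by this PR.
object JoinTypeSketch {
  sealed trait JoinType
  case object Inner extends JoinType
  case object FullOuter extends JoinType
  case object LeftOuter extends JoinType
  case object RightOuter extends JoinType
  case object LeftSemi extends JoinType

  // Hypothetical parser for the joinType strings listed in the scaladoc.
  def parse(name: String): JoinType = name.toLowerCase(Locale.ROOT) match {
    case "inner"       => Inner
    case "outer"       => FullOuter
    case "left_outer"  => LeftOuter
    case "right_outer" => RightOuter
    // Assumption: an unqualified semi join keeps rows from the left side.
    case "semijoin" | "left_semi" => LeftSemi
    case other =>
      throw new IllegalArgumentException(s"Unsupported join type: $other")
  }
}
```

Naming the side in the accepted string (as with the assumed `left_semi`) would remove the ambiguity raised here.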
Test build #26187 has finished for PR 4173 at commit
Test PASSed.
Alright, I'm going to merge this one since it touches too many moving parts. I will submit another PR later today to update documentation and address Michael's comments. It will also add more tests.
This pull request redesigns the existing Spark SQL DSL, which already provides data-frame-like functionality.
TODOs:
With the exception of Python support, other tasks can be done in separate, follow-up PRs.