[SPARK-12010][SQL] Spark JDBC requires support for column-name-free INSERT syntax #10380
Conversation
Jenkins, add to whitelist
Test build #48007 has finished for PR 10380 at commit
Test build #2232 has finished for PR 10380 at commit
@CK50 does the change in syntax work on all dialects currently supported by Spark SQL?
      fieldsLeft = fieldsLeft - 1
    }
    conn.prepareStatement(sql.toString())
  def insertStatement(conn: Connection,
Nit: This method is only used by savePartition; we could integrate it there, since most of the code was moved to JdbcDialects.getInsertStatement.
@hvanhovell
It works fine on Oracle and on Cassandra (Progress JDBC driver for Cassandra).
Regarding other RDBMS: I was surprised that the column-name-free syntax is supported on so many databases. In all my work on different RDBMS, the syntax with column names has been much more the standard. But I have not tested on anything other than Oracle and Cassandra.
@hvanhovell
re integrating JdbcDialects.getInsertStatement: I can certainly do so, if desired.
@CK50 I don't have a strong opinion on this. I'd personally remove it, that's all.
@CK50 I did some quick googling; all of the dialects support this syntax (it might have been a dumb question on my part). LGTM.
Test build #48017 has finished for PR 10380 at commit
@@ -60,20 +60,6 @@ object JdbcUtils extends Logging {
  }

  /**
   * Returns a PreparedStatement that inserts a row into table via conn.
Hm, the only problem here is that this is a public method, and while it feels like it was intended to be a Spark-only utility method, I'm not sure it's marked as such.
It's not a big deal to retain it and implement it in terms of the new method. However, it's now a function of a dialect, which is not an argument here. I suppose any dialect will do, since they all behave the same now. This method could then be deprecated.
However: yeah, the behavior is actually the same for all dialects now. Really, this has come full circle and can just be a modification to this method, which was already the same for all dialects. Is there reason to believe the insert statement might vary later? Then I could see keeping the current structure here and just deprecating this method.
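For concreteness, here is a rough sketch of the retain-and-deprecate option being discussed. This is not the merged implementation: the deprecation message and version are illustrative, and the dialect-aware overload mirrors the signature shown in the diff further down.

    import java.sql.{Connection, PreparedStatement}
    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
    import org.apache.spark.sql.types.StructType

    object JdbcUtilsSketch {
      // Sketch only: keep the old public entry point, deprecate it, and
      // delegate. Any dialect will do, since the generated statement is
      // now the same for all of them.
      @deprecated("The insert statement is no longer dialect-specific", "1.6.0")
      def insertStatement(conn: Connection, table: String,
          rddSchema: StructType): PreparedStatement =
        insertStatement(conn, table, rddSchema, JdbcDialects.get("jdbc:"))

      // Dialect-aware variant under discussion (signature from the diff below).
      def insertStatement(conn: Connection, table: String, rddSchema: StructType,
          dialect: JdbcDialect): PreparedStatement = {
        val columns = rddSchema.fields.map(_.name).mkString(", ")
        val placeholders = rddSchema.fields.map(_ => "?").mkString(", ")
        conn.prepareStatement(s"INSERT INTO $table ($columns) VALUES ($placeholders)")
      }
    }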
Yeah, you are right about that. We'll break a public API. Is that a problem, since this is probably going into 2.0?
That's a fair point, though I think the intent was to back-port this to 1.6.x at least, as it's a moderately important fix. Conceptually, I don't think the API has changed; the insert statement is still not dialect-specific. Hence it seems like the current API is even desirable to maintain for now.
Ok, we have to keep the method if we are back-porting. I have yet to encounter a database that doesn't support this insert syntax (I did a bit more googling), so it seems safe to put it in the generic method.
  def insertStatement(conn: Connection,
      table: String,
      rddSchema: StructType,
      dialect: JdbcDialect): PreparedStatement = {
Here you've still changed the API though. I think the point is that it's not actually dialect-specific even after your change, right?
Test build #48168 has finished for PR 10380 at commit
moved generation of SQL insert statement back into JdbcUtils.scala
Test build #48189 has finished for PR 10380 at commit
      fieldsLeft = fieldsLeft - 1
    }
    conn.prepareStatement(sql.toString())
  val sql = rddSchema.fields.map(field => field.name)
Yes, very close now. I think that can be tightened up in a few minor ways, like map(_.name) and map(_ => "?"). There's an extra space inside the paren in line 67, and the VALUES clause inserts extra spaces before, between, and after the ?s. Finally, the wrapping is a little funny. Maybe break this down into a val for the column names clause and a val for the ? placeholders clause, and then return their interpolation into the final single format string INSERT INTO $table $columns VALUES $placeholders, or something.
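Putting those suggestions together, a sketch of what the tightened method might look like (shown standalone and simplified; the real method lives in JdbcUtils.scala, and the merged form may differ in details):

    import java.sql.{Connection, PreparedStatement}
    import org.apache.spark.sql.types.StructType

    def insertStatement(conn: Connection, table: String,
        rddSchema: StructType): PreparedStatement = {
      // One val per clause, as suggested: column names and ? placeholders.
      val columns = rddSchema.fields.map(_.name).mkString(", ")
      val placeholders = rddSchema.fields.map(_ => "?").mkString(", ")
      // A single interpolated format string, with no stray spaces in the parens.
      val sql = s"INSERT INTO $table ($columns) VALUES ($placeholders)"
      conn.prepareStatement(sql)
    }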
Test build #48194 has finished for PR 10380 at commit
Test build #48199 has finished for PR 10380 at commit
Yes, nice and neat and addresses the issue directly. This LGTM.
Merged to master/1.6
[SPARK-12010][SQL] Spark JDBC requires support for column-name-free INSERT syntax

Author: CK50 <[email protected]>

Closes #10380 from CK50/master-SPARK-12010-2.

(cherry picked from commit 502476e)
Signed-off-by: Sean Owen <[email protected]>
In the past, Spark JDBC write only worked with technologies which support the following INSERT statement syntax (JdbcUtils.scala: insertStatement()):
INSERT INTO $table VALUES ( ?, ?, ..., ? )
But some technologies require a list of column names:
INSERT INTO $table ( $colNameList ) VALUES ( ?, ?, ..., ? )
This was blocking the use of e.g. the Progress JDBC Driver for Cassandra.
Another limitation is that syntax 1 relies on the dataframe field ordering matching that of the target table. This works fine as long as the target table has been created by writer.jdbc().
If the target table contains more columns (not created by writer.jdbc()), then the insert fails due to a mismatch in the number of columns or their data types.
This PR switches to the recommended second INSERT syntax. Column names are taken from dataframe field names.
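To illustrate the difference, a small sketch (with a hypothetical schema and table name) of the statement generated under each syntax for a two-column dataframe:

    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    // Hypothetical schema standing in for a dataframe written via writer.jdbc().
    val rddSchema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("name", StringType)))
    val table = "people" // hypothetical target table
    val placeholders = rddSchema.fields.map(_ => "?").mkString(", ")

    // Syntax 1 (old): column-name-free; relies on field order matching the table.
    val oldSql = s"INSERT INTO $table VALUES ($placeholders)"
    // INSERT INTO people VALUES (?, ?)

    // Syntax 2 (new): explicit column names taken from the dataframe fields.
    val columns = rddSchema.fields.map(_.name).mkString(", ")
    val newSql = s"INSERT INTO $table ($columns) VALUES ($placeholders)"
    // INSERT INTO people (id, name) VALUES (?, ?)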