
[SPARK-12010][SQL] Spark JDBC requires support for column-name-free INSERT syntax #10380

Closed
wants to merge 7 commits

Conversation

CK50
Contributor

CK50 commented Dec 18, 2015

In the past, Spark JDBC writes only worked with technologies that support the following INSERT statement syntax (JdbcUtils.scala: insertStatement()):

INSERT INTO $table VALUES ( ?, ?, ..., ? )

But some technologies require a list of column names:

INSERT INTO $table ( $colNameList ) VALUES ( ?, ?, ..., ? )

This was blocking the use of e.g. the Progress JDBC Driver for Cassandra.

Another limitation is that the first syntax relies on the dataframe field ordering matching that of the target table. This works fine as long as the target table has been created by writer.jdbc().

If the target table contains more columns (i.e., it was not created by writer.jdbc()), then the insert fails due to a mismatch in the number of columns or their data types.

This PR switches to the recommended second INSERT syntax. Column names are taken from the dataframe field names.
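
For illustration, a minimal sketch of the failure mode and the fix (the table, columns, and connection URL below are made up for this example, not taken from the PR):

import java.util.Properties

// Assume the target table was created outside writer.jdbc(), with a
// different column order and an extra column:
//   CREATE TABLE people (id INT, note VARCHAR(64), name VARCHAR(32))
val df = sqlContext.createDataFrame(Seq(("alice", 1), ("bob", 2))).toDF("name", "id")

// Old generated statement: INSERT INTO people VALUES (?, ?)
//   binds by position, so "alice" lands in the id column and the insert fails.
// New generated statement: INSERT INTO people (name, id) VALUES (?, ?)
//   binds by name; the extra nullable note column is simply left NULL.
df.write.mode("append").jdbc("jdbc:oracle:thin:@//host:1521/svc", "people", new Properties())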

@srowen
Member

srowen commented Dec 18, 2015

Jenkins, add to whitelist

@SparkQA

SparkQA commented Dec 18, 2015

Test build #48007 has finished for PR 10380 at commit 0772db3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 18, 2015

Test build #2232 has finished for PR 10380 at commit 0772db3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

@CK50 does the change in syntax work on all dialects currently supported by Spark SQL?

fieldsLeft = fieldsLeft - 1
}
conn.prepareStatement(sql.toString())
def insertStatement(conn: Connection,
Contributor

Nit: this method is only used by savePartition; we could integrate it there, since most of the code was moved to JdbcDialects.getInsertStatement.

Contributor Author

@hvanhovell
It works fine on Oracle and on Cassandra (Progress JDBC driver for Cassandra).
As for other RDBMSs: I was surprised that the column-name-free syntax is supported by so many databases. In all my work across different RDBMSs, the syntax with column names has been much more the standard, but I have not tested on anything other than Oracle and Cassandra.

Contributor Author

@hvanhovell
re integrating JdbcDialects.getInsertStatement: I can certainly do so, if desired.

Contributor

@CK50 I don't have a strong opinion on this. I'd personally remove it, that's all.

@hvanhovell
Contributor

@CK50 I did some quick googling; all of the dialects support this syntax (it might have been a dumb question on my part). LGTM.

@SparkQA

SparkQA commented Dec 18, 2015

Test build #48017 has finished for PR 10380 at commit a59c1aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -60,20 +60,6 @@ object JdbcUtils extends Logging {
}

/**
* Returns a PreparedStatement that inserts a row into table via conn.
Member

Hm, the only problem here is that this is a public method, and while it feels like it was intended to be a Spark-only utility method, I'm not sure it's marked as such.

It's not a big deal to retain it and implement it in terms of the new method. However, it's now a function of a dialect, which is not an argument here. I suppose any dialect will do, since they all behave the same now. This method could then be deprecated.

However: yeah, the behavior is actually the same for all dialects now. Really, this has come full circle and can just be a modification to this method, which was already the same for all dialects. Is there reason to believe the insert statement might vary later? Then I could see keeping the current structure here and just deprecating this method.
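
For illustration, retaining the old entry point could look roughly like this sketch (the deprecation note and the dialect lookup are illustrative assumptions, not what was merged):

// Keep the old public signature and delegate to the new dialect-aware one.
// Any dialect will do here, since the generated INSERT is the same for all of them.
@deprecated("Use insertStatement(conn, table, rddSchema, dialect)", "1.6.0")
def insertStatement(conn: Connection, table: String,
    rddSchema: StructType): PreparedStatement =
  insertStatement(conn, table, rddSchema, JdbcDialects.get(""))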

Contributor

Yeah, you are right about that. We'll break a public API. Is that a problem, since this is probably going into 2.0?

Member

That's a fair point, though I think the intent was to back-port this to 1.6.x at least, as it's a moderately important fix. Conceptually, I don't think the API has changed; the insert statement is still not dialect-specific. Hence it seems like the current API is even desirable to maintain for now.

Contributor

Ok, we have to keep the method if we are back-porting. I have yet to encounter a database that doesn't support this insert syntax (I did a bit more googling), so it seems safe to put it in the generic method.

def insertStatement(conn: Connection,
    table: String,
    rddSchema: StructType,
    dialect: JdbcDialect): PreparedStatement = {
Member

Here you've still changed the API though. I think the point is that it's not actually dialect-specific even after your change, right?

@SparkQA

SparkQA commented Dec 22, 2015

Test build #48168 has finished for PR 10380 at commit c130e31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

moved generation of SQL insert statement back into JdbcUtils.scala
@SparkQA

SparkQA commented Dec 22, 2015

Test build #48189 has finished for PR 10380 at commit 79d3e0d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

fieldsLeft = fieldsLeft - 1
}
conn.prepareStatement(sql.toString())
val sql = rddSchema.fields.map(field => field.name)
Member

Yes, very close now. I think that can be tightened up in a few minor ways, like map(_.name) and map(_ => "?"). There's an extra space inside the paren in line 67, and the VALUES clause inserts extra spaces before, between, and after the ?s. Finally, the wrapping is a little funny. Maybe break this down into a val for the column names clause and a val for the ? placeholders clause, and then return their interpolation into the final single format string, INSERT INTO $table $columns VALUES $placeholders, or something.
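
Something along those lines (a sketch of the suggested shape; the exact merged code may differ):

// Build the column list and placeholder list separately, then interpolate.
val columns = rddSchema.fields.map(_.name).mkString(",")
val placeholders = rddSchema.fields.map(_ => "?").mkString(",")
val sql = s"INSERT INTO $table ($columns) VALUES ($placeholders)"
conn.prepareStatement(sql)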

@SparkQA

SparkQA commented Dec 22, 2015

Test build #48194 has finished for PR 10380 at commit cae0f58.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 22, 2015

Test build #48199 has finished for PR 10380 at commit 5a7f262.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Dec 22, 2015

Yes, nice and neat and addresses the issue directly. This LGTM

@asfgit closed this in 502476e Dec 24, 2015
@srowen
Member

srowen commented Dec 24, 2015

Merged to master/1.6

asfgit pushed a commit that referenced this pull request Dec 24, 2015
[SPARK-12010][SQL] Spark JDBC requires support for column-name-free INSERT syntax

Author: CK50 <[email protected]>

Closes #10380 from CK50/master-SPARK-12010-2.

(cherry picked from commit 502476e)
Signed-off-by: Sean Owen <[email protected]>