Fix bugs related to JDBC driver precedence and PostgreSQL JDBC subprotocol #143

Closed
JoshRosen wants to merge 2 commits into master from postgres-driver-fixes

Conversation

@JoshRosen (Contributor)

This patch fixes two bugs related to JDBC drivers:

  • The RedshiftJDBCWrapper.getDriverClass method mistakenly used postgres as the PostgreSQL JDBC subprotocol, whereas the correct subprotocol is actually postgresql (see java.lang.IllegalArgumentException: Unsupported JDBC protocol: 'postgresql' #126). This was straightforward to fix.
  • When the user specified an explicit JDBC driver class, we would register that driver with the JDBC DriverManager but would not necessarily use it to create connections. In a nutshell, the problem is that multiple JDBC drivers on the classpath may claim to handle the same subprotocol, and there doesn't seem to be an intuitive way to control which of those drivers takes precedence. The change implemented in this patch is to iterate over the DriverManager's loaded drivers, pick the one whose class matches the requested driver, and use that driver to create the connection (a sketch of this lookup follows the description below).

Fixes #126.
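
For illustration, here is a minimal Scala sketch of the two fixes described above; the helper names (`getDriverClass`, `connectWithDriver`) are hypothetical and are not the library's actual API:

```scala
import java.sql.{Connection, Driver, DriverManager}
import java.util.Properties
import scala.collection.JavaConverters._

// Hypothetical helper: map a JDBC subprotocol to a driver class name.
// The first bug was keying this on "postgres" instead of the real subprotocol "postgresql".
def getDriverClass(subprotocol: String): String = subprotocol match {
  case "postgresql" => "org.postgresql.Driver"
  case other        => sys.error(s"Unsupported JDBC protocol: '$other'")
}

// Hypothetical helper: connect through the registered driver whose class matches the
// requested one, instead of letting DriverManager.getConnection() pick among all drivers
// that claim to handle the URL's subprotocol.
def connectWithDriver(driverClass: String, url: String, props: Properties): Connection = {
  val driver: Driver = DriverManager.getDrivers.asScala.collectFirst {
    case d if d.getClass.getCanonicalName == driverClass => d
  }.getOrElse(sys.error(s"Did not find registered driver with class $driverClass"))
  driver.connect(url, props)
}
```

Matching the driver instance by class name and connecting through it directly sidesteps DriverManager's own precedence rules.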

@JoshRosen added the bug label Dec 28, 2015
@JoshRosen added this to the 0.5.3 milestone Dec 28, 2015
@JoshRosen (Contributor, Author)

@yhuai @marmbrus, the JDBC driver precedence issue is not necessarily a bug given our user-facing documentation (if you read carefully what the documentation actually promises), but the current behavior is unintuitive to me and I think the new behavior implemented here is easier to reason about and more in line with what users expect. I'd like to discuss whether we should make a similar change in Spark's JDBC data source, since the approach in this patch would let us work around some of the other JDBC driver precedence issues we've seen in Spark (such as the Redshift driver taking precedence over the Postgres one for jdbc:postgresql:// URIs).

This change was motivated by the difficulties that I had while trying to write the Postgres integration suite that's included in this patch.

@codecov-io

Current coverage is 89.18%

Merging #143 into master will increase coverage by +0.21% as of 87f8bc9

@@            master    #143   diff @@
======================================
  Files           13      13       
  Stmts          635     638     +3
  Branches       140     140       
  Methods          0       0       
======================================
+ Hit            565     569     +4
  Partial          0       0       
+ Missed          70      69     -1


@yhuai commented Dec 28, 2015

looks good to me

@JoshRosen (Contributor, Author)

Cool. Let me take a look at the README / code comments to make sure that they reflect this change, then I'll merge.

I'm going to open a PR against Spark so that we mirror the same behavior there.

@JoshRosen closed this in cdf03d9 Dec 29, 2015
@JoshRosen deleted the postgres-driver-fixes branch December 29, 2015 00:03

@JoshRosen (Contributor, Author) commented on the diff:
// At the same time, we don't want to create a driver-per-connection, so we use the
// DriverManager's driver instances to handle that singleton logic for us.
val driver: Driver = DriverManager.getDrivers.asScala.collectFirst {
  case d if d.getClass.getCanonicalName == driverClass => d

I just realized that there's a bit of a bug here: if d is an instance of DriverWrapper then we need to compare against d.wrapped.getClass.getCanonicalName. I'll open a followup PR to fix this.
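
For illustration, a sketch of what that follow-up fix could look like: since spark-redshift cannot reference Spark's `DriverWrapper` class directly, the wrapper is detected by class name and its underlying driver is read reflectively; the `wrapped` accessor used here is an assumption about `DriverWrapper`'s shape, not verified API:

```scala
import java.sql.{Driver, DriverManager}
import scala.collection.JavaConverters._

// Sketch only: resolve a registered driver by class name, also matching drivers that
// Spark has wrapped in a DriverWrapper. The wrapper is detected by its simple class name
// and the underlying driver is obtained reflectively via a "wrapped" accessor (assumed).
def findRegisteredDriver(driverClass: String): Driver = {
  DriverManager.getDrivers.asScala.collectFirst {
    case d if d.getClass.getCanonicalName == driverClass => d
    case d if d.getClass.getSimpleName == "DriverWrapper" &&
      d.getClass.getMethod("wrapped").invoke(d).getClass.getCanonicalName == driverClass => d
  }.getOrElse {
    throw new IllegalArgumentException(s"Did not find registered driver with class $driverClass")
  }
}
```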

ghost pushed a commit to dbtsai/spark that referenced this pull request Jan 4, 2016
Spark SQL's JDBC data source allows users to specify an explicit JDBC driver to load (using the `driver` argument), but in the current code it's possible that the user-specified driver will not be used when it comes time to actually create a JDBC connection.

In a nutshell, the problem is that you might have multiple JDBC drivers on the classpath that claim to be able to handle the same subprotocol, so simply registering the user-provided driver class with our `DriverRegistry` and JDBC's `DriverManager` is not sufficient to ensure that it's actually used when creating the JDBC connection.

This patch addresses this issue by first registering the user-specified driver with the DriverManager, then iterating over the driver manager's loaded drivers in order to obtain the correct driver and use it to create a connection (previously, we just called `DriverManager.getConnection()` directly).

If a user did not specify a JDBC driver to use, then we call `DriverManager.getDriver` to figure out the class of the driver to use, then pass that class's name to executors; this guards against corner-case bugs in situations where the driver and executor JVMs might have different sets of JDBC drivers on their classpaths (previously, there was the (rare) potential for `DriverManager.getConnection()` to use different drivers on the driver and executors if the user had not explicitly specified a JDBC driver class and the classpaths were different).

This patch is inspired by a similar patch that I made to the `spark-redshift` library (databricks/spark-redshift#143), which contains its own modified fork of some of Spark's JDBC data source code (for cross-Spark-version compatibility reasons).

Author: Josh Rosen <[email protected]>

Closes apache#10519 from JoshRosen/jdbc-driver-precedence.
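
As a rough illustration of the approach described in this commit message (the function names below are hypothetical, not Spark's internal API):

```scala
import java.sql.{Connection, Driver, DriverManager}
import java.util.Properties
import scala.collection.JavaConverters._

// On the driver side: decide up front which JDBC driver class will be used, so the same
// class name can be shipped to executors even if their classpaths differ.
def resolveDriverClass(url: String, userSpecified: Option[String]): String =
  userSpecified.getOrElse(DriverManager.getDriver(url).getClass.getCanonicalName)

// On each executor: look the driver up by class name and connect through it directly,
// rather than calling DriverManager.getConnection() and letting it pick any driver that
// claims to handle the URL.
def makeConnectionFactory(url: String, driverClass: String, props: Properties): () => Connection =
  () => {
    val driver: Driver = DriverManager.getDrivers.asScala.collectFirst {
      case d if d.getClass.getCanonicalName == driverClass => d
    }.getOrElse(throw new IllegalStateException(
      s"Did not find registered driver with class $driverClass"))
    driver.connect(url, props)
  }
```

Resolving the class name once on the driver and shipping it to executors keeps both sides using the same JDBC driver even when their classpaths differ.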
asfgit pushed a commit to apache/spark that referenced this pull request Jan 4, 2016
Closes #10519 from JoshRosen/jdbc-driver-precedence.

(cherry picked from commit 6c83d93)
Signed-off-by: Yin Huai <[email protected]>
JoshRosen added a commit that referenced this pull request Jan 6, 2016
…etConnector()

This is a follow-up to #143; it fixes a corner-case bug in our scanning of registered JDBC drivers: we need to properly handle Spark's `DriverWrapper` drivers, which wrap JDBC drivers to make them accessible from the root classloader so that the `DriverManager` can find them.

A simpler, reflection-free version of this change was incorporated into apache/spark#10519

Author: Josh Rosen <[email protected]>

Closes #147 from JoshRosen/jdbc-driver-precedence-round-2.
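
For contrast, a sketch of how the reflection-free variant could look inside Spark itself, where `DriverWrapper` can be referenced directly (the import path is an assumption based on Spark's package layout at the time):

```scala
import java.sql.{Driver, DriverManager}
import scala.collection.JavaConverters._
// Assumed location of DriverWrapper inside Spark; not verified against a specific release.
import org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper

// Inside Spark, the wrapper class is visible, so a plain type match replaces the reflection.
def findRegisteredDriver(driverClass: String): Driver =
  DriverManager.getDrivers.asScala.collectFirst {
    case d: DriverWrapper if d.wrapped.getClass.getCanonicalName == driverClass => d
    case d if d.getClass.getCanonicalName == driverClass => d
  }.getOrElse(throw new IllegalStateException(
    s"Did not find registered driver with class $driverClass"))
```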