[SPARK-7691] [SQL] Refactor CatalystTypeConverter to use type-specific row accessors #6222
Conversation
Merged build triggered.
Merged build started.
Test build #32952 has started for PR 6222 at commit
I noticed that
Force-pushed from 93314c5 to 7d3222c
Merged build triggered.
Merged build started.
Test build #32953 has started for PR 6222 at commit
I should probably re-run the original CatalystTypeConverter benchmarks to check that this hasn't regressed performance.
@JoshRosen, thanks for the heads-up. This refactoring looks good. Grouping the conversions for each type makes it easier to find them. It also sets expectations for any other conversions that might be added. I'll comment specifically on #5713 over there.
Test build #32953 has finished for PR 6222 at commit
Merged build finished. Test FAILed.
Test FAILed.
Based on the last set of test failures, it looks like there's another corner-case where the actual data types don't match the declared ones. JavaHashingTFSuite.hasingTF:
It also looks like there are some cases in HadoopFSRelationTest where we expect to receive a string but receive an Integer instead. I'll take a closer look at these cases to see whether this indicates a bug or whether we should just call toString() on any non-string type that we receive.
I had put this patch on hold while finishing up some 1.4.0 stuff, but I plan to return to this soon. The fact that this is causing lots of test failures suggests that this could break things in bad ways for user code if we're not careful. I think that the main issue is that the old code would implicitly pass through primitives that were of the wrong type and allow implicit numeric conversions to take place. For instance, if I had a table that expected double-valued columns, it would be fine to pass integers since we never did any Scala -> Catalyst conversion for integers and the internal code implicitly handles these conversions somewhere. I'll see if I can come up with a clean way of allowing these implicit numeric conversions to still take place but guarantee that they're performed as part of the inbound Row conversion rather than implicitly in lower-level code.
It looks like we have some undefined behavior that may be very difficult to preserve. If you declare a column to have some primitive type, like float, then proceed to pass in some value of an arbitrary different type, this won't cause any errors as long as you don't perform any operations on that column which rely on it being a particular type. In a nutshell, the schema's type for a column seems to be irrelevant if that column is just passed through unmodified / unaccessed as part of query processing.
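For concreteness, here is a minimal sketch of the kind of mismatch being discussed. It assumes a spark-shell style `sc` and `sqlContext` (Spark 1.x API); the column names and values are invented and are not taken from the failing tests:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructField, StructType}

// "price" is declared as a double-valued column, but the rows actually carry Ints.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("price", DoubleType, nullable = false)))
val rows = sc.parallelize(Seq(Row(1, 10), Row(2, 20)))  // 10 and 20 are Ints, not Doubles

val df = sqlContext.createDataFrame(rows, schema)

// Under the old behavior this often went unnoticed as long as "price" was never
// forced to be a Double (for example, if only "id" is ever read) or the mismatch
// was absorbed by implicit numeric handling in lower-level code. With fail-fast
// inbound conversion, the mismatch surfaces as a ClassCastException instead.
df.select("id").collect()
```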
Force-pushed from 7d3222c to 300723f
Merged build triggered.
Merged build started.
Test build #33457 has started for PR 6222 at commit
I've pushed a change which will cause a ClassCastException to be thrown in these cases. Technically, it looks like our Scaladoc API docs give us the leeway to throw exceptions if the input data doesn't match the expected schema, and there's no mention of implicit casting / widening, so I think it's okay to change this undefined behavior, although it could be frustrating for users.

* :: DeveloperApi ::
* Creates a [[DataFrame]] from an [[RDD]] containing [[Row]]s using the given schema.
* It is important to make sure that the structure of every [[Row]] of the provided RDD matches
* the provided schema. Otherwise, there will be runtime exception.
Test build #33457 has finished for PR 6222 at commit
Merged build finished. Test FAILed.
Test FAILed.
Merged build triggered.
Merged build started.
Test build #33458 has started for PR 6222 at commit
The schema declares an array of booleans, but we passed an array of integers instead.
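To make that mismatch concrete, a hedged sketch (the column name is invented; this is not the actual test code):

```scala
import org.apache.spark.sql.types.{ArrayType, BooleanType, StructField, StructType}

// The declared element type is Boolean...
val declaredSchema = StructType(Seq(
  StructField("flags", ArrayType(BooleanType), nullable = false)))

// ...but the test data supplied Ints, which the declared type does not match.
val matchingValue    = Seq(true, false, true)  // conforms to ArrayType(BooleanType)
val nonMatchingValue = Seq(1, 0, 1)            // what was actually passed; rejected by fail-fast conversion
```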
Force-pushed from ee25e8d to 740341b
Alright, pushed a commit to document the Option-handling semantics as well as to address the method dispatch issues for primitive types. This should be ready for review now. /cc @davies, you might want to look at this also.
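As a rough illustration of the Option-handling semantics being documented, here is a plain-Scala sketch. This reflects my reading of the behavior (Some(v) is converted as v would be, and None becomes a SQL NULL) and is not the actual converter code:

```scala
// Simplified stand-in for the Option unwrapping performed during
// Scala -> Catalyst conversion (assumed behavior, not Spark's real implementation).
def unwrapOption(value: Any): Any = value match {
  case Some(v) => v    // converted as if the caller had passed v directly
  case None    => null // treated as a SQL NULL
  case other   => other
}

assert(unwrapOption(Some(42)) == 42)
assert(unwrapOption(None) == null)
assert(unwrapOption("plain") == "plain")
```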
Merged build triggered.
Merged build started.
Test build #33868 has started for PR 6222 at commit
Test build #33868 has finished for PR 6222 at commit
Merged build finished. Test FAILed.
Jenkins, retest this please.
Merged build triggered.
Merged build started.
Test build #33902 has started for PR 6222 at commit
Test build #33902 has finished for PR 6222 at commit
Merged build finished. Test PASSed.
LGTM, thanks!
[SPARK-7691] [SQL] Refactor CatalystTypeConverter to use type-specific row accessors

This patch significantly refactors CatalystTypeConverters to both clean up the code and enable these conversions to work with future Project Tungsten features.

At a high level, I've reorganized the code so that all functions dealing with the same type are grouped together into type-specific subclasses of `CatalystTypeConverter`. In addition, I've added new methods that allow the Catalyst Row -> Scala Row conversions to access the Catalyst row's fields through type-specific `getTYPE()` methods rather than the generic `get()` / `Row.apply` methods. This refactoring is a blocker to being able to unit test new operators that I'm developing as part of Project Tungsten, since those operators may output `UnsafeRow` instances which don't support the generic `get()`.

The stricter usage of types here has uncovered some bugs in other parts of Spark SQL:

- apache#6217: DescribeCommand is assigned wrong output attributes in SparkStrategies
- apache#6218: DataFrame.describe() should cast all aggregates to String
- apache#6400: Use output schema, not relation schema, for data source input conversion

Spark SQL currently has undefined behavior for what happens when you try to create a DataFrame from user-specified rows whose values don't match the declared schema. According to the `createDataFrame()` Scaladoc:

> It is important to make sure that the structure of every [[Row]] of the provided RDD matches the provided schema. Otherwise, there will be runtime exception.

Given this, it sounds like it's technically not a break of our API contract to fail fast when the data types don't match. However, there appear to be many cases where we don't fail even though the types don't match. For example, `JavaHashingTFSuite.hasingTF` passes a column of integer values for a "label" column which is supposed to contain floats. This column isn't actually read or modified as part of query processing, so its actual concrete type doesn't seem to matter. In other cases, there could be situations where we have generic numeric aggregates that tolerate being called with different numeric types than the schema specified, which can be okay due to numeric conversions.

In the long run, we will probably want to come up with precise semantics for implicit type conversions / widening when converting Java / Scala rows to Catalyst rows. Until then, though, I think that failing fast with a ClassCastException is a reasonable behavior; this is the approach taken in this patch. Note that certain optimizations in the inbound conversion functions for primitive types mean that we'll probably preserve the old undefined behavior in a majority of cases.

Author: Josh Rosen <[email protected]>

Closes apache#6222 from JoshRosen/catalyst-converters-refactoring and squashes the following commits:

740341b [Josh Rosen] Optimize method dispatch for primitive type conversions
befc613 [Josh Rosen] Add tests to document Option-handling behavior.
5989593 [Josh Rosen] Use new SparkFunSuite base in CatalystTypeConvertersSuite
6edf7f8 [Josh Rosen] Re-add convertToScala(), since a Hive test still needs it
3f7b2d8 [Josh Rosen] Initialize converters lazily so that the attributes are resolved first
6ad0ebb [Josh Rosen] Fix JavaHashingTFSuite ClassCastException
677ff27 [Josh Rosen] Fix null handling bug; add tests.
8033d4c [Josh Rosen] Fix serialization error in UserDefinedGenerator.
85bba9d [Josh Rosen] Fix wrong input data in InMemoryColumnarQuerySuite
9c0e4e1 [Josh Rosen] Remove last use of convertToScala().
ae3278d [Josh Rosen] Throw ClassCastException errors during inbound conversions.
7ca7fcb [Josh Rosen] Comments and cleanup
1e87a45 [Josh Rosen] WIP refactoring of CatalystTypeConverters
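As a rough, hypothetical sketch of the structure described above (the names `ExampleConverter`, `InternalRowLike`, and `ExampleIntConverter` are invented; this is not the real CatalystTypeConverters code):

```scala
// Stand-in for a Catalyst row; UnsafeRow-style rows may only support the
// type-specific getters, not a generic get()/apply().
trait InternalRowLike {
  def getInt(ordinal: Int): Int
  def getDouble(ordinal: Int): Double
  def isNullAt(ordinal: Int): Boolean
}

// Each type gets its own converter; both conversion directions live together.
abstract class ExampleConverter[ScalaType, CatalystType] {
  def toCatalyst(scalaValue: Any): CatalystType
  def toScala(row: InternalRowLike, ordinal: Int): ScalaType
}

object ExampleIntConverter extends ExampleConverter[Int, Int] {
  override def toCatalyst(scalaValue: Any): Int = scalaValue match {
    case i: Int => i
    // Fail fast instead of silently passing a mismatched value through.
    case other => throw new ClassCastException(
      s"expected Int but got ${other.getClass.getName}")
  }

  override def toScala(row: InternalRowLike, ordinal: Int): Int =
    row.getInt(ordinal) // type-specific accessor rather than a generic get()
}
```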