[SPARK-18560][CORE][STREAMING][WIP] Receiver data can not be deserialized properly. #15992
Conversation
cc @JoshRosen
cc @zsxwing and @JoshRosen
@@ -83,7 +84,12 @@ import org.apache.spark.storage.StorageLevel
  * }}}
  */
 @DeveloperApi
-abstract class Receiver[T](val storageLevel: StorageLevel) extends Serializable {
+abstract class Receiver[T](val storageLevel: StorageLevel)(implicit t: ClassTag[T])
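The intent of the diff above can be illustrated with a minimal, self-contained sketch (this is not Spark's actual Receiver API): an implicit `ClassTag` parameter captures the element type at construction time, so the runtime class survives JVM type erasure and can later drive serializer selection.

```scala
import scala.reflect.ClassTag

// Minimal sketch, not Spark's actual Receiver API: the implicit ClassTag
// captures the element type T at construction time, so the runtime class
// survives JVM type erasure and a serializer can be chosen based on it.
abstract class SketchReceiver[T](implicit val elementTag: ClassTag[T]) extends Serializable {
  def elementClass: Class[_] = elementTag.runtimeClass
}

// A concrete receiver for byte arrays, matching the reporter's use case.
class ByteArrayReceiver extends SketchReceiver[Array[Byte]]

object SketchReceiverDemo {
  def main(args: Array[String]): Unit = {
    val r = new ByteArrayReceiver
    // The element type's runtime class is preserved despite erasure.
    println(r.elementClass == classOf[Array[Byte]])
  }
}
```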
I'm wondering whether this will change the signature of Receiver and break existing user code?
As far as I know, this change will indeed break custom receiver implementations in Java, as the update to JavaCustomReceiver.java shows. That said, I did not encounter any problems in my local tests.
I see. In that case we should think carefully about this change, and at least keep it compatible with existing code.
It is really tricky to add ClassTag in so many places to support the automatically-pick-best-serializer feature.
I'm wondering if we could treat Spark Streaming as a special case and turn off the type-based auto serializer choosing mechanism. The current fix introduces lots of changes in the streaming code and may potentially break users' existing code. Just my two cents.
@jerryshao yep, the simplest way is to turn it off. Generally, this is a follow-up patch for #11755. I tend to support the automatically-pick-best-serializer feature here to get better performance, provided there is no big damage to existing code; sooner rather than later.
@JoshRosen and @zsxwing waiting for your feedback.
Will no one review this PR? @srowen could you please ask someone to review this? Thanks!
I am not familiar enough with this code to review it. I do think @JoshRosen is the right person given https://issues.apache.org/jira/browse/SPARK-13990, and I believe he's said he will start reviewing again this week after the holiday.
Sorry for the delay. My high-level comment is that this is a critical bug and we also need to fix 2.0.x. However, we usually don't want to break APIs between maintenance releases. So I'm thinking the first step we should take is adding a configuration at https://github.com/apache/spark/pull/11755/files#diff-82b3c6a9530bd021cfb3d3f15056be41R49 and disabling it for Streaming.
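The configuration-based approach suggested here can be sketched as follows. This is a minimal sketch under assumptions: the method and flag names are illustrative, not Spark's actual internals. With auto-pick disabled, both writer and reader always fall back to the Java serializer, so they can never disagree.

```scala
import scala.reflect.ClassTag

// Illustrative sketch, not Spark's real SerializerManager: when the
// auto-pick flag is disabled, always return the Java serializer regardless
// of the element type, so the writing and reading sides stay consistent.
object SerializerChoiceSketch {
  def pickSerializer(ct: ClassTag[_], autoPickEnabled: Boolean): String = {
    // Kryo is known-safe for primitive arrays and strings (simplified here).
    val kryoFriendly =
      ct.runtimeClass == classOf[Array[Byte]] || ct.runtimeClass == classOf[String]
    if (autoPickEnabled && kryoFriendly) "kryo" else "java"
  }
}
```

With the flag off, Streaming would always use the Java serializer on both sides, restoring the 1.6.x behavior at the cost of Kryo's performance for primitive-array data.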
@zsxwing Maybe; it is indeed a well-advised solution. I will first open another PR to add a configuration for streaming. IMHO, I will keep this PR as WIP for further discussion and optimization. What is your opinion?
…rk Streaming

## What changes were proposed in this pull request?

#15992 provided a solution to fix the bug, i.e. **receiver data can not be deserialized properly**. As zsxwing said, it is a critical bug, but we should not break APIs between maintenance releases. It may be a rational first step to disable auto-picking the Kryo serializer for Spark Streaming. I will continue #15992 to optimize the solution.

## How was this patch tested?

Existing unit tests.

Author: uncleGen <[email protected]>

Closes #16052 from uncleGen/SPARK-18617.

(cherry picked from commit 56c82ed)
Signed-off-by: Reynold Xin <[email protected]>
I think this PR has one issue. There are two places requiring ClassTag. The first one is creating
What changes were proposed in this pull request?
My Spark Streaming job runs correctly on Spark 1.6.1, but it does not run properly on the Spark master branch, failing with the following exception:
Digging into the relevant implementation, I found that the type of the data received by a Receiver is erased. In Spark 2.x, the framework chooses an appropriate serializer, either JavaSerializer or KryoSerializer, based on the type of the data.
On the Receiver side, the type of the data is erased to Object, so the framework chooses JavaSerializer, with the following code:
On the task side, we can get the correct data type, and the framework chooses KryoSerializer if possible, for the following supported types:
In my case, the type of the data is a byte array (`Array[Byte]`).
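The writer/reader mismatch described above can be sketched in a few lines. This is an illustrative model, not Spark's actual internals: the receiver side only sees `ClassTag.AnyRef` because the element type was erased, so it serializes with the Java serializer; the task side still has the concrete `ClassTag[Array[Byte]]` and picks Kryo, so the Java-serialized bytes cannot be read back.

```scala
import scala.reflect.ClassTag

// Illustrative model of the bug, not Spark's real serializer selection.
object MismatchSketch {
  // Simplified auto-pick rule: Kryo for byte arrays and strings, Java otherwise.
  def chooseSerializer(ct: ClassTag[_]): String =
    if (ct.runtimeClass == classOf[Array[Byte]] || ct.runtimeClass == classOf[String]) "kryo"
    else "java"

  // Receiver (writer) side: the element type was erased to Object.
  val writerSide: String = chooseSerializer(ClassTag.AnyRef)
  // Task (reader) side: the concrete element type is known.
  val readerSide: String = chooseSerializer(ClassTag(classOf[Array[Byte]]))
}
```

Because `writerSide` resolves to the Java serializer while `readerSide` resolves to Kryo, deserialization on the task side fails.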
This problem stems from SPARK-13990, a patch that has Spark automatically pick the "best" serializer when caching RDDs.
How was this patch tested?
Updated an existing unit test.