-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-10542] [PYSPARK] fix serialize namedtuple #8707
Conversation
Test build #42289 has finished for PR 8707 at commit
|
6b9095b
to
1d766aa
Compare
Test build #42293 has finished for PR 8707 at commit
|
Test build #42295 has finished for PR 8707 at commit
|
Test build #42298 has finished for PR 8707 at commit
|
Test build #1739 has finished for PR 8707 at commit
|
Test build #42304 has finished for PR 8707 at commit
|
Over at https://issues.apache.org/jira/browse/SPARK-10544, someone commented to mention that other types of built-in types do not seem to be pickleable in 1.5. For instance, here's the example that they gave: sc.parallelize(["the red", "Fox Runs", "FAST"]).map(str.lower).count() However, this specific example also seems to fail in 1.3.1, so I don't think that this is a regression. Just wanted to mention this discussion here to make sure you were aware of it. |
Do you have any intuition for why this worked prior to 1.5 without the changes implemented here? |
from pyspark.cloudpickle import dumps | ||
P2 = loads(dumps(P)) | ||
p3 = P2(1, 3) | ||
self.assertEqual(p1, p3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, interesting: presumably P
and P2
are different classes but instances created from them are still comparable for equality. Do we also need to check that those instances claim to belong to the same class? It seems way less likely that users could rely on the class comparison behavior, so probably not a huge priority to look at.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These instances should become to difference classes.
Actually, one point of confusion: it looks like |
The HACK in serializers.py is used for cPickler, not cloudpickle. |
Before 1.5, the old way work in CPython, but not PyPy (we don't have a unit test for it). |
@JoshRosen BTW, this patch introduce a special case for namedtuple, it should be safe to merge into branch-1.5. |
Empirically, this seems to work, so unless you think that we should investigate the root cause any further I'm fine with giving this an LGTM and merging to 1.5. Feel free to merge yourself, or I can do it. |
I tried to find the root cause, but it seems hard to work in all Python versions (you can see them in the older commit), finally switch to current approach. merging this into master and 1.5 branch, thanks! |
Author: Davies Liu <[email protected]> Closes #8707 from davies/fix_namedtuple.
FYI, this works for us @ NinthDecimal. Thanks for the fix, it was a stumper! Python 2.7.6 |
Hi! I am not sure if this is related but is I look for this issue everything points me here basically. I'm getting
When trying to create a data frame from an RDD: rdd = self.sc.textFile(self.input_file_path).map(lambda line: self.process_line(line))
schema = StructType([StructField(u'Variable', StringType(), nullable=False),
StructField(u'Time', TimestampType(), nullable=False),
StructField(u'Value', FloatType(), nullable=False)])
return sql_context.createDataFrame(rdd, schema) I am on PySpark 1.6.0 - any ideas what I'm doing wrong here? |
I also get this error when using namedtuples
|
Author: Davies Liu <[email protected]> Closes apache#8707 from davies/fix_namedtuple. (cherry picked from commit d5c0361)
No description provided.