[SPARK-22025][PySpark] Speeding up fromInternal for StructField #19246
@@ -410,6 +410,24 @@ def __init__(self, name, dataType, nullable=True, metadata=None):
         self.dataType = dataType
         self.nullable = nullable
         self.metadata = metadata or {}
+        self.needConversion = dataType.needConversion
+        self.toInternal = dataType.toInternal
+        self.fromInternal = dataType.fromInternal
+
+    def __getstate__(self):
+        """Return state values to be pickled."""
+        return (self.name, self.dataType, self.nullable, self.metadata)
+
+    def __setstate__(self, state):
+        """Restore state from the unpickled state values."""
+        name, dataType, nullable, metadata = state
+        self.name = name
+        self.dataType = dataType
+        self.nullable = nullable
+        self.metadata = metadata
+        self.needConversion = dataType.needConversion
My only main concern is that it replaces the reference of the bound method from

I just ran the Python profile on top of the current master with this patch:

Before

After

It looks like the improvement is not very significant.

WDYT @ueshin?

What's the difference between your benchmark and @maver1ck's? Why are the improvements so different?

At the current master, 718bbc9

Before
df = spark.range(10000000).selectExpr("id as id0", "id as id1", "id as id2", "id as id3", "id as id4", "id as id5", "id as id6", "id as id7", "id as id8", "id as id9", "struct(id) as s").cache()
df.count()
df.rdd.map(lambda x: x).count()
sc.show_profiles()
After
df = spark.range(10000000).selectExpr("id as id0", "id as id1", "id as id2", "id as id3", "id as id4", "id as id5", "id as id6", "id as id7", "id as id8", "id as id9", "struct(id) as s").cache()
df.count()
df.rdd.map(lambda x: x).count()
sc.show_profiles()
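The question of how much the patch saves comes down to one Python call frame per field per row. The following is a minimal sketch of that difference, not the benchmark used in this thread: `LongType`, `DelegatingField`, and `CachingField` are hypothetical stand-ins for the PySpark classes, and the timings only isolate call overhead, so they say nothing about end-to-end job time.

```python
import timeit


class LongType:
    """Hypothetical stand-in for a PySpark data type."""

    def fromInternal(self, obj):
        return obj


class DelegatingField:
    """Forwards each call through a wrapper method (the pre-patch shape)."""

    def __init__(self, dataType):
        self.dataType = dataType

    def fromInternal(self, obj):
        # Extra Python frame plus an attribute lookup on every call.
        return self.dataType.fromInternal(obj)


class CachingField:
    """Stores the data type's bound method directly (the patched shape)."""

    def __init__(self, dataType):
        self.dataType = dataType
        # The instance attribute is the data type's bound method itself.
        self.fromInternal = dataType.fromInternal


delegating = DelegatingField(LongType())
caching = CachingField(LongType())

# Both shapes return the same value; only the call path differs.
assert delegating.fromInternal(42) == caching.fromInternal(42) == 42

n = 200000
t_delegating = timeit.timeit(lambda: delegating.fromInternal(42), number=n)
t_caching = timeit.timeit(lambda: caching.fromInternal(42), number=n)
print("delegating: %.3fs  caching: %.3fs" % (t_delegating, t_caching))
```

How visible this saving is in a real profile depends on how much of the total time is spent in these calls, which may be part of why the two benchmarks in this thread disagree.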
+        self.toInternal = dataType.toInternal
+        self.fromInternal = dataType.fromInternal
 
     def simpleString(self):
         return '%s:%s' % (self.name, self.dataType.simpleString())

@@ -431,15 +449,6 @@ def fromJson(cls, json):
                            json["nullable"],
                            json["metadata"])
 
-    def needConversion(self):
-        return self.dataType.needConversion()
-
-    def toInternal(self, obj):
-        return self.dataType.toInternal(obj)
-
-    def fromInternal(self, obj):
-        return self.dataType.fromInternal(obj)
-
     def typeName(self):
         raise TypeError(
             "StructField does not have typeName. "
We need to handle pickling ourselves because we now have fields with function values.
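A minimal sketch of the pattern this comment describes, using hypothetical `LongType` and `Field` classes in place of the real PySpark ones: the cached bound methods are left out of the pickled state and re-derived from `dataType` on unpickling. On Python 2, which PySpark still supported at the time, bound methods could not be pickled by default, so the instance dictionary could not be serialized as-is. The `_bind` helper is an assumption added here to avoid repeating the three assignments; the patch itself repeats them inline.

```python
import pickle


class LongType:
    """Hypothetical stand-in for a PySpark data type."""

    def needConversion(self):
        return False

    def toInternal(self, obj):
        return obj

    def fromInternal(self, obj):
        return obj


class Field:
    """Simplified field that caches its data type's bound methods."""

    def __init__(self, name, dataType, nullable=True, metadata=None):
        self.name = name
        self.dataType = dataType
        self.nullable = nullable
        self.metadata = metadata or {}
        self._bind()

    def _bind(self):
        # Hypothetical helper: cache the bound methods as instance attributes.
        self.needConversion = self.dataType.needConversion
        self.toInternal = self.dataType.toInternal
        self.fromInternal = self.dataType.fromInternal

    def __getstate__(self):
        # Pickle only plain data; the cached methods are deliberately omitted.
        return (self.name, self.dataType, self.nullable, self.metadata)

    def __setstate__(self, state):
        self.name, self.dataType, self.nullable, self.metadata = state
        # Rebuild the cached methods against the restored data type.
        self._bind()


field = Field("id", LongType())
restored = pickle.loads(pickle.dumps(field))
print(restored.name, restored.fromInternal(5))
```

After the round trip, `restored.fromInternal` is bound to the restored `dataType`, not to the original object's data type.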