[SPARK-3047] [PySpark] add an option to use str in textFileRDD #1951
Conversation
str is much more efficient than unicode
QA tests have started for PR 1951. This patch merges cleanly.
QA results for PR 1951:
I think there's one more use of UTF8Deserializer, in:

    def loads(self, stream):
        length = read_int(stream)
        return stream.read(length).decode('utf8')
I don't know how well we've stuck to this convention in the existing code, but my original intention was that loads() loaded a single record and load_stream() loaded a stream of records. If you wanted, we could conditionally define loads based on whether we've set use_unicode, which would allow the serializer to be used to deserialize an individual element or a stream.
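A minimal sketch of that idea, assuming the `read_int` framing helper from `pyspark.serializers`; the conditional wiring below is illustrative, not the code from this patch:

```python
from pyspark.serializers import read_int


class UTF8Deserializer(object):
    """Deserialize length-prefixed utf-8 strings from a stream."""

    def __init__(self, use_unicode=False):
        self.use_unicode = use_unicode

    def loads(self, stream):
        # Load a single length-prefixed record, decoding only if requested.
        length = read_int(stream)
        s = stream.read(length)
        return s.decode('utf8') if self.use_unicode else s

    def load_stream(self, stream):
        # Load a stream of records by repeatedly calling loads() until EOF.
        try:
            while True:
                yield self.loads(stream)
        except EOFError:
            return
```

Keeping both methods means the same serializer can hand back either a single element or a lazy stream of them.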
This is a nice performance optimization. Should we document this somewhere? My concern is that users will never find out about it.
@JoshRosen the use cases in worker.py are safe to change from unicode to str, so I did not change them.
QA tests have started for PR 1951 at commit
QA tests have finished for PR 1951 at commit
QA tests have started for PR 1951 at commit
QA tests have finished for PR 1951 at commit
""" | ||
Read a text file from HDFS, a local file system (available on all | ||
nodes), or any Hadoop-supported file system URI, and return it as an | ||
RDD of Strings. | ||
|
||
If use_unicode is False, the strings will be kept as `str` (encoding | ||
as `utf-8`), which is faster and smaller than unicode. (Added in | ||
Spark 1.1) |
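As a quick usage sketch of the option documented above (assuming a live SparkContext `sc`, e.g. in the PySpark shell; the HDFS path is hypothetical):

```python
# Default: each line is decoded into a unicode object.
lines = sc.textFile("hdfs:///data/pagecounts")

# With use_unicode=False, each line stays a utf-8 encoded str,
# which is faster and uses less memory (Python 2 semantics).
raw = sc.textFile("hdfs:///data/pagecounts", use_unicode=False)
```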
Since this didn't make it into 1.1, maybe we should change this to 1.2 (or just drop the version information completely).
Aside from the minor comment about version numbers, this looks good to me. I can see how this could lead to large performance wins for certain jobs (especially when parsing, say, numeric data that's stored in a CSV format).
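To make the CSV point concrete, here is a hypothetical job shape that benefits (the path and column layout are made up; `sc` is assumed to be a live SparkContext):

```python
# With str lines there is no unicode decode before the float() parse,
# so per-line CPU cost drops for numeric data.
points = (sc.textFile("hdfs:///data/points.csv", use_unicode=False)
            .map(lambda line: [float(x) for x in line.split(',')]))
print(points.map(lambda cols: cols[0]).sum())
```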
QA tests have started for PR 1951 at commit
QA tests have finished for PR 1951 at commit
Jenkins, retest this please.
QA tests have started for PR 1951 at commit
QA tests have finished for PR 1951 at commit
QA tests have started for PR 1951 at commit
QA tests have started for PR 1951 at commit
QA tests have finished for PR 1951 at commit
QA tests have finished for PR 1951 at commit
This looks good to me, so I'm going to merge it into master. Thanks!
str is much more efficient than unicode (in both CPU and memory), so it's better to use str in textFileRDD. To keep compatibility, unicode is used by default. (Maybe change it in the future.)
use_unicode=True:

    daviesliu@dm:~/work/spark$ time python wc.py
    (u'./universe/spark/sql/core/target/java/org/apache/spark/sql/execution/ExplainCommand$.java', 7776)

    real    2m8.298s
    user    0m0.185s
    sys     0m0.064s

use_unicode=False:

    daviesliu@dm:~/work/spark$ time python wc.py
    ('./universe/spark/sql/core/target/java/org/apache/spark/sql/execution/ExplainCommand$.java', 7776)

    real    1m26.402s
    user    0m0.182s
    sys     0m0.062s
We can see that this gives roughly a 32% improvement in wall-clock time!
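The wc.py script itself isn't shown in the thread; a word-count-style job of roughly this shape would exercise the same code path (purely illustrative, with a placeholder input path):

```python
from operator import add

from pyspark import SparkContext

sc = SparkContext(appName="wc")

# Word count over raw str lines; use_unicode=False skips the decode step.
counts = (sc.textFile("input.txt", use_unicode=False)  # "input.txt" is a placeholder
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(add))

# Print the most frequent token and its count.
print(counts.takeOrdered(1, key=lambda kv: -kv[1])[0])
```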