[SPARK-37154][PYTHON] Inline hints for pyspark.rdd #35252
Conversation
@overload
def get(self, key: str) -> Optional[str]:
    ...

@overload
def get(self, key: str, defaultValue: None) -> Optional[str]:
    ...

@overload
def get(self, key: str, defaultValue: str) -> str:
    ...
These are added to clearly indicate which calls can result in None and, in turn, to avoid ignores or casts in rdd.py.
qq: Maybe that is a stupid question, but I'd like to ask if

def get(self, key: str, defaultValue: Optional[str]) -> Optional[str]:

can't cover both

def get(self, key: str, defaultValue: None) -> Optional[str]:

and

def get(self, key: str, defaultValue: str) -> str:

?
It covers both, but doesn't capture the same relationship between the arguments. With (SparkConf, str, str) -> str we know that conf.get("foo", "42") is str, which saves us a cast / ignore / assert not None later. With only (SparkConf, str, Optional[str]) -> Optional[str] we still have to assert that the result is not None.

(There might be another way of capturing this through type parameters, i.e. (SparkConf, str, T) -> T where T is TypeVar("T", None, str).)
def dumps(self, obj):
    """
    Serialize an object into a byte array.
    When batching is used, this will be called with an array of objects.
    """
    raise NotImplementedError
In rdd.py we assume that implementations have a dumps method.
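For example, a minimal concrete serializer satisfying that assumption could look like this (illustrative only; PySpark's real serializers live in pyspark.serializers):

import pickle

class Serializer:
    def dumps(self, obj):
        """Serialize an object into a byte array."""
        raise NotImplementedError

class PickleLikeSerializer(Serializer):
    def dumps(self, obj):
        # Concrete subclasses provide the dumps that rdd.py relies on.
        return pickle.dumps(obj)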
@@ -1421,7 +1422,7 @@ def runJob(
         self,
         rdd: "RDD[T]",
         partitionFunc: Callable[[Iterable[T]], Iterable[U]],
-        partitions: Optional[List[int]] = None,
+        partitions: Optional[Sequence[int]] = None,
We use range in rdd.py, so we need a generic collection type.
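A quick sketch of why List[int] is too narrow here (the function names are illustrative):

from typing import List, Sequence

def takes_list(partitions: List[int]) -> None: ...
def takes_sequence(partitions: Sequence[int]) -> None: ...

takes_sequence(range(4))  # OK: range implements the Sequence protocol
takes_list(range(4))      # mypy error: "range" is not compatible with "List[int]"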
""" | ||
The :class:`SparkContext` that this RDD was created on. | ||
""" | ||
return self.ctx | ||
|
||
def cache(self): | ||
def cache(self: "RDD[T]") -> "RDD[T]": |
Covariant types cannot be used reliably, so I used T wherever an arbitrary RDD is used.
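A minimal sketch of the pattern, with MyRDD standing in for the real class:

from typing import Generic, TypeVar

T = TypeVar("T")

class MyRDD(Generic[T]):
    def cache(self: "MyRDD[T]") -> "MyRDD[T]":
        # Annotating self threads the element type through the call,
        # so cache() on a MyRDD[int] is still a MyRDD[int].
        return self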
@@ -1440,9 +1666,9 @@ def mean(self):
         >>> sc.parallelize([1, 2, 3]).mean()
         2.0
         """
-        return self.stats().mean()
+        return self.stats().mean()  # type: ignore[return-value]
We might revisit StatCounter later (maybe it should be generic, but that is tricky to do right if the input is empty), but for now let's use ignores.
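A rough sketch of why a generic StatCounter would be tricky (GenericStatCounter is hypothetical, not the real class): the running aggregates need a numeric seed, and an empty input leaves nothing to infer the type parameter from.

from typing import Generic, Iterable, TypeVar

N = TypeVar("N", int, float)

class GenericStatCounter(Generic[N]):
    def __init__(self, values: Iterable[N] = ()) -> None:
        self.mu = 0.0  # float seed, already imprecise when N is int

counter = GenericStatCounter()  # mypy: cannot infer N from an empty input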
@overload
def toDF(
    self: "RDD[RowLike]",
    schema: Optional[Union[List[str], Tuple[str, ...]]] = None,
    sampleRatio: Optional[float] = None,
) -> "DataFrame":
    ...

@overload
def toDF(
    self: "RDD[RowLike]", schema: Optional[Union["StructType", str]] = None
) -> "DataFrame":
    ...

@overload
def toDF(
    self: "RDD[AtomicValue]",
    schema: Union["AtomicType", str],
) -> "DataFrame":
    ...

def toDF(
    self: "RDD[Any]", schema: Optional[Any] = None, sampleRatio: Optional[float] = None
) -> "DataFrame":
    raise RuntimeError("""RDD.toDF was called before SparkSession was initialized.""")
I am not very happy about this, but as far as I can tell it is the only way to type check toDF.
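Illustrative calls that the three overloads are meant to distinguish (assuming sc is a live SparkContext and an active SparkSession has replaced this placeholder implementation):

pairs = sc.parallelize([("Alice", 1), ("Bob", 2)])
df1 = pairs.toDF(["name", "value"])  # RDD[RowLike] with column names

atoms = sc.parallelize([1, 2, 3])
df2 = atoms.toDF("int")              # RDD[AtomicValue] with an atomic type string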
Just noticed that this was merged, so I guess we can wait for the mypy upgrade on dev and drop the implementation.
Looks fine otherwise.
I didn't look at this in detail, but it looks fine.
Merged into master. Thanks all!
What changes were proposed in this pull request?
This PR proposes the migration of type hints for pyspark.rdd from a stub file to inline annotations.

Why are the changes needed?
It is a part of the ongoing process of migrating stubs to inline hints.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests + new data tests.