[SPARK-43046] [SS] [Connect] Implemented Python API dropDuplicatesWithinWatermark for Spark Connect #40834
Conversation
@@ -744,6 +746,39 @@ class SparkConnectPlanner(val session: SparkSession) {
    }
  }

  private def transformDeduplicateWithinWatermark(
Can we reuse transformDeduplicate()? We can pass an 'isWithinWatermark' flag to it. This code looks exactly the same as that.
Sure, will update
@@ -363,6 +364,23 @@ message Deduplicate {
  optional bool all_columns_as_keys = 3;
}

// Relation of type [[DeduplicateWithinWatermark]] which have duplicate rows removed within the time
// range of watermark, could consider either only the subset of columns or all the columns.
message DeduplicateWithinWatermark {
Optional: We can just reuse Deduplicate message inside here :).
What I can think of is to update to something like below:
message DeduplicateWithinWatermark {
// (Required) Reuse the Deduplicate message for a DeduplicateWithinWatermark.
Deduplicate deduplicate = 1;
}
But if we do that, DeduplicateWithinWatermark.input would change to DeduplicateWithinWatermark.deduplicate.input. Is there a way to avoid that? If not, I think it's probably better to keep what it is today.
Yeah, this option does not look good. We could add a flag 'within_watermark' to Deduplicate message. That way we can reuse the code both on the client and server side.
Makes sense, updated.
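The agreed-upon design (reuse the existing Deduplicate message with a within_watermark flag, so both client and server share one code path) can be sketched in pure Python. The class and function names below are illustrative stand-ins, not the actual Spark Connect implementation:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Deduplicate:
    # Stand-in for the Deduplicate proto message; within_watermark is the
    # new flag that selects watermark-bounded deduplication.
    column_names: List[str] = field(default_factory=list)
    all_columns_as_keys: bool = False
    within_watermark: bool = False


def transform_deduplicate(rel: Deduplicate) -> str:
    # Stand-in for the server-side planner dispatch: a single handler
    # covers both relations, branching only on the flag (mirroring the
    # suggested Scala change to transformDeduplicate).
    if rel.within_watermark:
        return f"DeduplicateWithinWatermark(cols={rel.column_names})"
    return f"Deduplicate(cols={rel.column_names})"
```

One message plus one flag avoids the field-renaming problem discussed above (no `deduplicate.input` nesting), at the cost of the flag being meaningless for batch plans.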
@rangadi a general question: when I use
No need. That is fine. It is a known problem.
Looks great. One more tweak suggested.
LGTM after that.
@@ -68,6 +68,7 @@ message Relation {
  WithWatermark with_watermark = 33;
  ApplyInPandasWithState apply_in_pandas_with_state = 34;
  HtmlString html_string = 35;
  Deduplicate deduplicate_within_watermark = 36;
This is not required, right? The flag in Deduplicate will indicate this.
@@ -90,6 +90,8 @@ class SparkConnectPlanner(val session: SparkSession) {
  case proto.Relation.RelTypeCase.TAIL => transformTail(rel.getTail)
  case proto.Relation.RelTypeCase.JOIN => transformJoin(rel.getJoin)
  case proto.Relation.RelTypeCase.DEDUPLICATE => transformDeduplicate(rel.getDeduplicate)
  case proto.Relation.RelTypeCase.DEDUPLICATE_WITHIN_WATERMARK =>
Connected to the proto comment. We don't need this field.
python/pyspark/sql/connect/plan.py
Outdated
@@ -623,6 +623,30 @@ def plan(self, session: "SparkConnectClient") -> proto.Relation:
    return plan


class DeduplicateWithinWatermark(LogicalPlan):
We don't need this anymore, right? We can add the flag to Deduplicate above to match the rest of the PR.
yeah I noticed that as well, doing the change now
updated
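On the client side, folding the separate DeduplicateWithinWatermark plan into the existing Deduplicate plan might look roughly like the sketch below. This is a hypothetical simplification: the real class in pyspark/sql/connect/plan.py builds a protobuf Relation, which is replaced here by a plain dict so the shape is easy to see:

```python
from typing import List, Optional


class DeduplicatePlan:
    """Simplified stand-in for the connect client's Deduplicate logical plan.

    One plan class serializes both the plain and the within-watermark
    variants; only the flag differs in the emitted relation.
    """

    def __init__(
        self,
        child: Optional[dict],
        column_names: List[str],
        within_watermark: bool = False,
    ) -> None:
        self.child = child
        self.column_names = column_names
        self.within_watermark = within_watermark

    def plan(self) -> dict:
        # Stand-in for populating the proto.Deduplicate relation.
        return {
            "deduplicate": {
                "input": self.child,
                "column_names": self.column_names,
                "within_watermark": self.within_watermark,
            }
        }
```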
—— SHIP IT ——
,:',:`,:'
__||_||_||_||__
____["""""""""""""""]____
\ " '''''''''''''''''''' |
~^~~^~^~^^~^~^~^~^~^~^~^~~^~^~^^~~^~^
@@ -750,7 +751,8 @@ class SparkConnectPlanner(val session: SparkSession) {
    }
    cols
  }
  Deduplicate(groupCols, queryExecution.analyzed)
  if (rel.getWithinWatermark) DeduplicateWithinWatermark(groupCols, queryExecution.analyzed)
Actually, should we have a dedicated protobuf message for DeduplicateWithinWatermark? They don't seem to share the same type of logical plan either. Do you have any preference on this, @grundprinzip and @amaliujia?
Okay, just noticed that it was discussed at #40834 (comment). I will go ahead and merge if you guys don't have preference. I don't feel strongly about this either.
also @hvanhovell
@HyukjinKwon thanks. This is much simpler code-wise. A 1:1 mapping to logical plans is not strictly required, I hope.
If there is anything wrong, I think deprecating a field is easier than deprecating a new relation type. Probably starting from this by adding a new flag is a good beginning.
Merged to master.
@bogao007 what's your JIRA id? I need to assign you in the JIRA ticket.
I think this might be my JIRA id
What changes were proposed in this pull request?
Implemented the dropDuplicatesWithinWatermark Python API for Spark Connect. This change is based on a previous commit that introduced the dropDuplicatesWithinWatermark API in Spark.
Why are the changes needed?
We recently introduced the dropDuplicatesWithinWatermark API in Spark (commit link). We want to bring parity to Spark Connect.
Does this PR introduce any user-facing change?
Yes, this introduces a new public API, dropDuplicatesWithinWatermark, in Spark Connect.
How was this patch tested?
Added new test cases in test suites.
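Independent of this PR's plumbing, the semantics the new API provides can be illustrated with a simplified pure-Python model (not Spark code): a duplicate key is dropped only while the key's state is still within the watermark window; once the watermark passes, the key's state is evicted and the key can be emitted again. The eviction rule below is a deliberately simplified assumption for illustration:

```python
from typing import Iterable, List, Tuple


def dedup_within_watermark(
    events: Iterable[Tuple[str, int]], delay: int
) -> List[Tuple[str, int]]:
    """events: (key, event_time) pairs in arrival order.

    delay: watermark delay. The watermark is the max event time seen
    so far minus the delay; state for a key is evicted once its stored
    event time falls behind the watermark.
    """
    seen: dict = {}            # key -> event time when first seen
    max_time = float("-inf")   # highest event time observed so far
    out: List[Tuple[str, int]] = []
    for key, t in events:
        max_time = max(max_time, t)
        watermark = max_time - delay
        # Evict per-key state that has fallen behind the watermark.
        seen = {k: ts for k, ts in seen.items() if ts >= watermark}
        if key not in seen:
            seen[key] = t
            out.append((key, t))   # first sighting within the window
    return out
```

For example, with a delay of 5, a duplicate of "a" arriving shortly after the original is dropped, but another "a" arriving long after the watermark has passed is kept as a fresh row.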