[SPARK-29808][ML][PYTHON] StopWordsRemover should support multi-cols #26480

huaxingao · 2019-11-12T05:53:34Z

What changes were proposed in this pull request?

Add multi-cols support in StopWordsRemover

Why are the changes needed?

As a basic Transformer, StopWordsRemover should support multi-cols.
Param stopWords can be applied across all columns.

Does this PR introduce any user-facing change?

StopWordsRemover.setInputCols
StopWordsRemover.setOutputCols
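
A minimal usage sketch of the two new setters, assuming Spark 3.0.0 or later on the classpath and a local SparkSession. The object name, the helper `removeStopWords`, the column names and the sample rows are all made up for illustration. The default stop-word list is applied to both input columns.

```scala
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.{DataFrame, SparkSession}

object MultiColStopWordsExample {
  // Hypothetical helper: filters two token columns with one transformer.
  def removeStopWords(df: DataFrame): DataFrame = {
    val remover = new StopWordsRemover()
      .setInputCols(Array("raw1", "raw2"))
      .setOutputCols(Array("filtered1", "filtered2"))
    // The same stopWords Param is used for every input column.
    remover.transform(df)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("MultiColStopWordsExample")
      .getOrCreate()
    import spark.implicits._

    // Each column holds an array of tokens, as StopWordsRemover expects.
    val df = Seq(
      (Seq("the", "quick", "fox"), Seq("a", "lazy", "dog"))
    ).toDF("raw1", "raw2")

    removeStopWords(df).show(truncate = false)
    spark.stop()
  }
}
```

The single-column setters (setInputCol, setOutputCol) remain available; the multi-column pair is an alternative to them, not a replacement.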

How was this patch tested?

Unit tests

huaxingao · 2019-11-12T05:58:15Z

mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala

+  /** @group setParam */
+  @Since("3.0.0")
+  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
+


I am debating whether I should add stopWordsArray/caseSensitiveArray/localeArray. It seems to me that users will use the same set of stopWords for all columns, so there is no need to add those.

SparkQA · 2019-11-12T07:24:26Z

Test build #113614 has finished for PR 26480 at commit 0d2f624.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya

I think multi-column may not be a common use case for StopWordsRemover. I am fine with adding it, anyway.

srowen · 2019-11-12T13:59:57Z

mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala

+    if (isSet(inputCols)) {
+      require(getInputCols.length == getOutputCols.length,
+        s"StopWordsRemover $this has mismatched Params " +
+          s"for multi-column transform. Params (inputCols, outputCols) should have " +


Nit: you don't need interpolation on these two lines.

srowen · 2019-11-12T14:00:41Z

mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala


 /**
 * A feature transformer that filters out stop words from input.
 *
+ * Since 3.0.0,


I don't feel strongly, but you could remove this.

Sorry, I accidentally broke the line, but I prefer to keep it. When other features added multi-column support, a "Since x.x.x" note was added to the doc. I am just trying to be consistent with the others.

srowen · 2019-11-12T14:02:54Z

mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala

+    }
+
+    val (inputColNames, outputColNames) = getInOutCols()
+    var outputFields = schema.fields


It will hardly matter unless the number of cols is large, but is it as easy and a little faster to .map the .zip below to the new output fields, and then append them once to schema.fields?
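
The suggestion above can be sketched in plain Scala, without Spark on the classpath. `Field`, `SchemaSketch` and `appendOutputFields` are hypothetical stand-ins for StructField and the schema logic under review, not the code that was merged. The sketch only shows the shape of the idea: build the new fields in one zip/map pass and append them to the existing fields once, rather than growing a `var` inside a loop.

```scala
// Hypothetical stand-in for Spark's StructField, to keep the sketch self-contained.
final case class Field(name: String, dataType: String)

object SchemaSketch {
  // Build every output field in one pass, then append them to the existing fields once.
  def appendOutputFields(
      fields: Array[Field],
      inputColNames: Array[String],
      outputColNames: Array[String]): Array[Field] = {
    require(inputColNames.length == outputColNames.length,
      "inputCols and outputCols must have the same length")
    val byName = fields.map(f => f.name -> f).toMap
    val newFields = inputColNames.zip(outputColNames).map { case (in, out) =>
      require(byName.contains(in), s"Input column $in does not exist")
      require(!byName.contains(out), s"Output column $out already exists")
      // In this sketch the output column simply reuses the input column's type.
      Field(out, byName(in).dataType)
    }
    fields ++ newFields
  }

  def main(args: Array[String]): Unit = {
    val schema = Array(Field("raw1", "array<string>"), Field("raw2", "array<string>"))
    val out = appendOutputFields(schema, Array("raw1", "raw2"), Array("filtered1", "filtered2"))
    println(out.map(_.name).mkString(", ")) // prints: raw1, raw2, filtered1, filtered2
  }
}
```

As the comment notes, the difference is unlikely to be measurable unless the number of columns is large; the main gain is that no mutable variable is needed.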

srowen · 2019-11-12T14:03:15Z

mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala

+    remover.transform(df)
+      .select("filtered1", "expected1", "filtered2", "expected2")
+      .collect().foreach {
+      case Row(r1: Seq[String], e1: Seq[String], r2: Seq[String], e2: Seq[String]) =>


Small nit: indent this more

SparkQA · 2019-11-12T19:40:28Z

Test build #113643 has finished for PR 26480 at commit fb082d7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-11-13T14:18:29Z

Merged to master

huaxingao · 2019-11-13T16:25:55Z

Thanks!

argaytan · 2022-05-03T10:00:54Z

it doesn't work:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/ml/feature/StopWordsRemover
    at com.kyndryl.etl.SparkJob$.main(SparkJob.scala:51)
    at com.kyndryl.etl.SparkJob.main(SparkJob.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.ml.feature.StopWordsRemover
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 2 more

srowen · 2022-05-03T12:06:54Z

That is not related. You compiled against one version of Spark and ran against a different one.

argaytan · 2022-05-03T12:30:37Z

Maybe I need to try other versions. Currently I'm using:

scalaVersion := "2.13.8"
val SparkVersion = "3.2.1"

Thanks Sean :)
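
One possible cause, offered as an assumption rather than a diagnosis: StopWordsRemover ships in the spark-mllib artifact, so a NoClassDefFoundError at run time can mean that this artifact is missing from the runtime classpath, or that it does not match the Spark version used to launch the job. A build.sbt sketch that keeps both modules on the same version, reusing the values quoted above (the dependency list itself is not from the thread):

```scala
// build.sbt (sketch; the dependency list is an assumption, not taken from the thread)
scalaVersion := "2.13.8"

val SparkVersion = "3.2.1"

libraryDependencies ++= Seq(
  // Both modules use the same version so compile-time and run-time classes agree.
  "org.apache.spark" %% "spark-sql"   % SparkVersion,
  "org.apache.spark" %% "spark-mllib" % SparkVersion
)
```

If the job is launched with spark-submit, the installed Spark distribution should match both the Spark version and the Scala binary version declared here.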

huaxingao added 2 commits November 11, 2019 21:37

[SPARK-29808][ML][PYTHON] StopWordsRemover should support multi-cols

ef1021a

minor fix

0d2f624

huaxingao commented Nov 12, 2019

viirya reviewed Nov 12, 2019

srowen reviewed Nov 12, 2019

address comments

fb082d7

srowen approved these changes Nov 13, 2019

srowen closed this in 1f4075d Nov 13, 2019

huaxingao deleted the spark-29808 branch November 13, 2019 16:26

zero323 mentioned this pull request Jan 7, 2020

Sync with changes merged after 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4 zero323/pyspark-stubs#230

Closed

47 tasks
