Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-29808][ML][PYTHON] StopWordsRemover should support multi-cols #26480

Closed
wants to merge 3 commits into from

Conversation

huaxingao
Copy link
Contributor

What changes were proposed in this pull request?

Add multi-cols support in StopWordsRemover

Why are the changes needed?

As a basic Transformer, StopWordsRemover should support multi-cols.
Param stopWords can be applied across all columns.

Does this PR introduce any user-facing change?

StopWordsRemover.setInputCols
StopWordsRemover.setOutputCols

How was this patch tested?

Unit tests

/** @group setParam */
@Since("3.0.0")
def setOutputCols(value: Array[String]): this.type = set(outputCols, value)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am debating if I should add stopWordsArray/caseSensitiveArray/localArray. Seems to me that users will use the same set of stopWords for all columns, so it's no need to add those.

@SparkQA
Copy link

SparkQA commented Nov 12, 2019

Test build #113614 has finished for PR 26480 at commit 0d2f624.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think multi-column may not be a common use case for StopWordsRemover. I am fine to add it, anyway.

if (isSet(inputCols)) {
require(getInputCols.length == getOutputCols.length,
s"StopWordsRemover $this has mismatched Params " +
s"for multi-column transform. Params (inputCols, outputCols) should have " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: you don't need interpolation on these two lines.


/**
* A feature transformer that filters out stop words from input.
*
* Since 3.0.0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel strongly, but you could remove this.

Copy link
Contributor Author

@huaxingao huaxingao Nov 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I accidentally broke the line, but I prefer to have it. When other features added the multi columns support, since xxx was added to the doc. Just try to be consistent with others.

}

val (inputColNames, outputColNames) = getInOutCols()
var outputFields = schema.fields
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will hardly matter unless the number of cols is large, but is it as easy and a little faster to .map the .zip below to the new output fields, and then append them once to schema.fields?

remover.transform(df)
.select("filtered1", "expected1", "filtered2", "expected2")
.collect().foreach {
case Row(r1: Seq[String], e1: Seq[String], r2: Seq[String], e2: Seq[String]) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit: indent this more

@SparkQA
Copy link

SparkQA commented Nov 12, 2019

Test build #113643 has finished for PR 26480 at commit fb082d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Nov 13, 2019

Merged to master

@srowen srowen closed this in 1f4075d Nov 13, 2019
@huaxingao
Copy link
Contributor Author

Thanks!

@argaytan
Copy link

argaytan commented May 3, 2022

it doesn't work:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/ml/feature/StopWordsRemover at com.kyndryl.etl.SparkJob$.main(SparkJob.scala:51) at com.kyndryl.etl.SparkJob.main(SparkJob.scala) Caused by: java.lang.ClassNotFoundException: org.apache.spark.ml.feature.StopWordsRemover at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:418) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355) at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ... 2 more

@srowen
Copy link
Member

srowen commented May 3, 2022

That is not related. You compiled and ran vs different versions of Spark.

@argaytan
Copy link

argaytan commented May 3, 2022

Maybe I need to try with other versions, currently I'm using:

scalaVersion := "2.13.8"
val SparkVersion = "3.2.1"

Thanks Sean :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants