-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-29808][ML][PYTHON] StopWordsRemover should support multi-cols #26480
Conversation
/** @group setParam */ | ||
@Since("3.0.0") | ||
def setOutputCols(value: Array[String]): this.type = set(outputCols, value) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am debating if I should add stopWordsArray/caseSensitiveArray/localArray
. Seems to me that users will use the same set of stopWords
for all columns, so it's no need to add those.
Test build #113614 has finished for PR 26480 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think multi-column may not be a common use case for StopWordsRemover. I am fine to add it, anyway.
if (isSet(inputCols)) { | ||
require(getInputCols.length == getOutputCols.length, | ||
s"StopWordsRemover $this has mismatched Params " + | ||
s"for multi-column transform. Params (inputCols, outputCols) should have " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: you don't need interpolation on these two lines.
|
||
/** | ||
* A feature transformer that filters out stop words from input. | ||
* | ||
* Since 3.0.0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't feel strongly, but you could remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I accidentally broke the line, but I prefer to have it. When other features added the multi columns support, since xxx
was added to the doc. Just try to be consistent with others.
} | ||
|
||
val (inputColNames, outputColNames) = getInOutCols() | ||
var outputFields = schema.fields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will hardly matter unless the number of cols is large, but is it as easy and a little faster to .map the .zip below to the new output fields, and then append them once to schema.fields?
remover.transform(df) | ||
.select("filtered1", "expected1", "filtered2", "expected2") | ||
.collect().foreach { | ||
case Row(r1: Seq[String], e1: Seq[String], r2: Seq[String], e2: Seq[String]) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small nit: indent this more
Test build #113643 has finished for PR 26480 at commit
|
Merged to master |
Thanks! |
it doesn't work:
|
That is not related. You compiled and ran vs different versions of Spark. |
Maybe I need to try with other versions, currently I'm using: scalaVersion := "2.13.8" Thanks Sean :) |
What changes were proposed in this pull request?
Add multi-cols support in StopWordsRemover
Why are the changes needed?
As a basic Transformer, StopWordsRemover should support multi-cols.
Param stopWords can be applied across all columns.
Does this PR introduce any user-facing change?
StopWordsRemover.setInputCols
StopWordsRemover.setOutputCols
How was this patch tested?
Unit tests