[SPARK-3614][MLLIB] Add minimumOccurence filtering to IDF #2494

rnowling · 2014-09-22T21:04:39Z

This PR for SPARK-3614 adds functionality for filtering out terms which do not appear in at least a minimum number of documents.

This is implemented using a minimumOccurence parameter (default 0). When terms' document frequencies are less than minimumOccurence, their IDFs are set to 0, just like when the DF is 0. As a result, the TF-IDFs for the terms are found to be 0, as if the terms were not present in the documents.
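As a rough illustration of the filtering rule described above, here is a minimal sketch that uses plain Scala arrays instead of MLlib's vector types. The object name `IdfFilterSketch` and the helper `tfIdf` are hypothetical and are not part of the patch; only the smoothed logarithm and the threshold comparison mirror the diff under review.

```scala
// Sketch only: plain arrays stand in for MLlib vectors; names are hypothetical.
object IdfFilterSketch {
  /** Compute IDF values from per-term document frequencies.
    *
    * @param df               document frequency of each term
    * @param m                total number of documents
    * @param minimumOccurence terms seen in fewer documents than this get IDF 0
    */
  def idf(df: Array[Long], m: Long, minimumOccurence: Long): Array[Double] =
    df.map { d =>
      if (d >= minimumOccurence) math.log((m + 1.0) / (d + 1.0)) else 0.0
    }

  /** Scale a dense term-frequency vector element-wise by the IDF values. */
  def tfIdf(tf: Array[Double], inv: Array[Double]): Array[Double] =
    tf.zip(inv).map { case (t, i) => t * i }
}
```

With 4 documents and a threshold of 2, a term that appears in only 1 document gets an IDF of 0, so its TF-IDF is 0 regardless of how large its term frequency is.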

This PR makes the following changes:

- Add a minimumOccurence parameter to the IDF and DocumentFrequencyAggregator classes.
- Create a parameter-less constructor for IDF with a default minimumOccurence value of 0 to maintain backwards compatibility with the original IDF API.
- Set the IDFs to 0 for terms whose DFs are less than minimumOccurence.
- Add tests to the Scala IDFSuite and Java JavaTfIdfSuite test suites.
- Update the MLlib Feature Extraction programming guide to describe the new feature.

Ishiihara · 2014-09-22T21:14:41Z

mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala

+         */
+        if(df(j) >= minimumOccurence) {
+          inv(j) = math.log((m + 1.0)/ (df(j) + 1.0))
+        } else {


The else branch is not needed here.

Yes, the branch is needed. I don't modify the df vector -- I perform the filtering in idf().

When you allocate your inv array, all values are 0 by default. You do not need to set them to 0 again.

Ah, I see. I didn't consider that. I'll remove the else and add a comment to clarify the default behavior. Thanks.
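The point about default values can be checked with a small sketch. The two hypothetical helpers below (`withElse` and `withoutElse`, neither taken from the patch) differ only in the redundant branch and should produce identical arrays, because a freshly allocated `Array[Double]` on the JVM is zero-filled.

```scala
// Sketch only: both helpers are hypothetical and exist to compare the two styles.
object ZeroInitSketch {
  /** Version with an explicit else branch that re-assigns 0.0. */
  def withElse(df: Array[Long], m: Long, min: Long): Array[Double] = {
    val inv = new Array[Double](df.length)
    for (j <- df.indices) {
      if (df(j) >= min) inv(j) = math.log((m + 1.0) / (df(j) + 1.0))
      else inv(j) = 0.0 // redundant: the slot already holds 0.0
    }
    inv
  }

  /** Version that relies on the zero-filled allocation. */
  def withoutElse(df: Array[Long], m: Long, min: Long): Array[Double] = {
    val inv = new Array[Double](df.length) // every element starts at 0.0
    for (j <- df.indices) {
      if (df(j) >= min) inv(j) = math.log((m + 1.0) / (df(j) + 1.0))
    }
    inv
  }
}
```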

Ishiihara · 2014-09-22T21:27:31Z

One question: with this parameter set, it also filters out words that are very important to some documents. Say some word occurs many times in 1 or 2 documents; queries with that word should return these 2 documents. In other words, this approach may filter out words with high tf*idf values. How do you handle this case? Also, even with low-DF terms skipped, the output still takes the same space. Any further thoughts on reducing the space?

rnowling · 2014-09-22T21:53:12Z

@Ishiihara If you look at the original JIRA, this was the functionality requested by the user. For the case you mention (high TF in a couple of documents), you would want to handle that separately in the transform() function where you could consider both the IDF and TF values.

As for space, it could be beneficial to create sparser vectors as a result of the filtering. However, I chose not to make that change since it may cause problems for some users, who would expect the resulting TF-IDF vectors to have the same values as the sparse or dense TF vectors. The way I've implemented the changes minimizes the overall effect on the user. I believe a separate PR should be created for considering space optimizations if they are going to change the API.
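To make the space trade-off concrete, here is a hedged sketch contrasting the two options discussed in this exchange. The `Sparse` case class and both helpers are hypothetical stand-ins, not MLlib's `SparseVector` API: `scale` keeps every stored entry (some become 0.0), while `scaleAndCompact` shows the alternative that was deliberately left out of this PR.

```scala
// Sketch only: a toy sparse vector as parallel arrays; all names are hypothetical.
object SparseCompactSketch {
  final case class Sparse(size: Int, indices: Array[Int], values: Array[Double])

  /** Keep the stored entries as they are and only scale their values. */
  def scale(tf: Sparse, idf: Array[Double]): Sparse =
    tf.copy(values = tf.indices.zip(tf.values).map { case (i, v) => v * idf(i) })

  /** Alternative: also drop entries whose IDF was filtered to 0.0. */
  def scaleAndCompact(tf: Sparse, idf: Array[Double]): Sparse = {
    val kept = tf.indices.zip(tf.values).collect {
      case (i, v) if idf(i) != 0.0 => (i, v * idf(i))
    }
    Sparse(tf.size, kept.map(_._1), kept.map(_._2))
  }
}
```

The compacted form stores fewer entries, but its index array no longer lines up with the input TF vector, which is the kind of visible change the reply above wants to avoid.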

rnowling · 2014-09-22T23:48:31Z

I removed an unnecessary else block pointed out by @Ishiihara and added a comment for clarification. Thanks @Ishiihara !

SparkQA · 2014-09-22T23:54:21Z

QA tests have started for PR 2494 at commit a200bab.

This patch merges cleanly.

SparkQA · 2014-09-22T23:55:20Z

QA tests have finished for PR 2494 at commit a200bab.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class IDF(minimumOccurence: Long)
- class DocumentFrequencyAggregator(minimumOccurence: Long) extends Serializable

SparkQA · 2014-09-22T23:55:21Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20671/

Ishiihara · 2014-09-22T23:58:21Z

@rnowling Please run sbt/sbt scalastyle on your local machine to clear out style issues.

Ishiihara · 2014-09-23T00:10:32Z

mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala

 */
 @Experimental
-class IDF {
+class IDF(minimumOccurence: Long) {


You can add a val before minimumOccurence. Alternatively, if you want to set minimumOccurence after new IDF(), you can define a private field and use a setter to set the value.


SparkQA · 2014-09-23T00:34:28Z

QA tests have started for PR 2494 at commit 6897252.

This patch merges cleanly.

SparkQA · 2014-09-23T00:35:28Z

QA tests have finished for PR 2494 at commit 6897252.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class IDF(val minimumOccurence: Long)
- class DocumentFrequencyAggregator(val minimumOccurence: Long) extends Serializable

SparkQA · 2014-09-23T00:35:29Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20675/

SparkQA · 2014-09-23T00:39:18Z

QA tests have started for PR 2494 at commit 1801fd2.

This patch merges cleanly.

rnowling · 2014-09-23T00:41:56Z

@Ishiihara Thanks for pointing out the style check -- I found and fixed the style error in IDF.scala.

Thanks for mentioning options for the minimumOccurence members. I decided to add the val keyword rather than a setter. Earlier, I had considered several approaches, including making it an optional parameter and adding a Scala-style setter; however, I found that neither provided clean Java interoperability. As a result, I settled on the overloaded constructor approach, which is also a better match for Scala's emphasis on immutability. Since creating IDF instances is inexpensive, I don't think performance will be an issue.
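The overloaded-constructor approach can be sketched roughly as follows. `IdfSketch` is a hypothetical stand-in and not the actual MLlib class; it only shows an immutable `val` parameter paired with a parameter-less auxiliary constructor.

```scala
// Sketch only: a hypothetical stand-in for the class discussed in this thread.
class IdfSketch(val minimumOccurence: Long) {
  /** Parameter-less constructor so existing `new IdfSketch()` call sites keep working. */
  def this() = this(0L)
}
```

Both `new IdfSketch()` and `new IdfSketch(2L)` are ordinary constructors, so Java callers can use either one. A Scala default argument would not help here, because Java code cannot omit an argument that has a Scala default value.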

SparkQA · 2014-09-23T01:46:24Z

QA tests have finished for PR 2494 at commit 1801fd2.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class IDF(val minimumOccurence: Long)
- class DocumentFrequencyAggregator(val minimumOccurence: Long) extends Serializable

SparkQA · 2014-09-23T01:46:27Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20676/

rnowling · 2014-09-23T12:44:04Z

Jenkins failed because I broke backwards compatibility in DocumentFrequencyAggregator by adding a required parameter to the constructor. I added a second parameter-less constructor that should fix the problem.

SparkQA · 2014-09-23T12:49:19Z

QA tests have started for PR 2494 at commit 1fc09d8.

This patch merges cleanly.

SparkQA · 2014-09-23T13:58:02Z

QA tests have finished for PR 2494 at commit 1fc09d8.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class IDF(val minimumOccurence: Long)
- class DocumentFrequencyAggregator(val minimumOccurence: Long) extends Serializable

SparkQA · 2014-09-23T13:58:05Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20703/

mengxr · 2014-09-23T16:39:55Z

mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala

+ *
+ * @param minimumOccurence minimum of documents in which a term
+ *                         should appear for filtering
+ *


remove extra lines

rnowling · 2014-09-23T21:24:23Z

@mengxr tests failed again.

Here's where the error occurs:

Caused by: hudson.plugins.git.GitException: Command "git fetch --tags --progress https://github.com/apache/spark.git +refs/pull/*:refs/remotes/origin/pr/*" returned status code 143

Is it possible that a change was made to Jenkins that isn't populating the git command correctly?

mengxr · 2014-09-23T22:53:13Z

@rnowling let's retry :)

test this please

rnowling · 2014-09-24T16:36:07Z

@mengxr doesn't look like the tests started -- maybe Jenkins ignores comments that address users? Thanks!

mengxr · 2014-09-25T08:06:07Z

test this please

SparkQA · 2014-09-25T08:09:22Z

QA tests have started for PR 2494 at commit 0aa3c63.

This patch merges cleanly.

SparkQA · 2014-09-25T09:01:33Z

QA tests have finished for PR 2494 at commit 0aa3c63.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class IDF(val minDocFreq: Int)
- class DocumentFrequencyAggregator(val minDocFreq: Int) extends Serializable

AmplabJenkins · 2014-09-25T09:01:36Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20787/

rnowling · 2014-09-25T11:53:58Z

The flume sink test failed -- unrelated to my PR.

mengxr · 2014-09-25T16:37:52Z

@rnowling We have to see Jenkins happy before merge. @tdas @harishreedharan Could you take a look at the failed test? Thanks!

mengxr · 2014-09-25T16:37:58Z

test this please

SparkQA · 2014-09-25T16:44:26Z

QA tests have started for PR 2494 at commit 0aa3c63.

This patch merges cleanly.

SparkQA · 2014-09-25T17:35:34Z

QA tests have finished for PR 2494 at commit 0aa3c63.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class IDF(val minDocFreq: Int)
- class DocumentFrequencyAggregator(val minDocFreq: Int) extends Serializable

AmplabJenkins · 2014-09-25T17:35:38Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20810/

rnowling · 2014-09-25T19:42:17Z

@mengxr Flume test failed again.

And I agree with you -- needs to pass tests (even if failures are unrelated) before we commit. We went through this with one of my recent doc fixes, too. :)

harishreedharan · 2014-09-25T19:48:25Z

I am working on the test that failed in the first run (with the SparkSinkSuite). I didn't write the test (or the suite) that failed in the 2nd run, but I will take a look at it later today.

mengxr · 2014-09-26T08:26:15Z

test this please

SparkQA · 2014-09-26T08:29:29Z

QA tests have started for PR 2494 at commit 0aa3c63.

This patch merges cleanly.

SparkQA · 2014-09-26T09:37:20Z

QA tests have finished for PR 2494 at commit 0aa3c63.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class IDF(val minDocFreq: Int)
- class DocumentFrequencyAggregator(val minDocFreq: Int) extends Serializable

AmplabJenkins · 2014-09-26T09:37:23Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20850/

mengxr · 2014-09-26T17:00:46Z

LGTM (and Jenkins finally). Merged into master. Thanks!

rnowling · 2014-09-26T17:27:00Z

Thanks @mengxr and @Ishiihara !

rnowling added 2 commits September 22, 2014 16:53

Add minimumOccurence filtering to IDF

c0cc643

Remove accidentally-added import from testing

4b974f5

Ishiihara reviewed Sep 22, 2014

Remove unnecessary else statement

a200bab

Ishiihara reviewed Sep 23, 2014

Preface minimumOccurence members with val to make them final and immutable

6897252

Fix style errors in IDF.scala

1801fd2

Add backwards-compatible constructor to DocumentFrequencyAggregator

1fc09d8

mengxr reviewed Sep 23, 2014

asfgit closed this in ec9df6a Sep 26, 2014

rnowling deleted the spark-3614-idf-filter branch September 26, 2014 17:27
