[SPARK-3143][MLLIB] add tf-idf user guide #2061
QA tests have started for PR 2061 at commit
QA tests have finished for PR 2061 at commit
vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus.
Denote a term by `$t$`, a document by `$d$`, and the corpus by `$D$`.
Term frequency `$TF(t, d)$` is the number of times that term `$t$` appears in document `$d$`.
And document frequency `$DF(t, D)$` is the number of documents that contains term `$t$`.
"...`$d$`. And..." -> "...`$d$`, while..."
done
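The two definitions quoted in the hunk above can be illustrated with a minimal sketch in plain Python. This is not the MLlib API; the function names `term_frequency` and `document_frequency` and the toy corpus are hypothetical, and documents are assumed to be already tokenized into lists of strings.

```python
from collections import Counter

def term_frequency(term, document):
    """TF(t, d): number of times `term` appears in `document` (a list of tokens)."""
    return Counter(document)[term]

def document_frequency(term, corpus):
    """DF(t, D): number of documents in `corpus` that contain `term`."""
    return sum(1 for document in corpus if term in document)

# Hypothetical toy corpus, already tokenized; not taken from the PR.
corpus = [
    ["spark", "mllib", "spark"],
    ["tf", "idf", "spark"],
    ["word2vec", "mllib"],
]

print(term_frequency("spark", corpus[0]))   # 2
print(document_frequency("spark", corpus))  # 2
print(document_frequency("mllib", corpus))  # 2
```

Note that "spark" has TF 2 in the first document but still counts only once toward DF for that document, which is the distinction the two definitions draw.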
QA tests have started for PR 2061 at commit
QA tests have finished for PR 2061 at commit
## Word2Vec

Word2Vec computes distributed vector representation of words. The main advantage of the distributed
[Word2Vec](https://code.google.com/p/word2vec/) computes distributed vector representation of words.
What does "distributed" mean in "distributed vector representation"? Does it refer to the fact that the computation is distributed? If so, could we say "...computes vector representation of words in a distributed fashion."
It is used in the original paper and the term "distributed" is from http://www.indiana.edu/~clcl/BEAGLE/Jones_Mewhort_PR.pdf . I have trouble understanding "distributed vector representation" as well. I think "distributed" means we map a single word to multiple values ....
This is independent of this PR. Does the current doc look good to you?
yes, the TF-IDF stuff LGTM.
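On the question raised in this thread about what "distributed" means: one common reading, assumed here rather than settled by the discussion, is that each word is encoded across many vector components instead of a single one, and has nothing to do with distributed computation. The sketch below contrasts a one-hot ("local") encoding with a dense ("distributed") one. The vocabulary and all numbers are made up for illustration and are not Word2Vec output.

```python
# Hypothetical three-word vocabulary, for illustration only.
vocabulary = ["king", "queen", "apple"]

def one_hot(word):
    """Local representation: exactly one non-zero entry per word."""
    return [1.0 if word == v else 0.0 for v in vocabulary]

# Distributed representation: every word uses all dimensions.
# These values are invented, not trained embeddings.
dense = {
    "king": [0.8, 0.1, 0.6],
    "queen": [0.7, 0.2, 0.7],
    "apple": [0.1, 0.9, 0.0],
}

def dot(u, v):
    """Dot product, used here as a crude similarity score."""
    return sum(a * b for a, b in zip(u, v))

# One-hot vectors of different words never overlap.
print(dot(one_hot("king"), one_hot("queen")))  # 0.0

# Dense vectors can place related words closer together.
print(dot(dense["king"], dense["queen"]) > dot(dense["king"], dense["apple"]))  # True
```

Under this reading, the phrase describes the shape of the output vectors, which is consistent with the guess in the comment above that a single word is mapped to multiple values.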
Moved TF-IDF before Word2Vec because the former is more basic. I also added a link for Word2Vec. atalwalkar

Author: Xiangrui Meng <[email protected]>

Closes #2061 from mengxr/tfidf-doc and squashes the following commits:

ca04c70 [Xiangrui Meng] address comments
a5ea4b4 [Xiangrui Meng] add tf-idf user guide

(cherry picked from commit e157187)
Signed-off-by: Xiangrui Meng <[email protected]>
I've merged this into master and branch-1.1. Thanks @atalwalkar for reviewing!