Fix formula in gensim.summarization.bm25. Fix #1828 #1833

Merged
merged 16 commits into from
Jan 11, 2018
4 changes: 3 additions & 1 deletion gensim/summarization/bm25.py
@@ -78,12 +78,14 @@ def __init__(self, corpus):
self.f = []
self.df = {}
self.idf = {}
+ self.doc_len = []
self.initialize()

def initialize(self):
"""Calculates frequencies of terms in documents and in corpus. Also computes inverse document frequencies."""
for document in self.corpus:
frequencies = {}
+ (self.doc_len).append(len(document))
Contributor

The parentheses around self.doc_len are unnecessary; please remove them.

for word in document:
if word not in frequencies:
frequencies[word] = 0
@@ -122,7 +124,7 @@ def get_score(self, document, index, average_idf):
continue
idf = self.idf[word] if self.idf[word] >= 0 else EPSILON * average_idf
score += (idf * self.f[index][word] * (PARAM_K1 + 1)
- / (self.f[index][word] + PARAM_K1 * (1 - PARAM_B + PARAM_B * len(document) / self.avgdl)))
+ / (self.f[index][word] + PARAM_K1 * (1 - PARAM_B + PARAM_B * self.doc_len[index] / self.avgdl)))
Contributor

Please add several simple tests for BM25.
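
A minimal sketch of what such tests might check, not the tests that were actually added. It assumes a simplified stand-in class (`TinyBM25`, a hypothetical name) rather than gensim's real one, fixes every IDF to 1 so that only the length normalization is exercised, and assumes the values `PARAM_K1 = 1.5` and `PARAM_B = 0.75`, which this diff does not show. `get_score_old` is also hypothetical and reproduces the pre-fix behaviour, where the length of the `document` argument is used instead of the length of the indexed document.

```python
import math

# Assumed values; the diff references these names but does not show them.
PARAM_K1 = 1.5
PARAM_B = 0.75


class TinyBM25:
    """Hypothetical, simplified stand-in for the class in the diff (IDF fixed to 1)."""

    def __init__(self, corpus):
        self.f = []        # per-document term frequencies
        self.doc_len = []  # per-document lengths, as introduced by this change
        for document in corpus:
            frequencies = {}
            self.doc_len.append(len(document))
            for word in document:
                frequencies[word] = frequencies.get(word, 0) + 1
            self.f.append(frequencies)
        self.avgdl = sum(self.doc_len) / len(corpus)

    def _weight(self, word, index, length):
        # Term-frequency part of the score, normalized by `length`.
        tf = self.f[index].get(word, 0)
        norm = PARAM_K1 * (1 - PARAM_B + PARAM_B * length / self.avgdl)
        return tf * (PARAM_K1 + 1) / (tf + norm)

    def get_score(self, document, index):
        # After the fix: normalize by the length of the indexed document.
        return sum(self._weight(w, index, self.doc_len[index]) for w in document)

    def get_score_old(self, document, index):
        # Before the fix: normalize by the length of the `document` argument.
        return sum(self._weight(w, index, len(document)) for w in document)


def test_longer_document_scores_lower():
    # Both documents contain "cat" once; the longer one should score lower.
    bm25 = TinyBM25([["cat"], ["cat", "dog", "dog", "dog"]])
    assert bm25.get_score(["cat"], 0) > bm25.get_score(["cat"], 1)


def test_score_ignores_unmatched_words():
    # Adding a word that the indexed document lacks should not change the score.
    bm25 = TinyBM25([["cat"], ["cat", "dog", "dog", "dog"]])
    padded = bm25.get_score(["cat", "zebra"], 0)
    assert math.isclose(padded, bm25.get_score(["cat"], 0))
```

With the pre-fix normalization, both documents in this corpus receive the same score for `["cat"]`, so the first test is the one that distinguishes the two versions.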

return score

def get_scores(self, document, average_idf):