[BUG] Evaluation function ndcg_at_k returns incorrect values #1749
Comments
I would like to add the following points:
In order to understand it better, I made a small toy example based on your example notebook here and tried to test it with your functions:

import pandas as pd
from recommenders.evaluation.python_evaluation import ndcg_at_k
df_true = pd.DataFrame(
{
"user_id": [1, 1, 1],
"item_id": [1, 2, 3],
"rating": [5, 4, 3]
}
)
df_pred = pd.DataFrame(
{
"user_id": [1, 1, 1],
"item_id": [3, 2, 1],
"prediction": [14, 13, 12]
}
)
K = 3
HEADER = {
"col_user": "user_id",
"col_item": "item_id",
"col_rating": "rating",
"col_prediction": "prediction",
}
print(f"The ndcg at k={K} is {ndcg_at_k(df_true, df_pred, k=K, relevancy_method='top_k', **HEADER)}")
# output: The ndcg at k=3 is 1.0

The result looks really strange because I intentionally reversed the ordering of the recommendations, and as far as I understood, the whole point of using this metric is that if more important/relevant documents are ranked lower by the recommender, the score should be penalized more. I also tried to repeat the same exercise using the Spark evaluator:

from recommenders.evaluation.spark_evaluation import SparkRankingEvaluation
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df_pred_sp = spark.createDataFrame(df_pred)
df_true_sp = spark.createDataFrame(df_true)
print(f"The ndcg at k={K} is {SparkRankingEvaluation(df_true_sp, df_pred_sp, k=K, relevancy_method='top_k', **HEADER).ndcg_at_k()}")
# output: The ndcg at k=3 is 1.0

On a separate note the
Please correct me if I misunderstood. A few references:
Please treat this issue with higher priority, as this is an important metric that is used to compare all the benchmarks.

OFF TOPIC FEATURE REQUEST
Platform details:
Could we have an update here?
@riosv I guess the example on the wiki page is different from our case in terms of how "relevance" is defined. That is, on the wiki page the documents have 4 different relevance scores (0, 1, 2, and 3), while in our case the relevance score is only 0 or 1, where 1 indicates a hit of the recommendation. A 0 leads to no gain here, so when calculating the cumulative gain only the 1s are counted.
@deep-chakraborty in your example, the NDCG is scored as 1.0 because the predictions always find a "relevant" item in the ground truth no matter how you change the order. If you mock an irrelevant item in the predictions, say "item 4", and position it in different places in the prediction dataframe, then you should be able to see the effect of the order.
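For anyone who wants to try that suggestion, here is a sketch continuing the snippet from the first comment (it reuses df_true, K, HEADER and ndcg_at_k from above; "item 4" is assumed not to appear in the ground truth):

# Place an irrelevant item 4 at different positions in the predictions and compare scores.
df_pred_irrelevant_first = pd.DataFrame(
    {
        "user_id": [1, 1, 1],
        "item_id": [4, 3, 2],  # irrelevant item ranked first
        "prediction": [14, 13, 12],
    }
)
df_pred_irrelevant_last = pd.DataFrame(
    {
        "user_id": [1, 1, 1],
        "item_id": [3, 2, 4],  # irrelevant item ranked last
        "prediction": [14, 13, 12],
    }
)
for label, df_p in [("first", df_pred_irrelevant_first), ("last", df_pred_irrelevant_last)]:
    score = ndcg_at_k(df_true, df_p, k=K, relevancy_method="top_k", **HEADER)
    print(f"irrelevant item ranked {label}: ndcg@{K} = {score}")

If the point above holds, the two printed scores should differ, with the lower one when the irrelevant item is ranked first.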
So if I understood you correctly, changing the order of the recommendations shouldn't affect the score? This does not sound very intuitive, to be honest. Also, if the relevance being considered is always 1.0, then that means the ground-truth items don't have any ordering either, whereas they should be ordered according to their "relevance", and the same relevance that is used to order them should be used to calculate the DCG of the recommendations.
Could you please cite a paper or example that you are using as the reference for your implementation, if it's not the wiki?
@deep-chakraborty the implementation was from Spark MLlib, as stated in the docstring. For your question, I think that is a known limitation of NDCG. See the following from the wiki page.
That is to say, in your example all the prediction results get relevance scores of 1, 1, 1, meaning that there is no point in ranking them unless there is a missing or false prediction in the results, i.e., a 0. I think the meaning behind this is that, when we recommend k items to users, we only care about the order when irrelevant items are recommended.
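For concreteness, a quick numeric check of that statement (a sketch assuming a hit-based gain of 1 for every recommended item that appears in the ground truth, and a 1/log2(rank+1) discount):

from math import log2

hits = [1.0, 1.0, 1.0]  # binary relevance of the three recommended items
dcg = sum(rel / log2(i + 2) for i, rel in enumerate(hits))  # rank i+1 -> discount log2(rank+1)
idcg = dcg  # the ideal ranking is also three hits, so the same sum
print(dcg / idcg)  # 1.0, for any permutation of the three items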
Thanks for citing the wiki; I am indeed aware of the limitations of nDCG. I am not sure we understand each other completely. My concern was not regarding "missing" documents but rather the "order" of the documents. I would like to ask again: how do we "sort" the ground truth? I hope we are not confusing "hit" and "ranking" metrics here; it has been correctly defined in the example notebook. The "relevance" should always be defined using the scores/ratings that the user has given. Secondly, about the reference: I don't think you are completely following Spark MLlib's implementation here, as they use the exponential form. In addition to this, I also pointed out an inconsistency in Spark's implementation itself regarding the choice of the base of the logarithm.
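For reference, the two gain variants being compared here can be written down explicitly. This is a generic sketch rather than either library's code, with rel the graded relevance of the item at a given rank (starting at 1):

from math import log2

def dcg(rels, exponential=False):
    # DCG with either linear gain (rel) or exponential gain (2**rel - 1),
    # discounted by log2(rank + 1).
    return sum(
        ((2 ** rel - 1) if exponential else rel) / log2(rank + 1)
        for rank, rel in enumerate(rels, start=1)
    )

# Note: the base of the logarithm only rescales DCG by a constant factor,
# so it cancels out in the nDCG ratio DCG/IDCG.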
Thanks @deep-chakraborty. Here are my thoughts.
Hope it makes sense.
I am struggling to understand these two statements.
If I understood you correctly, you are saying "order only matters in case of irrelevant predicted items". Is that correct?
This is probably a "version" of nDCG implemented only for this repo and does not align with many use cases out there. I think we should take a step back and revisit why we are trying to measure the goodness of the model with this metric anyway. For example, if a user has seen 10 movies out of 1000 movies on Netflix, then it is not fair to assume that the user "is going to interact anyway" with all those 990 movies that he has not watched. In that case, it is very important that the sorting of the recommended movies is "correct". For evaluation purposes, we sort the ground truth, let's say 200 of the remaining 990 movies, according to the ratings (explicit feedback) given by the user, and that is how the "actual relevance" of each movie is taken into account. We ask any general model/algorithm to tell us which movies the user should watch today (first), next week (second), and so on for these 200 movies, based on its own internal interpretation/representation of importance expressed as "predicted ratings". Since we already know in what sequence/order the user would "actually" watch the movies, we wish to compare the two ordered lists/sets, and our goal is to quantify the difference. There are many ways to quantify it, of which nDCG is one.

So if we are unable to measure this "goodness" with the implementation of nDCG done in this repo, then I can't think of any good reason why this implementation should be used, especially considering that the "hit"-related metrics (as shown in the example notebook and in the screenshot) do exactly the same thing. In light of these points, I would strongly suggest taking sklearn.metrics.ndcg_score as the reference instead of Spark MLlib, as they have also taken the relevant papers into account (follow the discussion thread here), OR use it directly under the hood and just have a wrapper that handles the Spark/pandas dataframes.
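As a point of comparison, the reversed-order toy example from the top of the thread can be scored with sklearn directly; a minimal sketch (the relevances 5, 4, 3 and the scores 12, 13, 14 both refer to items 1, 2, 3 in the same order):

from sklearn.metrics import ndcg_score

# rows are users; columns are items 1, 2, 3
y_true = [[5, 4, 3]]      # graded relevance taken from the ratings
y_score = [[12, 13, 14]]  # predictions that reverse the ideal order
print(ndcg_score(y_true, y_score, k=3))  # well below 1.0 (roughly 0.89), penalizing the wrong order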
A sample implementation from my end would be the following:

from math import log2
from typing import List
from typing import Literal
from typing import Optional
from typing import Union
import pandas as pd
from sklearn.metrics import ndcg_score
def _ndcg(
true: List[Union[str, int]],
pred: List[Union[str, int]],
ratings: Optional[List[Union[int, float]]] = None,
k: int = 3,
form: Literal["linear", "exp", "identity"] = "linear",
) -> float:
k_ = min(len(true), len(pred), k)
if ratings:
assert len(ratings) >= k_, "All true labels must have a rating"
assert ratings == sorted(ratings, reverse=True), "Ratings must be sorted in descending order"
rel_ = lambda x: ratings[true.index(x)] if x in true else 0 # noqa
else:
# make "rank" as the rating - useful in case of implicit feedback cases
rel_ = (
lambda x: (len(true) - true.index(x) + 1) if x in true else 0
) # noqa
if form == "exp":
rel = lambda x: 2 ** rel_(x) - 1 # noqa
elif form == "identity":
# this is the current implementation
rel = lambda x: 1.0 if x in true else 0.0 # noqa
else:
rel = rel_
disc = lambda i: log2(i + 2) # noqa
dcg_at_k_ = ideal_dcg_at_k_ = 0
for i, (ele, ele2) in enumerate(zip(pred[:k_], true[:k_])):
dcg_at_k_ += rel(ele) / disc(i) # type: ignore
ideal_dcg_at_k_ += rel(ele2) / disc(i) # type: ignore
return dcg_at_k_ / ideal_dcg_at_k_
def ndcg(
df_true: pd.DataFrame,
df_pred: pd.DataFrame,
user_col_name: str = "user_id",
item_col_name: str = "item_id",
rating_col_name: str = "rating",
prediction_col_name: str = "prediction",
top_k: int = 3,
form: Literal["linear", "exp", "identity"] = "linear",
consider_actual_ratings: bool = False,
show_debug_output: bool = False,
) -> float:
def agg_sort(df: pd.DataFrame, sort_col_name: str):
return (
df.sort_values(by=[user_col_name, sort_col_name], ascending=False)
.groupby(user_col_name)
.agg({item_col_name: list, sort_col_name: list})
.rename(
columns={
item_col_name: "pred" if "prediction" in sort_col_name else "true"
}
)
)
df_pred_ = agg_sort(df_pred, prediction_col_name)
df_true_ = agg_sort(df_true, rating_col_name)
df_final_ = df_true_.join(df_pred_, on=user_col_name, how="inner")
if consider_actual_ratings:
df_final_["ndcg"] = df_final_.apply(
lambda x: _ndcg(x["true"], x["pred"], x["rating"], k=top_k, form=form),
axis=1,
)
else:
df_final_["ndcg"] = df_final_.apply(
lambda x: _ndcg(x["true"], x["pred"], k=top_k, form=form), axis=1
)
if show_debug_output:
print(df_final_)
return df_final_["ndcg"].mean()
if __name__ == "__main__":
df_true = pd.DataFrame(
{
"user_id": [1, 1, 1],
"item_id": ["a", "b", "c"],
"rating": [5, 4, 3],
}
)
df_pred = pd.DataFrame(
{
"user_id": [1, 1, 1],
"item_id": ["a", "b", "c"],
"prediction": [13.1, 14.2, 12.1],
}
)
assert round(ndcg(
df_true,
df_pred,
form="linear",
consider_actual_ratings=True
), 6) == round(
ndcg_score(
y_true = [[5, 4, 3]],
y_score= [[13.1, 14.2, 12.1]]
),
6
)

Of course, there are many corner cases that have not been taken care of in the above implementation, but it is quite simple to understand and is consistent with other benchmark implementations out there (sklearn in this case). Hope it makes things clearer; I would also be happy to contribute if needed.
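To tie this back to the toy example at the top of the thread, the ndcg function sketched above could be exercised on that data as follows (my addition; the approximate values are what the formulas give by hand):

# Reversed-order data from the first comment in this thread
df_true_rev = pd.DataFrame(
    {"user_id": [1, 1, 1], "item_id": [1, 2, 3], "rating": [5, 4, 3]}
)
df_pred_rev = pd.DataFrame(
    {"user_id": [1, 1, 1], "item_id": [3, 2, 1], "prediction": [14, 13, 12]}
)

# Rating-aware nDCG penalizes the reversed ranking...
print(ndcg(df_true_rev, df_pred_rev, form="linear", consider_actual_ratings=True))  # ~0.89
# ...while the hit-based variant does not, mirroring the behaviour reported above.
print(ndcg(df_true_rev, df_pred_rev, form="identity"))  # 1.0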
Quick response - I agree; this is in line with my points in the earlier response. That is, if ratings are considered in relevance, we should involve them in calculating nDCG, and that gives us more in-depth insight into the recommendation quality w.r.t. user preferences, i.e., ratings. NOTE we still want to keep the hits-based approach, because explicit ratings are not always obtainable or reliable. I am thinking of an extension of the existing implementation. Gratitude for your reply with code where a possible implementation is proposed 😄 I guess something we can consider here is to make it generic, e.g., the function
@miguelgfierro @gramhagen @anargyri you may want to take a look at the thread.
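To make the "generic" idea concrete, one hypothetical shape for such an extension is a gain switch on top of the existing hit-based logic; the parameter name and values below are illustrative only, not the actual API:

def gain(rating, score_type="binary"):
    # Hypothetical gain switch: "binary" reproduces the current hit-based behaviour,
    # "raw" uses the rating itself, "exp" uses 2**rating - 1.
    if score_type == "binary":
        return 1.0
    if score_type == "raw":
        return float(rating)
    if score_type == "exp":
        return 2.0 ** rating - 1.0
    raise ValueError(score_type)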
Thank you for your response. I would like to add one point regarding
For that we can simply pass
As @yueguoguo said, we use Spark MLlib definitions; a similar question arose with other metrics like MAP, see #1702.
Thanks a lot for participating in this thread. Regarding the point that you raised -
I think for the datasets (that is, MovieLens in most cases) where we are not in an "implicit feedback" scenario and already have the ratings from the user, it is quite fair to use these ratings to sort the ground-truth list of items, more so when we are using this dataset to benchmark most SOTA reco methods.
I totally agree, and this is indeed the case with my current projects at my job right now.
If there are no objections, I will close this issue since the above PR has been merged. @miguelgfierro @loomlike
Yeah |
Description
The function ndcg_at_k returns incorrect values. For example, using the example described in https://en.wikipedia.org/wiki/Discounted_cumulative_gain:
As described in the wiki page, the correct value of NDCG@6 should be 0.785, but ndcg_at_k returns 1.0. The cause could be that in the function's body, DCG and NDCG are always calculated using 1 as the numerator, while the formula requires the true relevance score. For example, this is how DCG is calculated:

df_dcg["dcg"] = 1 / np.log1p(df_dcg["rank"])
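A hedged sketch of what a graded-gain version of that line could look like, assuming the true rating column has been joined onto df_dcg (the column reference below is an assumption, not the actual code):

# linear gain: use the true relevance instead of a constant 1
df_dcg["dcg"] = df_dcg[col_rating] / np.log1p(df_dcg["rank"])
# or exponential gain, in the 2**rel - 1 formulation
df_dcg["dcg"] = (2 ** df_dcg[col_rating] - 1) / np.log1p(df_dcg["rank"])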
In which platform does it happen?
MacOS
Python 3.9.12
Recommenders 1.1.0
How do we replicate the issue?
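The original replication snippet is not shown above; a reconstruction in the style of the toy example earlier in the thread could look like this (a sketch: D1-D6 are the ranked results with graded relevances 3, 2, 3, 0, 1, 2 from the wiki example, and D7, D8 are the two additional relevant documents used there for the ideal ranking):

import pandas as pd
from recommenders.evaluation.python_evaluation import ndcg_at_k

df_true = pd.DataFrame(
    {
        "user_id": [1] * 8,
        "item_id": ["D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8"],
        "rating": [3, 2, 3, 0, 1, 2, 3, 2],
    }
)
df_pred = pd.DataFrame(
    {
        "user_id": [1] * 6,
        "item_id": ["D1", "D2", "D3", "D4", "D5", "D6"],
        "prediction": [6, 5, 4, 3, 2, 1],  # scores that reproduce the ranking D1..D6
    }
)
HEADER = {
    "col_user": "user_id",
    "col_item": "item_id",
    "col_rating": "rating",
    "col_prediction": "prediction",
}
# Per the wiki page this ranking has NDCG@6 of about 0.785, but the hit-based
# implementation reportedly returns 1.0.
print(ndcg_at_k(df_true, df_pred, k=6, relevancy_method="top_k", **HEADER))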
Expected behavior (i.e. solution)
For the example described above, ndcg_at_k should return 0.785.

Other Comments