Commit 99aa9b0 (parent 7a2e3d6), authored and committed by Lucas Roberts on Jan 2, 2025: "add 2 blog posts on llm rel assess-sigir from 2024". 5 changed files, 103 additions, 60 deletions.
---
title: 'Large Language Models can Accurately Predict Searcher Preferences'
date: 2025-01-02
permalink: /posts/2025/01/blog-post-1/
tags:
- IR
- LLM
- Annotations
- paper-notes
---
The main idea of the paper is to use an
[LLM as a judge](https://arxiv.org/abs/2306.05685), a technique initially
described by researchers working on LLMs rather than IR.
The technique has nonetheless often proven useful, and there are probably
other ways it can be leveraged that have yet to be discovered.
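As a concrete illustration, an LLM-as-judge setup for relevance might look like the sketch below. The prompt wording and the `call_llm` callable are hypothetical stand-ins, not the paper's actual prompt or API:

```python
# Minimal LLM-as-judge sketch for query-document relevance.
# `call_llm` is a hypothetical stand-in for any chat-completion API.

def build_judge_prompt(query: str, document: str) -> str:
    return (
        "You are a search quality rater. Given a query and a result,\n"
        "rate how well the result satisfies the query.\n"
        "Answer with a single integer: 0 (irrelevant) to 3 (perfect).\n\n"
        f"Query: {query}\n"
        f"Result: {document}\n"
        "Rating:"
    )

def judge(query: str, document: str, call_llm) -> int:
    raw = call_llm(build_judge_prompt(query, document))
    # parse the first digit out of the model's free-text reply
    digits = [c for c in raw if c.isdigit()]
    return int(digits[0]) if digits else 0  # fall back to 0 if unparseable
```

In practice the parsing step matters as much as the prompt: models often wrap the label in extra text, so a robust judge extracts the label rather than trusting the raw completion.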
We focus on the [search preference paper](https://arxiv.org/abs/2309.10621);
incidentally, the authors work on the Bing search engine at Microsoft.
Using LLMs as judges is a large and varied topic; the approach may work well in
some domains and poorly in others. It really depends on the specifics of the
task requiring labels and of the system used to generate the labels. This paper
focuses on the search domain, with queries and relevance assessments.
In the paper, the authors investigate 5 prompt variations
called R, D, N, A, and M, and look at various arrangements of these; each
variation corresponds to some text that may be included in the prompt. For
performance metrics they report Cohen's kappa, MAE, and AUC on a stratified
sample of 3000 query-document pairs, cf. Table 1. To get confidence intervals
on these metrics the authors bootstrapped. The DNA prompt variant, which
includes the D, N, and A text fragments, had the highest Cohen's kappa
at 0.64.
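To make the evaluation step concrete, here is a small sketch (not the authors' code) of bootstrapping a percentile confidence interval for Cohen's kappa over paired labels; the label values and sample sizes are made up:

```python
import random
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    if expected == 1:  # degenerate case: both raters used one label
        return 1.0
    return (observed - expected) / (1 - expected)

def bootstrap_kappa_ci(a, b, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for kappa, resampling pairs with replacement."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohens_kappa([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Resampling whole query-document pairs (rather than labels independently) preserves the pairing between human and LLM judgments, which is what the interval is about.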
In Section 4.5 they also investigate whether query difficulty correlates with
human assessments using the Precision@10 metric. The idea is interesting,
though one critique is that in practical search scenarios you may not have
ground truth relevance assessments against which to determine relevance.
They retrieve to depth 100 on a TREC dataset and use RBO (rank-biased overlap)
to compare the different lists. For the Bing search engine the authors report:
> "We have been using LLMs, in conjunction with expert human
> labellers, for most of our offline metrics since late 2022."
and find good utility in doing so. The ground truth corpus they use comprises
queries, descriptions of need, metadata like location and date, and at least
two example results per query.
Results are tagged, again by the real searcher, as good, neutral, or bad.
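RBO, used above to compare the retrieved lists, can be sketched as follows; this is a truncated version of the measure for duplicate-free rankings, not the paper's implementation:

```python
def rbo(list_a, list_b, p=0.9):
    """Truncated rank-biased overlap: (1 - p) * sum_d p^(d-1) * overlap@d / d.
    Assumes each list has no duplicate items. Higher p weights deeper ranks
    more heavily; the result lies in [0, 1]."""
    k = min(len(list_a), len(list_b))
    seen_a, seen_b = set(), set()
    overlap, score = 0, 0.0
    for d in range(1, k + 1):
        x, y = list_a[d - 1], list_b[d - 1]
        # update the running size of the intersection of the depth-d prefixes
        if x == y:
            overlap += 1
        else:
            overlap += (x in seen_b) + (y in seen_a)
        seen_a.add(x)
        seen_b.add(y)
        score += p ** (d - 1) * overlap / d
    return (1 - p) * score
```

Note that even two identical lists of finite length score below 1 under this truncated form (1 - p^k for length k); the full measure extrapolates the tail.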
The Bing team monitors the health of the system by sampling LLM-judged results
and reviewing those to track error rates and similar metrics. They also note
that in both the TREC and Bing datasets the task descriptions are very clear,
and that they use a single LLM, whereas
[Liang et al.](https://arxiv.org/abs/2211.09110) saw
large differences from model to model over a range of tasks.
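That monitoring loop can be sketched as below; the data shapes (dicts mapping item id to label) and the `audit_sample` helper are hypothetical illustrations, not anything described in the paper:

```python
import random

def audit_sample(llm_labels, human_audit, sample_size=100, seed=0):
    """Sample LLM-judged items, compare against a human audit of the same
    items, and return the disagreement rate. Both arguments are dicts
    mapping item id -> label (hypothetical data shape)."""
    rng = random.Random(seed)
    ids = rng.sample(list(llm_labels), k=min(sample_size, len(llm_labels)))
    disagreements = sum(llm_labels[i] != human_audit[i] for i in ids)
    return disagreements / len(ids)
```

Tracking this rate over time is what flags drift: a rising disagreement rate signals that the LLM judge or the traffic distribution has shifted.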
---
title: 'Umbrela paper notes'
date: 2025-01-02
permalink: /posts/2025/01/blog-post-2/
tags:
- IR
- LLM
- Annotations
- paper-notes
---
The paper [UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing
RELevance Assessor](https://arxiv.org/abs/2406.06519) looks at how LLMs can be
leveraged for relevance assessments.
Prior art in this area includes
[work at Microsoft](https://arxiv.org/pdf/2309.10621), which the authors cite.
The main idea of the paper is to use an
[LLM as a judge](https://arxiv.org/abs/2306.05685), a technique initially
described by researchers working on LLMs rather than IR.
The technique has nonetheless often proven useful, and there are probably
other ways it can be leveraged that have yet to be discovered.
The paper is an open-source reimplementation of a paper by researchers at
Microsoft working on Bing, a paper I [put notes up on recently]().
The authors use OpenAI models and the Microsoft version as points of
comparison. There do not yet appear to be comparisons across open- and
closed-source LLMs for this task. For metrics the authors use NDCG@10, which
differs from the Bing team's paper. They evaluate the method on the TREC DL
tracks from 2019-2023. The prompt is documented both in the paper and in the
GitHub repo; see below.
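For reference, NDCG@10 can be sketched as below; this uses the linear-gain DCG variant (the default in tools like trec_eval), not necessarily the exact formulation in the paper:

```python
import math

def ndcg_at_k(relevances, ideal_relevances, k=10):
    """NDCG@k with linear gain and the standard log2 rank discount.
    `relevances` are graded labels in system-ranked order;
    `ideal_relevances` are the same labels (any order, sorted internally)."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(ideal_relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

The normalization by the ideal ordering is what makes scores comparable across queries with different numbers of relevant documents.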
The case study of two query-passage pairs highlights the challenges and
benefits of the approach. The cases demonstrate ambiguity that
could result from either inaccurate assessments or
incomplete information about the user intent.
The paper is accompanied by an
[associated GitHub repo](https://github.com/castorini/umbrela), which is
great for reproducibility.
While I like that the system as implemented allows reproducible research, the
dependencies include the JDK and all of Pyserini.
This is certainly an implementation detail, but it would be nice to see some
sort of plugin architecture leveraged for extensions like this one.