add 2 blog posts on llm rel assess-sigir from 2024
Lucas Roberts authored and Lucas Roberts committed Jan 2, 2025
1 parent 7a2e3d6 commit 99aa9b0
Showing 5 changed files with 103 additions and 60 deletions.
20 changes: 0 additions & 20 deletions _posts/2013-08-14-blog-post-2.md

This file was deleted.

20 changes: 0 additions & 20 deletions _posts/2014-08-14-blog-post-3.md

This file was deleted.

20 changes: 0 additions & 20 deletions _posts/2015-08-14-blog-post-4.md

This file was deleted.

58 changes: 58 additions & 0 deletions _posts/2025-01-02-llm-rel-assess-notes
@@ -0,0 +1,58 @@
---
title: 'Large Language Models can Accurately Predict
Searcher Preferences'
date: 2025-01-02
permalink: /posts/2025/01/blog-post-1/
tags:
- IR
- LLM
- Annotations
- paper-notes
---


The main idea of the paper is to use an
[LLM as a judge](https://arxiv.org/abs/2306.05685), an approach initially
described by researchers working on LLMs rather than IR.
The technique has nonetheless often proven useful, and there are probably
other ways it can be leveraged that have yet to be discovered.

We focus on the [search preference paper](https://arxiv.org/abs/2309.10621);
incidentally, the authors work on the Bing search engine at Microsoft.

LLM-as-a-judge is a large and varied topic: it may work well in some
domains and poorly in others, depending on the specifics of the task
requiring labels and of the system used to generate the labels. This paper
focuses on the search domain, with queries and relevance assessments.

In the paper, the authors investigate five prompt components
called R, D, N, A, and M, and look at various arrangements of these;
each component corresponds to some text that may be included in the
prompt. For performance metrics they report Cohen's kappa, MAE, and AUC
on a stratified sample of 3,000 query-document pairs, cf. Table 1. To get
confidence intervals on these metrics the authors bootstrapped. The DNA
prompt variant, which includes the D, N, and A text in the prompt, had the
highest Cohen's kappa, 0.64.
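
As an illustration of the bootstrapping step, here is a minimal sketch
(not the authors' code) of a percentile-bootstrap confidence interval for
Cohen's kappa; the `human` and `llm` label arrays are assumed inputs.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def bootstrap_kappa_ci(human, llm, n_boot=1000, seed=0):
    """Percentile-bootstrap 95% CI for Cohen's kappa over paired labels."""
    rng = np.random.default_rng(seed)
    human, llm = np.asarray(human), np.asarray(llm)
    kappas = []
    for _ in range(n_boot):
        # Resample query-document pairs with replacement.
        idx = rng.integers(0, len(human), size=len(human))
        kappas.append(cohen_kappa_score(human[idx], llm[idx]))
    return np.percentile(kappas, [2.5, 97.5])
```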

In Section 4.5 they also investigate whether query difficulty correlates
with human assessments, using the Precision@10 metric. The idea is
interesting; one critique is that in practical search scenarios you may not
have ground-truth relevance assessments on which to determine relevance.
They retrieve to depth 100 on a TREC dataset and use RBO (rank-biased
overlap) to compare the different ranked lists; a sketch of RBO follows.
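
Here is a minimal sketch of RBO truncated at a fixed depth, assuming two
ranked lists of document IDs; this is the finite-depth form, without the
extrapolation in the original RBO paper, and is not the authors' code.

```python
def rbo(run_a, run_b, p=0.9, depth=100):
    """Truncated RBO: (1 - p) * sum over d of p^(d-1) * A_d, where A_d is
    the overlap of the two top-d prefixes divided by d."""
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        if d <= len(run_a):
            seen_a.add(run_a[d - 1])
        if d <= len(run_b):
            seen_b.add(run_b[d - 1])
        score += (p ** (d - 1)) * len(seen_a & seen_b) / d
    return (1 - p) * score
```

For the Bing search engine the authors report: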


"We have been using LLMs, in conjunction with expert human
labellers, for most of our offline metrics since late 2022."

and find good utility in doing so. The ground truth corpus they use comprises
queries, descriptions of need, metadata like location and date, and at least
two example results per query.
Results are tagged—again, by the real searcher—as being good, neutral, or bad.

The Bing team monitors the health of the system by sampling LLM-judged
results and reviewing them to track error rates and similar metrics. They
also mention that in both the TREC and Bing datasets the task descriptions
are very clear, and that they use a single LLM, whereas
[Liang et al.](https://arxiv.org/abs/2211.09110) saw
large differences from model to model over a range of tasks.
45 changes: 45 additions & 0 deletions _posts/2025-01-02-umbrella-notes
@@ -0,0 +1,45 @@
---
title: 'UMBRELA paper notes'
date: 2025-01-02
permalink: /posts/2025/01/blog-post-2/
tags:
- IR
- LLM
- Annotations
- paper-notes
---


The paper [UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing
RELevance Assessor](https://arxiv.org/abs/2406.06519) looks at how LLMs can be
leveraged for relevance assessments.
Prior art in this area includes
[work at Microsoft](https://arxiv.org/pdf/2309.10621), which the authors cite.

The main idea of the paper is to use an
[LLM as a judge](https://arxiv.org/abs/2306.05685), an approach initially
described by researchers working on LLMs rather than IR.
The technique has nonetheless often proven useful, and there are probably
other ways it can be leveraged that have yet to be discovered.

The paper is an open-source reimplementation of work by researchers at
Microsoft working on Bing, a paper I [put notes up on recently]().

The authors used OpenAI models and the Microsoft version as points of
comparison. There do not appear to be comparisons across open- and
closed-source LLMs for this task as of yet. For metrics the authors use
NDCG@10, which differs from the Bing team's paper. The authors evaluate
the method on the TREC DL tracks from 2019-2023. The prompt is documented
both in the paper and in the GitHub repo linked below.
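
As a quick refresher (a sketch, not the evaluation code from the paper),
NDCG@10 can be computed from graded relevance gains listed in ranked
order:

```python
import math

def dcg_at_k(gains, k=10):
    """DCG@k over graded gains in ranked order: sum of g / log2(rank + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k=10):
    """Normalize by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```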

The case study of two query-passage pairs highlights the challenges and
benefits of the approach. The cases demonstrate ambiguity that could be
the result of either inaccurate assessments or incomplete information
about the user intent.

The paper is accompanied by an
[associated GitHub repo](https://github.com/castorini/umbrela), which is
great for reproducibility.
While I like that the system as implemented allows reproducible research,
the dependencies include the JDK and all of Pyserini.
While this is certainly an implementation detail, it would be nice to see
some sort of plugin architecture leveraged for extensions like this one.
