add 2 blog posts on llm rel assess-sigir from 2024
Lucas Roberts authored and Lucas Roberts committed Jan 2, 2025
1 parent 7a2e3d6 commit 99aa9b0
Showing 5 changed files with 103 additions and 60 deletions.
20 changes: 0 additions & 20 deletions _posts/2013-08-14-blog-post-2.md

This file was deleted.

20 changes: 0 additions & 20 deletions _posts/2014-08-14-blog-post-3.md

This file was deleted.

20 changes: 0 additions & 20 deletions _posts/2015-08-14-blog-post-4.md

This file was deleted.

58 changes: 58 additions & 0 deletions _posts/2025-01-02-llm-rel-assess-notes
@@ -0,0 +1,58 @@
---
title: 'Large Language Models can Accurately Predict
Searcher Preferences'
date: 2025-01-02
permalink: /posts/2025/01/blog-post-1/
tags:
- IR
- LLM
- Annotations
- paper-notes
---


The main idea of the paper is to use an
[LLM as a judge](https://arxiv.org/abs/2306.05685), an approach initially
described by researchers working on LLMs rather than IR.
The technique has nonetheless often proven useful, and there are probably
other ways it can be leveraged that have yet to be discovered.

We focus on the [search preference paper](https://arxiv.org/abs/2309.10621);
incidentally, the authors work on the Bing search engine at Microsoft.

LLM-as-a-judge is a large and varied topic: it may work well in some
domains and poorly in others, depending on the specifics of the task
requiring labels and of the system used to generate the labels. This paper
focuses on the search domain, with queries and relevance assessments.

In the paper, the authors investigate five prompt components
called R, D, N, A, and M, and look at various arrangements of these;
each component corresponds to some text that may be included in the
prompt. For performance metrics they report Cohen's kappa, MAE, and AUC
on a stratified sample of 3,000 query-document pairs, cf. Table 1. To get
confidence intervals on these metrics the authors bootstrapped. The DNA
prompt variant, which includes the D, N, and A text in the prompt, had the
highest Cohen's kappa, 0.64.
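
As an illustration of the bootstrapping step, here is a minimal sketch
(not the authors' code) of a percentile-bootstrap confidence interval for
Cohen's kappa; the `human` and `llm` label arrays are assumed inputs.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def bootstrap_kappa_ci(human, llm, n_boot=1000, seed=0):
    """Percentile-bootstrap 95% CI for Cohen's kappa over paired labels."""
    rng = np.random.default_rng(seed)
    human, llm = np.asarray(human), np.asarray(llm)
    kappas = []
    for _ in range(n_boot):
        # Resample query-document pairs with replacement.
        idx = rng.integers(0, len(human), size=len(human))
        kappas.append(cohen_kappa_score(human[idx], llm[idx]))
    return np.percentile(kappas, [2.5, 97.5])
```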

In Section 4.5 they also investigate whether query difficulty correlates
with human assessments, using the Precision@10 metric. The idea is
interesting; one critique is that in practical search scenarios you may not
have ground-truth relevance assessments on which to determine relevance.
They retrieve to depth 100 on a TREC dataset and use RBO (rank-biased
overlap) to compare the different ranked lists; a sketch of RBO follows.
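
Here is a minimal sketch of RBO truncated at a fixed depth, assuming two
ranked lists of document IDs; this is the finite-depth form, without the
extrapolation in the original RBO paper, and is not the authors' code.

```python
def rbo(run_a, run_b, p=0.9, depth=100):
    """Truncated RBO: (1 - p) * sum over d of p^(d-1) * A_d, where A_d is
    the overlap of the two top-d prefixes divided by d."""
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        if d <= len(run_a):
            seen_a.add(run_a[d - 1])
        if d <= len(run_b):
            seen_b.add(run_b[d - 1])
        score += (p ** (d - 1)) * len(seen_a & seen_b) / d
    return (1 - p) * score
```

For the Bing search engine the authors report: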


"We have been using LLMs, in conjunction with expert human
labellers, for most of our offline metrics since late 2022."

and find good utility in doing so. The ground truth corpus they use comprises
queries, descriptions of need, metadata like location and date, and at least
two example results per query.
Results are tagged—again, by the real searcher—as being good, neutral, or bad.

The Bing team monitors the health of the system by sampling LLM-judged
results and reviewing them to track error rates and similar metrics. They
also mention that in both the TREC and Bing datasets the task descriptions
are very clear, and that they use a single LLM, whereas
[Liang et al.](https://arxiv.org/abs/2211.09110) saw
large differences from model to model over a range of tasks.
45 changes: 45 additions & 0 deletions _posts/2025-01-02-umbrella-notes
@@ -0,0 +1,45 @@
---
title: 'UMBRELA paper notes'
date: 2025-01-02
permalink: /posts/2025/01/blog-post-2/
tags:
- IR
- LLM
- Annotations
- paper-notes
---


The paper [UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing
RELevance Assessor](https://arxiv.org/abs/2406.06519) looks at how LLMs can be
leveraged for relevance assessments.
Prior art in this area includes
[work at Microsoft](https://arxiv.org/pdf/2309.10621), which the authors cite.

The main idea of the paper is to use an
[LLM as a judge](https://arxiv.org/abs/2306.05685), an approach initially
described by researchers working on LLMs rather than IR.
The technique has nonetheless often proven useful, and there are probably
other ways it can be leveraged that have yet to be discovered.

The paper is an open-source reimplementation of work by researchers at
Microsoft working on Bing, a paper I [put notes up on recently]().

The authors used OpenAI models and the Microsoft version as points of
comparison. There do not appear to be comparisons across open- and
closed-source LLMs for this task as of yet. For metrics the authors use
NDCG@10, which differs from the Bing team's paper. The authors evaluate
the method on the TREC DL tracks from 2019-2023. The prompt is documented
both in the paper and in the GitHub repo linked below.
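
As a quick refresher (a sketch, not the evaluation code from the paper),
NDCG@10 can be computed from graded relevance gains listed in ranked
order:

```python
import math

def dcg_at_k(gains, k=10):
    """DCG@k over graded gains in ranked order: sum of g / log2(rank + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k=10):
    """Normalize by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```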

The case study of two query-passage pairs highlights the challenges and
benefits of the approach. The cases demonstrate ambiguity that could be
the result of either inaccurate assessments or incomplete information
about the user intent.

The paper is accompanied by an
[associated GitHub repo](https://github.com/castorini/umbrela), which is
great for reproducibility.
While I like that the system as implemented allows reproducible research,
the dependencies include the JDK and all of Pyserini.
While this is certainly an implementation detail, it would be nice to see
some sort of plugin architecture leveraged for extensions like this one.
