LLMs-as-Judges is a rapidly developing research area, and NeurIPS 2024 featured a wealth of significant papers on the topic. This repository organizes key papers from NeurIPS 2024 across multiple directions, including the latest advances in optimizing for human preferences, model self-improvement, and the application of LLM judges to complex problem solving. Our aim is to provide researchers and developers with the latest theories and methods and to drive further development of the LLMs-as-Judges paradigm.
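For readers new to the paradigm, the sketch below illustrates the basic LLMs-as-Judges pattern that many of the papers listed here build on: a judge model compares two candidate answers and returns a verdict, with a position swap to guard against position bias. This is a minimal illustration under our own assumptions, not the method of any specific paper below; `call_llm` is a hypothetical stand-in for whatever chat-completion client you use.

```python
# Minimal LLMs-as-Judges sketch: pairwise comparison with a position swap
# to mitigate position bias. Related judge biases (e.g. self-preference)
# are studied in papers in this list, such as "LLM Evaluators Recognize
# and Favor Their Own Generations".

JUDGE_PROMPT = """You are an impartial judge. Given a question and two
answers, reply with exactly "A" or "B" for the better answer.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Verdict:"""


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call.

    Replace with your provider's client; here it returns a canned
    verdict so the sketch runs offline.
    """
    return "A"


def judge_pair(question: str, answer_1: str, answer_2: str) -> str:
    """Judge twice with swapped answer positions; return '1', '2', or 'tie'."""
    first = call_llm(JUDGE_PROMPT.format(
        question=question, answer_a=answer_1, answer_b=answer_2)).strip()
    second = call_llm(JUDGE_PROMPT.format(
        question=question, answer_a=answer_2, answer_b=answer_1)).strip()
    # Only a verdict that is consistent across both orderings counts as
    # a real preference; anything else is treated as a tie.
    if first == "A" and second == "B":
        return "1"
    if first == "B" and second == "A":
        return "2"
    return "tie"


if __name__ == "__main__":
    print(judge_pair("What is 2 + 2?", "4", "5"))  # 'tie' with the offline stub
```

The position swap is a common design choice: if the judge's verdict flips when the answers trade places, the "preference" was an artifact of ordering rather than quality.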
- LLM Evaluators Recognize and Favor Their Own Generations. NeurIPS 2024. [Paper]
- Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences. NeurIPS 2024. [Paper]
- Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare. NeurIPS 2024. [Paper]
- Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale. NeurIPS 2024. [Paper]
- Detecting Bugs with Substantial Monetary Consequences by LLM and Rule-based Reasoning. NeurIPS 2024. [Paper]
- A Critical Evaluation of AI Feedback for Aligning Large Language Models. NeurIPS 2024. [Paper]
- Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing. NeurIPS 2024. [Paper]
- Verified Code Transpilation with LLMs. NeurIPS 2024. [Paper]
- JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models. NeurIPS 2024. [Paper]
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search. NeurIPS 2024. [Paper]
- Self-Discover: Large Language Models Self-Compose Reasoning Structures. NeurIPS 2024. [Paper]
- Self-Retrieval: End-to-End Information Retrieval with One Large Language Model. NeurIPS 2024. [Paper]
- LLM-AutoDA: Large Language Model-Driven Automatic Data Augmentation for Long-tailed Problems. NeurIPS 2024. [Paper]
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning. NeurIPS 2024. [Paper]
- RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold. NeurIPS 2024. [Paper]
- INDICT: Code Generation with Internal Dialogues of Critiques for Both Security and Helpfulness. NeurIPS 2024. [Paper]
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph. NeurIPS 2024. [Paper]
- CriticEval: Evaluating Large Language Model as Critic. NeurIPS 2024. [Paper]
- AlphaMath Almost Zero: Process Supervision without Process. NeurIPS 2024. [Paper]
- Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision. NeurIPS 2024. [Paper]
- On scalable oversight with weak LLMs judging strong LLMs. NeurIPS 2024. [Paper]
- ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation. NeurIPS 2024. [Paper]
- StrategyLLM: Large Language Models as Strategy Generators, Executors, Optimizers, and Evaluators for Problem Solving. NeurIPS 2024. [Paper]
- Reflective Multi-Agent Collaboration based on Large Language Models. NeurIPS 2024. [Paper]
- A Theoretical Understanding of Self-Correction through In-context Alignment. NeurIPS 2024. [Paper]
- Training LLMs to Better Self-Debug and Explain Code. NeurIPS 2024. [Paper]
- SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data. NeurIPS 2024. [Paper]
- Recursive Introspection: Teaching Language Model Agents How to Self-Improve. NeurIPS 2024. [Paper]