This repository accompanies our SIGIR-AP'24 paper "Data Fusion of Synthetic Query Variants With Generative Large Language Models". It contains the code and results that make the experiments transparent and reproducible. All experiments can be reproduced with the provided Jupyter notebooks. Before rerunning the code, the test collections have to be obtained and placed in the correct directories so that they can be indexed with `ir_datasets` (a minimal loading sketch follows the collection list below). Please refer to the corresponding links below for more details about the data preparation and for direct access to the generated query variants.
Core17: [New York Times Annotated Corpus] | [qrels] | [topics] | [ir_datasets] | [query variants]
Core18: [TREC Washington Post Corpus] | [qrels] | [topics] | [ir_datasets] | [query variants]
Robust04: [TREC disks 4 and 5] | [qrels] | [topics] | [ir_datasets] | [query variants]
Robust05: [The AQUAINT Corpus of English News Text] | [qrels] | [topics] | [ir_datasets] | [query variants]
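Once a collection is in place, it can be accessed through `ir_datasets`. The sketch below loads Robust04 via its standard `ir_datasets` identifier `disks45/nocr/trec-robust-2004`; the exact IDs and directory layout expected by the notebooks may differ, so treat this only as a minimal example.

```python
import ir_datasets

# Minimal sketch: load Robust04 via its standard ir_datasets identifier.
# The identifiers used in the notebooks may differ; adjust to your setup.
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004")

# Iterate over topics (queries) and relevance judgments.
for query in dataset.queries_iter():
    print(query.query_id, query.title)
    break

for qrel in dataset.qrels_iter():
    print(qrel.query_id, qrel.doc_id, qrel.relevance)
    break

# Documents become available once TREC disks 4 and 5 are placed where
# ir_datasets expects them (see the ir_datasets documentation).
for doc in dataset.docs_iter():
    print(doc.doc_id)
    break
```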
Considering query variance in Information Retrieval (IR) experiments is beneficial for retrieval effectiveness. In particular, ranking ensembles based on different topically related queries retrieve better results than rankings based on a single query alone. Recently, generative instruction-tuned Large Language Models (LLMs) have improved on a variety of tasks that involve capturing human language. To this end, this work explores the feasibility of using synthetic query variants generated by instruction-tuned LLMs in data fusion experiments. More specifically, we introduce a lightweight, unsupervised, and cost-efficient approach that exploits principled prompting and data fusion techniques. In our experiments, LLMs produce more effective queries when provided with additional context information on the topic. Furthermore, our analysis based on four TREC newswire benchmarks shows that data fusion based on synthetic query variants is significantly better than baselines with single queries and also outperforms pseudo-relevance feedback methods. We publicly share the code and query datasets with the community as resources for follow-up studies.
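To make the idea of fusing rankings obtained from query variants concrete, the following sketch applies reciprocal rank fusion (RRF) to several per-variant rankings. This is purely illustrative: RRF is only one of many unsupervised fusion methods, and the fusion techniques actually used in the experiments are implemented in the notebooks and may differ.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings (lists of doc ids, best first) into one.

    Illustrative only: the experiments use the fusion methods implemented
    in the notebooks, which may differ from plain RRF.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: three rankings retrieved with three query variants of one topic.
fused = reciprocal_rank_fusion([
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d3", "d2", "d5"],
])
print(fused)  # documents retrieved by several variants move up
```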
The query variant datasets can be found in the `queries/` directory. To make the datasets interoperable, they are stored as csv files (separated with `;`). Each line follows the format `<query number>;<prompt strategy>;<topic number>;<query string>`, e.g., `1;P-1;303;Hubble Telescope discoveries`. The datasets can also be accessed with the links provided above.
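For convenience, the query files can be read directly with pandas. A minimal sketch, assuming one of the files in `queries/` (the file name below is a placeholder; the column names are assigned to match the documented line format):

```python
import pandas as pd

# Minimal sketch: read one of the query variant files from queries/.
variants = pd.read_csv(
    "queries/robust04.csv",          # placeholder path, adjust to the actual file
    sep=";",
    header=None,                     # drop this if the file contains a header row
    names=["query_number", "prompt_strategy", "topic_number", "query_string"],
)

# Group the variants by topic, e.g., to retrieve one ranking per variant later on.
for topic, group in variants.groupby("topic_number"):
    print(topic, group["query_string"].tolist())
    break
```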
| Directory | Description |
|---|---|
| `figures/` | Figures of the paper. |
| `indices/` | Empty directory, indices will be created here. |
| `qrels/` | Qrels files of Core17, Core18, Robust04, and Robust05. |
| `queries/` | Query datasets. |
| `runs/` | Fused rankings and baselines. |
| `src/` | Notebooks for running the experiments (more details below). |
| `topics/` | Topic files of Core17, Core18, Robust04, and Robust05. |
To rerun the query generation, experiments, and evaluations, execute the notebooks in the order listed in the table below. Before generating the queries, make sure to store your OpenAI credentials in an environment variable.
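A minimal sketch for providing the credentials, assuming the notebooks read the key from the conventional `OPENAI_API_KEY` variable (check `query_generation.ipynb` for the exact variable name expected):

```python
import os

# Minimal sketch: expose the OpenAI key to the notebooks via an environment
# variable. OPENAI_API_KEY is the conventional name used by the OpenAI client;
# verify the exact name used in query_generation.ipynb.
os.environ["OPENAI_API_KEY"] = "<your-api-key>"
```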
| Notebook | Description |
|---|---|
| `query_generation.ipynb` | Generate queries with different prompts. |
| `datasets_indexing.ipynb` | Index the datasets of Core17, Core18, Robust04, and Robust05. |
| `retrieval_and_data_fusion.ipynb` | Retrieve rankings and fuse them. |
| `evaluations.ipynb` | Evaluate fused retrieval effectiveness with different prompts and with different numbers of queries. |
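As a rough illustration of the evaluation step, the fused run files in `runs/` can be scored against the qrels in `qrels/` with `ir_measures`. The measures and file names below are placeholders; `evaluations.ipynb` defines the actual setup.

```python
import ir_measures
from ir_measures import AP, nDCG, P

# Minimal sketch: score a fused run against the corresponding qrels.
# File names are placeholders; see evaluations.ipynb for the actual setup.
qrels = ir_measures.read_trec_qrels("qrels/robust04.txt")
run = ir_measures.read_trec_run("runs/robust04-fused.txt")

results = ir_measures.calc_aggregate([AP, nDCG@10, P@10], qrels, run)
print(results)
```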
@inproceedings{sigirap24/data_fusion,
author = {Timo Breuer},
title = {Data Fusion of Synthetic Query Variants
With Generative Large Language Models},
booktitle = {Proceedings of the 2024 Annual International
ACM SIGIR Conference on Research and
Development in Information Retrieval in the
Asia Pacific Region, December 9--12, 2024,
Tokyo, Japan},
publisher = {{ACM}},
year = {2024},
url = {https://doi.org/10.1145/3673791.3698423},
doi = {10.1145/3673791.3698423}
}