Tools

Installation

pip install git+https://github.com/binshengliu/irtools.git

run_query_distributed.py

Run IndriRunQuery on a dask cluster.

Specify indri path, scheduler address, and Indri parameter files.

run_query_distributed.py --indri /path/to/IndriRunQuery --scheduler segsresap09:8786 **.param

If there are too many parameter files, pass them on stdin instead; otherwise the command line may exceed the shell's argument-length limit.

find . -name "*.param" | run_query_distributed.py --indri /path/to/IndriRunQuery --scheduler segsresap09:8786

Dask workers may fail if too many tasks are pending. A workaround is to throttle the task flow manually by submitting the parameter files in batches.

find . -name "*.param" | run_query_distributed.py --dry | split -l 24 - split_
for f in split_*; do
  run_query_distributed.py --indri /path/to/IndriRunQuery --scheduler segsresap09:8786 < "$f"
done
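The same batching idea can be sketched in Python. This is a minimal illustration and not part of irtools: `batched` is a hypothetical helper, the file names are made up, and the batch size of 24 simply mirrors the `split -l 24` call above.

```python
from typing import Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical parameter file names, for illustration only.
param_files = [f"q{i}.param" for i in range(50)]
batches = list(batched(param_files, 24))

# Each batch would be submitted only after the previous one has finished,
# so the scheduler never holds more than 24 pending tasks at a time.
for number, batch in enumerate(batches):
    print(f"batch {number}: {len(batch)} files")
```

The last batch may be shorter than the others; with 50 files and a batch size of 24 it holds the remaining 2.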

rm3.py

Generate parameter files for an RM3 parameter sweep.

rm3.py --param robust.param --run robust.run --index /index/ROB04 --docs 25,30 --terms 50,100 --origs 0.1,0.2 --output test --rerank
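To get a feel for how many settings a sweep like this covers, here is a small sketch. It assumes the sweep takes every combination of the comma-separated values, which is an assumption about `rm3.py` rather than something stated here; the variable names are illustrative only.

```python
from itertools import product

# Values copied from the example command above.
docs = [25, 30]
terms = [50, 100]
origs = [0.1, 0.2]

# One (docs, terms, original-query weight) tuple per sweep setting,
# assuming a full Cartesian product of the three lists.
grid = list(product(docs, terms, origs))

for fb_docs, fb_terms, orig_weight in grid:
    print(f"docs={fb_docs} terms={fb_terms} orig={orig_weight}")
```

Under that assumption, two values per option give 2 × 2 × 2 = 8 settings, so the grid grows quickly as more values are added.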

eval_run.py

Evaluate many run files on many metrics in one call. The evaluations run in parallel via multiprocessing, so it is fast.

eval_run.py --measure map,P@5,gdeval_ndcg@20 --sort gdeval_ndcg@20 cw09b.qrels a.run b.run c.run

ttest_runs.py

Run a t-test between two run files on multiple measures. It also uses multiprocessing.

ttest_runs.py --measure map,P,gdeval@20 cw09b.qrels a.run b.run
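For reference, the statistic behind such a comparison can be sketched with the standard library alone. This assumes a paired test over per-query scores, which is a common choice when comparing two runs on the same queries but is an assumption here; `ttest_runs.py` may use a different variant. `paired_t_statistic` and the scores are hypothetical.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the paired t statistic for per-query scores of two runs."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 queries")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_diff / (sd / math.sqrt(len(diffs)))


# Hypothetical per-query scores for the same three queries.
run_a = [0.5, 0.75, 1.0]
run_b = [0.25, 0.25, 0.25]
t = paired_t_statistic(run_a, run_b)
print(f"t = {t:.4f} with {len(run_a) - 1} degrees of freedom")
```

Turning the statistic into a p-value needs the t distribution, which the standard library does not provide; a statistics package would be the usual route.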

filter_spam.py

Filter run files by Waterloo spam score. It uses multiprocessing to load the huge spam score file. Usually the bottleneck is the read speed of the storage system.

usage: filter_spam.py [-h] --score [1-100] [--count COUNT]
                      [--output DIRECTORY] [--force]
                      SCORE-FILE RUN [RUN ...]

filter_spam.py --count 1000 --score 50 ClueWeb09B_Spam_Fusion.txt a.run b.run

filter_oracle_run.py

Filter a run file down to documents that are judged relevant in the qrels.

filter_oracle_run.py -n 5 -r 2 cw09b.qrels a.run > a.filtered.run

This command filters a.run for the first 5 documents with relevance >= 2. If fewer than 5 documents meet this criterion, it falls back to relevance 1 and then 0.
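The fallback behaviour described above can be sketched for a single query as follows. This is one possible reading, not the script's actual code: `oracle_filter` is a hypothetical name, the whole list is re-filtered at each lower threshold, and unjudged documents are assumed to be skipped.

```python
from typing import Dict, List


def oracle_filter(ranked: List[str], qrels: Dict[str, int],
                  n: int, min_rel: int) -> List[str]:
    """Keep up to `n` documents, relaxing the relevance threshold if needed."""
    chosen: List[str] = []
    for threshold in range(min_rel, -1, -1):
        # Unjudged documents get -1 here, so they never pass (assumption).
        chosen = [d for d in ranked if qrels.get(d, -1) >= threshold][:n]
        if len(chosen) >= n:
            break
    return chosen


# Hypothetical ranking and judgments for one query.
ranked = ["d1", "d2", "d3", "d4", "d5"]
qrels = {"d1": 0, "d2": 2, "d3": 1, "d4": 2}

strict = oracle_filter(ranked, qrels, n=2, min_rel=2)
relaxed = oracle_filter(ranked, qrels, n=3, min_rel=2)
```

With these inputs, `strict` contains only documents with relevance 2, while `relaxed` has to drop to relevance 1 to find a third document. Rank order from the run is preserved in both cases.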

fuse_linear.py

Linearly interpolate scores across multiple ranked lists. Also uses multiprocessing.

usage: fuse_linear.py [-h] (--weight WEIGHT | --sweep) RUN [RUN ...]
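As an illustration of linear interpolation for two runs, here is a minimal sketch. It is not the script's implementation: the function below is hypothetical, it handles a single query, and it assumes a document missing from one run contributes a score of 0 from that run.

```python
from typing import Dict, List, Tuple


def fuse_linear(run_a: Dict[str, float], run_b: Dict[str, float],
                weight: float) -> List[Tuple[str, float]]:
    """Return documents sorted by weight * a + (1 - weight) * b, best first."""
    fused = {}
    for doc in set(run_a) | set(run_b):
        # A document absent from a run contributes 0.0 from it (assumption).
        fused[doc] = (weight * run_a.get(doc, 0.0)
                      + (1.0 - weight) * run_b.get(doc, 0.0))
    # Sort by descending score, breaking ties by document id for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical scores for one query.
run_a = {"d1": 4.0, "d2": 2.0}
run_b = {"d2": 4.0, "d3": 1.0}
ranking = fuse_linear(run_a, run_b, weight=0.5)
```

A sweep would repeat this for a range of `weight` values and keep the one that scores best on the chosen measure. Scores from different systems are often on different scales, so normalizing them before interpolation is usually worth considering.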

each_server.sh

each_server.sh 'ps -ef | grep Indri | grep -v grep'

set_dask_worker_nofile.sh

set_dask_worker_nofile.sh