Solrini: Anserini Integration with Solr

This page documents code for replicating results from the following paper:

Ryan Clancy, Toke Eskildsen, Nick Ruest, and Jimmy Lin. Solr Integration in the Anserini Information Retrieval Toolkit. Proceedings of the 42nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019).

We provide instructions for setting up a single-node SolrCloud instance running locally and indexing into it from Anserini. Instructions for setting up SolrCloud clusters can be found by searching the web.

Setting up a Single-Node SolrCloud Instance

From the Solr archives, download the Solr (non-src) version that matches Anserini's Lucene version to the anserini/ directory.
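
For example, if your Anserini checkout is built against Lucene 8.x, a matching Solr release can be fetched as follows (the version shown here is illustrative; substitute the one that matches your build):

wget https://archive.apache.org/dist/lucene/solr/8.11.2/solr-8.11.2.tgz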

Extract the archive:

mkdir solrini && tar -zxvf solr*.tgz -C solrini --strip-components=1

Start Solr:

solrini/bin/solr start -c -m 8G

Adjust memory usage (i.e., -m 8G) as appropriate.

Run the Solr bootstrap script to copy the Anserini JAR into Solr's classpath and upload the configsets to Solr's internal ZooKeeper:

pushd src/main/resources/solr && ./solr.sh ../../../../solrini localhost:9983 && popd

Solr should now be available at http://localhost:8983/ for browsing.
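
You can also confirm from the command line that the instance is up:

solrini/bin/solr status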

The Solr index schema can also be modified using the Schema API. This is useful for specifying field types and other properties including multiValued fields.

Schema definitions for setting up specific Solr indexes can be found in the src/main/resources/solr/schemas/ folder.

To set the schema, we can make a request to the Schema API:

curl -X POST -H 'Content-type:application/json' --data-binary @src/main/resources/solr/schemas/SCHEMA_NAME.json http://localhost:8983/solr/COLLECTION_NAME/schema
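
As an illustration of what such a schema file contains, a minimal Schema API payload that adds a multiValued field might look like the following (the field name here is hypothetical; see the actual files under schemas/ for the real definitions):

{
  "add-field": {
    "name": "authors",
    "type": "string",
    "multiValued": true,
    "stored": true
  }
}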

Indexing into SolrCloud from Anserini

We can use Anserini as a common "frontend" for indexing into SolrCloud, thus supporting the same range of test collections that's already included in Anserini (when directly building local Lucene indexes). Indexing into Solr is similar to indexing to disk with Lucene, with a few added parameters. Most notably, we replace the -index parameter (which specifies the Lucene index path on disk) with Solr parameters. Alternatively, Solr can also be configured to read a prebuilt Lucene index, since Solr uses Lucene indexes under the hood.

We'll index robust04 as an example. First, create the robust04 collection in Solr:

solrini/bin/solr create -n anserini -c robust04

Run the Solr indexing command for robust04:

sh target/appassembler/bin/IndexCollection -collection TrecCollection -generator DefaultLuceneDocumentGenerator \
  -threads 8 -input /path/to/robust04 \
  -solr -solr.index robust04 -solr.zkUrl localhost:9983 \
  -storePositions -storeDocvectors -storeRaw

Make sure /path/to/robust04 is updated with the appropriate path.

Once indexing has completed, you should be able to query robust04 from the Solr query interface.
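
For example, a simple keyword query can be issued over HTTP (this assumes the anserini configset, which indexes document text into a contents field):

curl 'http://localhost:8983/solr/robust04/select?q=contents:hubble&rows=5'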

You can also run the following command to replicate Anserini BM25 retrieval:

sh target/appassembler/bin/SearchSolr -topicreader Trec \
  -solr.index robust04 -solr.zkUrl localhost:9983 \
  -topics src/main/resources/topics-and-qrels/topics.robust04.txt \
  -output run.solr.robust04.bm25.topics.robust04.txt

Evaluation can be performed using trec_eval:

eval/trec_eval.9.0.4/trec_eval -m map -m P.30 src/main/resources/topics-and-qrels/qrels.robust04.txt run.solr.robust04.bm25.topics.robust04.txt

These instructions can be straightforwardly adapted to work with the TREC Washington Post Corpus:

sh target/appassembler/bin/IndexCollection -collection WashingtonPostCollection -generator WapoGenerator \
   -threads 8 -input /path/to/WashingtonPost \
   -solr -solr.index core18 -solr.zkUrl localhost:9983 \
   -storePositions -storeDocvectors -storeContents

Make sure the core18 collection has been created and /path/to/WashingtonPost is updated with the appropriate path.

Solrini has also been verified to work with the MS MARCO Passage Retrieval Corpus. There should be no major issues with other collections that are supported by Anserini, but we have not tested them.
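
For reference, MS MARCO passage indexing follows the same pattern; here is a sketch, assuming the corpus has already been converted into Anserini's JSON collection format under /path/to/msmarco-passage and an msmarco-passage collection has been created:

sh target/appassembler/bin/IndexCollection -collection JsonCollection -generator DefaultLuceneDocumentGenerator \
  -threads 8 -input /path/to/msmarco-passage \
  -solr -solr.index msmarco-passage -solr.zkUrl localhost:9983 \
  -storePositions -storeDocvectors -storeRaw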

Solr with Prebuilt Lucene Index

Solr can be considered a frontend for Lucene, and it is entirely possible for Solr to read prebuilt Lucene indexes. To achieve this, some housekeeping is required. The following uses robust04 as an example, assuming your index files are stored under indexes/robust04/lucene-index.robust04.pos+docvectors+rawdocs/.

First, a Solr collection must be created to house the index. Here, we create a collection robust04 using the anserini configset:

solrini/bin/solr create -n anserini -c robust04

Along with the collection, Solr will create a core instance, whose name can be found in the Solr UI under collection overview. It might look something like <collection_name>_shard<id>_replica_<id> (e.g., robust04_shard1_replica_n1). Solr stores configurations and data for the core instances under Solr home, which for us is solrini/server/solr/ by default.
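
The core's instance directory can also be found by listing Solr home directly:

ls solrini/server/solr/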

Second, make any necessary Solr schema adjustments. Here, robust04 is a TREC collection whose schema is already taken care of by the managed-schema in the anserini configset. However, if you are dealing with a collection such as cord19, remember to adjust the Solr schema as previously described:

curl -X POST -H 'Content-type:application/json' --data-binary @src/main/resources/solr/schemas/SCHEMA_NAME.json http://localhost:8983/solr/COLLECTION_NAME/schema

Then, copy (or move) the index files to where Solr expects them. Solr stores a core's index data in a data directory under the core's instance directory (solrini/server/solr/<core-instance-directory>/data). You can simply copy your Lucene index files into data/index there, and Solr will pick them up:

cp indexes/robust04/lucene-index.robust04.pos+docvectors+rawdocs/* solrini/server/solr/robust04_shard1_replica_n1/data/index

Lastly, restart Solr to make sure the changes take effect:

solrini/bin/solr stop
solrini/bin/solr start -c -m 8G
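
To verify that Solr picked up the prebuilt index, a match-all query with rows=0 should report the expected document count in numFound:

curl 'http://localhost:8983/solr/robust04/select?q=*:*&rows=0'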

Solr integration test

We have an end-to-end integration testing script, run_solr_regression.py. Example usage for core18 is shown below:

# Check if Solr server is on
python src/main/python/run_solr_regression.py --ping

# Check if core18 exists
python src/main/python/run_solr_regression.py --check-index-exists core18

# Create core18 if it does not exist
python src/main/python/run_solr_regression.py --create-index core18

# Delete core18 if it exists
python src/main/python/run_solr_regression.py --delete-index core18

# Insert documents from /path/to/WashingtonPost into core18
python src/main/python/run_solr_regression.py --insert-docs core18 --input /path/to/WashingtonPost

# Search and evaluate on core18
python src/main/python/run_solr_regression.py --evaluate core18

To run end-to-end, issue the following command:

python src/main/python/run_solr_regression.py --regression core18 --input /path/to/WashingtonPost

The regression script has been verified to work for robust04, core18, and msmarco-passage.

Replication Log