Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#1010] Add Solr instructions for prebuilt Lucene index #1379

Conversation

adamyy
Copy link
Contributor

@adamyy adamyy commented Sep 11, 2020

#1010

  • Provides instructions about how to configure Solr to read prebuilt Lucene indexes

@adamyy adamyy changed the title Add Solr instructions for prebuilt Lucene index [#1010] Add Solr instructions for prebuilt Lucene index Sep 11, 2020
@adamyy adamyy force-pushed the adamyy/1010/add-solr-instructions-for-prebuilt-lucene-index branch from 8595798 to 5aaaa88 Compare September 11, 2020 16:22
@codecov
Copy link

codecov bot commented Sep 11, 2020

Codecov Report

Merging #1379 into master will decrease coverage by 0.18%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1379      +/-   ##
============================================
- Coverage     53.61%   53.42%   -0.19%     
+ Complexity      825      818       -7     
============================================
  Files           154      154              
  Lines          8627     8627              
  Branches       1224     1224              
============================================
- Hits           4625     4609      -16     
- Misses         3623     3641      +18     
+ Partials        379      377       -2     
Impacted Files Coverage Δ Complexity Δ
...java/io/anserini/ltr/feature/CountBigramPairs.java 64.93% <0.00%> (-22.08%) 23.00% <0.00%> (-8.00%)
...anserini/ltr/feature/base/PMIFeatureExtractor.java 86.53% <0.00%> (+1.92%) 13.00% <0.00%> (+1.00%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 748f949...d611d68. Read the comment docs.

Copy link
Member

@lintool lintool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments, thanks for the PR!

docs/solrini.md Outdated Show resolved Hide resolved
docs/solrini.md Outdated Show resolved Hide resolved
docs/solrini.md Outdated Show resolved Hide resolved
@adamyy
Copy link
Contributor Author

adamyy commented Sep 12, 2020

Actually, as I was testing the prebuilt index with run_solr_integration.py, I noticed that there are discrepancy between expected and actual trec_eval benchmark. Currently looking into the cause and this PR might not be ready to merge just yet!

@adamyy adamyy changed the title [#1010] Add Solr instructions for prebuilt Lucene index [Do not merge] Add Solr instructions for prebuilt Lucene index Sep 12, 2020
@adamyy
Copy link
Contributor Author

adamyy commented Sep 12, 2020

Looked a bit more into this, turns out my index was not built with the contents field (only id and raw) stored and SearchSolr uses contents as the default field of search. So SearchSolr either errors out / produce unexpected output.

I guess it is worth noting that the prebuilt index should have the contents field for this to work properly? When we are indexing into Solr using IndexCollection, it seems like the contents field are created regardless of whether we specify -storeContents or not (likely thanks to https://github.com/castorini/anserini/blob/master/src/main/resources/solr/anserini/conf/managed-schema#L201). However, if we are just producing a Lucene index then contents field only present if we specify -storeContents.

@lintool

@adamyy adamyy changed the title [Do not merge] Add Solr instructions for prebuilt Lucene index [#1010] Add Solr instructions for prebuilt Lucene index Sep 12, 2020
@lintool
Copy link
Member

lintool commented Sep 12, 2020

yup, this should be documented somewhere!

@lintool
Copy link
Member

lintool commented Sep 12, 2020

Also, check out: https://github.com/castorini/anserini/blob/master/.travis.yml#L20

It might be reasonable to do a complete e2e integration test based on this? I mean, travis is running it all the time... just a matter of moving it inside JUnit?

@adamyy
Copy link
Contributor Author

adamyy commented Sep 12, 2020

Actually, I have misunderstood the issue again. 😅

I read more about what stored means for a Lucene Document Field, and found this explanation. Turns out stored=false still makes the field searchable, it is just that the field is not stored as is, and will not be returned as part of the query result.

Then I was wondering, how come SearchCollection on non-storeContents MSMARCO index can produce the output that matches the regression, but SearchSolr cannot?

I noticed solrconfig.xml#L785 set the default field of search to a field called _text_ (this is the default Solr config). However, in our case we want it to be contents, and it is what we were trying to enforce at SearchSolr.java#L225. My hunch is that Solr internally picks the default field of search based on both query parameters and solrconfig.xml, and it might have picked the wrong one in my previous run.

I modified this field from _text_ to contents, recreated the Solr core with the updated configset and the non-storeContents index, then performed SearchSolr again. The good news is that the result now matches the regression. I will perform a couple more tests to ensure that this is what is causing the issue.

@adamyy
Copy link
Contributor Author

adamyy commented Sep 12, 2020

Also, check out: https://github.com/castorini/anserini/blob/master/.travis.yml#L20

It might be reasonable to do a complete e2e integration test based on this? I mean, travis is running it all the time... just a matter of moving it inside JUnit?

Do we want to do run_solr_regression.py in travis? @lintool

@adamyy
Copy link
Contributor Author

adamyy commented Sep 12, 2020

I modified this field from _text_ to contents, recreated the Solr core with the updated configset and the non-storeContents index, then performed SearchSolr again. The good news is that the result now matches the regression. I will perform a couple more tests to ensure that this is what is causing the issue.

Big oops, I tested my assumption again with a newly built index (without -storeContents flag) everything worked just fine. TL;DR I think my original index was broken. Sorry for causing the confusion! This PR should be ready now! @lintool

@lintool
Copy link
Member

lintool commented Sep 13, 2020

Do we want to do run_solr_regression.py in travis? @lintool

Maybe run_solr_regression.py with the small CACM collection in travis? Shouldn't be too difficult to do? If you think it's a good idea, create another issue to keep track of progress?

@lintool lintool merged commit bc2628b into castorini:master Sep 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants