store extracted-text at the work level #298

rococodogs · 2019-10-08T21:17:02Z

tl;dr

this moves where a work's extracted text is stored and, while opening up possibilities for enhancement and integration of some features, diverges from samvera common practices.

the headline change introduced here is that we're storing extracted-text (at the index level) in the work's record. this has the potential to introduce some unforeseen problems, as it breaks from the common samvera practice of indexing (and not storing) extracted-text at the file-set level.

some background

when ingesting a file into fedora, the underlying samvera framework ensures that, if available, text content is extracted and stored as a file alongside the asset. this task is defined within the hydra-derivatives gem (see Hydra::Derivatives::PersistBasicContainedOutputFileService). in the hydra-works gem, this content is accessible as part of the file_set model, which delegates the text to the attached files (see Hydra::Works::ContainedFiles). at index time, the extracted-text is indexed, but not stored (presumably to prevent duplication, as the content is already stored in fedora) (see Hyrax::FileSetIndexer).

how do we ensure that full-text content is searched when using 'all_fields'? hyrax uses a solr join query to attach file-sets to their works. this is defined in Hyrax::CatalogSearchBuilder, which moves the user-provided query to its own url parameter and replaces the q: parameter with the solr join. so, a search with the parameters:

{
  q: 'dance',
  search_field: 'all_fields'
}

is transformed to:

{
  q: '{!lucene}_query_:"{!dismax v=$user_query}" _query_:"{!join from=id to=file_set_ids_ssim}{!dismax v=$user_query}"',
  user_query: 'dance',
  search_field: 'all_fields',
  defType: 'lucene'
}
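to make the rewrite concrete, here's a minimal standalone sketch that reproduces the transformation shown above. it is not the hyrax implementation: `rewrite_all_fields_query`, `nested_query`, and `DISMAX_QUERY` are hypothetical names, and the real logic lives in Hyrax::CatalogSearchBuilder.

```ruby
# Hypothetical standalone helper that mirrors the parameter rewrite shown above.
# The real behavior is implemented inside Hyrax::CatalogSearchBuilder.
DISMAX_QUERY = '{!dismax v=$user_query}'

# Wraps a sub-query so it can be nested inside a lucene query.
def nested_query(value)
  %(_query_:"#{value}")
end

# Returns a new params hash. Only 'all_fields' searches with a non-blank
# query are rewritten; everything else is returned untouched.
def rewrite_all_fields_query(params)
  return params if params[:q].to_s.strip.empty? || params[:search_field] != 'all_fields'

  # Match works directly, or via a join from any of their file_sets.
  join = "{!join from=id to=file_set_ids_ssim}#{DISMAX_QUERY}"

  params.merge(
    q: "{!lucene}#{nested_query(DISMAX_QUERY)} #{nested_query(join)}",
    user_query: params[:q],
    defType: 'lucene'
  )
end
```

calling `rewrite_all_fields_query({ q: 'dance', search_field: 'all_fields' })` yields the second hash above. any other `search_field` passes through unchanged.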

this worked perfectly fine until we tried integrating the blacklight_advanced_search gem, which rewrites a query with better boolean and parenthesis support. this meant that, when searching 'all_fields' in the advanced mode, the join query would be replaced with the generated advanced query and file_sets would then be excluded from the search results. in a collection like our newspapers and/or magazines, where the metadata is quite limited, this meant that a search that previously yielded results would now return nothing, which isn't useful at all!

at the same time, we've had an outstanding request from stakeholders to highlight search-result terms within documents, as it was viewed as redundant (and annoying) to search for a term, then go into a search result, and have to search for that term within the content as well. this was partially solved early on by injecting the search term into the PDF.js viewer (see #14, 6f9fe7b in particular). however, solr (which is used for full-text extraction during indexing) and PDF.js (which generates a search index each time the file is loaded) differ in how they parse text, so an item can qualify as a match in the search while the PDF viewer fails to display the result in-text. this is particularly prominent in phrase queries, where the PDF.js search may not show results because a single space character was interpreted as multiple spaces.

we can help remediate this by displaying match-highlighting in the search results. this functionality is provided by solr and blacklight but is not readily available in hyrax because a) the content is only indexed and not stored, and b) a solr join will not return the matched content (as an SQL JOIN would) but instead returns only the parent document as a match. both obstacles go away when the content is stored at the work level.
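as a rough illustration of what work-level highlighting asks of solr, the sketch below adds standard solr highlighting parameters (`hl`, `hl.fl`, `hl.method`, `hl.snippets`) to a request and reads snippets back from the `highlighting` section of a response. the helper names are hypothetical and this is not how blacklight wires it up internally. the field name `extracted_text_tsimv` and the `fastVector` method come from the commits in this PR. `fastVector` assumes the field is indexed with term vectors, positions, and offsets.

```ruby
# Hypothetical helper: returns a copy of the solr params with highlighting
# enabled for a stored full-text field.
def with_fulltext_highlighting(solr_params, field: 'extracted_text_tsimv', snippets: 3)
  solr_params.merge(
    'hl' => true,                # turn highlighting on
    'hl.fl' => field,            # field to pull snippets from (must be stored)
    'hl.method' => 'fastVector', # assumes term vectors are enabled on the field
    'hl.snippets' => snippets    # maximum snippets returned per document
  )
end

# Hypothetical helper: pulls the highlighted snippets for one document out of
# a parsed solr response, or returns an empty array when there are none.
def highlights_for(response, id, field: 'extracted_text_tsimv')
  response.dig('highlighting', id, field) || []
end
```

because a join only returns the parent work, the `highlighting` section has nothing to report for text that lives on a file_set document. once the text is stored on the work itself, the snippets come back keyed by the work's id.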

so at this point, here is how storing the extracted text alongside the work (rather than the file_set) weighs up:

pros

better ability to integrate the blacklight_advanced_search gem into a hyrax app
ability to display solr highlighting hits in search results

cons

more disk space needed for the index (since we're storing full-text content)
breaks from samvera conventions

in case of emergencies

it should be noted that, even though we would be breaking from the samvera convention, we're only doing so within the solr index, which is seen as dynamic and not the source-of-truth for the repository. reverting the changes would require a patch, updating the specs, and running ActiveFedora::Base.reindex_everything.

an updated hyrax codebase may affect this negatively; certainly the move to valkyrie will require some updates. but, as best i could find, the only hyrax piece really affected is the search builder. nothing else seems to need to poke into the indexed full-text content.

as for the larger index size, the indexer behavior can be modified to add the text as _timv (indexed, not stored) instead of _tsimv (stored and indexed). this would prevent us from showing contextual search results, but would still allow us to use the blacklight_advanced_search gem.
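a minimal sketch of that switch, assuming the work's indexer already has the extracted text of each file_set in hand as an array of strings. `extracted_text_field` and `add_extracted_text` are hypothetical names, not the PR's indexer, and `extracted_text_timv` is an assumed field name derived from the suffix (only `extracted_text_tsimv` appears in the commits).

```ruby
# Hypothetical helper: picks the solr field name based on whether the
# full text should be stored (needed for highlighting) or only indexed.
def extracted_text_field(stored: true)
  stored ? 'extracted_text_tsimv' : 'extracted_text_timv'
end

# Hypothetical helper: adds the file_sets' extracted text to the work's
# solr document. Blank and missing values are skipped.
def add_extracted_text(solr_doc, file_set_texts, stored: true)
  texts = file_set_texts.compact.map(&:strip).reject(&:empty?)
  return solr_doc if texts.empty?

  solr_doc.merge(extracted_text_field(stored: stored) => texts)
end
```

flipping `stored:` to `false` keeps the text searchable on the work, so advanced search still benefits, while dropping the stored copy that highlighting depends on. existing documents would still need a reindex for the change to take effect.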

todo

add specs
expand this PR text

closes #145
closes #279

codeclimate · 2019-10-09T13:49:57Z

Code Climate has analyzed commit 8db4987 and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 100.0% (100% is the threshold).

This pull request will bring the total coverage in the repository to 98.3% (0.0% change).

View more on Code Climate.

rococodogs changed the base branch from master to develop October 9, 2019 12:30

rococodogs changed the title ~~display in-context highlighting on search results~~ store extracted-text at the work level Oct 9, 2019

rococodogs added the 👩‍🔬 experiment for PRs - messing around! label Oct 9, 2019

rococodogs mentioned this pull request Oct 11, 2019

advanced search updates #286

Closed

rococodogs added 7 commits October 11, 2019 14:56

store/index full-text content at the publication level

69e9da6

don't index full-text content at the file_set level

5ae12bb

use extracted_text_tsimv field for searches

768a4e1

display solr highlight hits for full-text on search results

a2f534f

add/update specs

dcb15a6

use fastVector for highlighting text

b1d566d

update labels for advanced-search form

c6f20e2

rococodogs force-pushed the experiment/full-text-at-work-level branch from cfefc4d to c6f20e2 Compare October 11, 2019 18:56

rococodogs added 2 commits October 11, 2019 15:02

revert d3bdec7 temporarily

473a563

this'll allow us to test this branch w/o having to reingest everything that's live on stage

hard-code labels for search fields for now

d125d82

rococodogs marked this pull request as ready for review October 15, 2019 21:05

rococodogs added 3 commits October 15, 2019 17:06

reinstate d3bdec7 (/ldr base for prod fedora)

8b20e82

remove all_text_timv solr copy field

79bda40

make displaying highlighted search results a toggleable feature

8b25e6e

rococodogs merged commit 02c7eec into develop Oct 16, 2019

rococodogs deleted the experiment/full-text-at-work-level branch October 16, 2019 12:27

rococodogs mentioned this pull request Oct 16, 2019

2019.1-pre.7 #293

Merged

rococodogs mentioned this pull request Jan 31, 2020

add full-text context display to search results #35

Closed

rococodogs mentioned this pull request Nov 30, 2021

remove toggle for contextual search results #771

Merged
