store extracted-text at the work level #298

Merged
merged 12 commits into from
Oct 16, 2019


rococodogs
Member

@rococodogs rococodogs commented Oct 8, 2019

tl;dr

this moves where a work's extracted text is stored and, while opening up possibilities for enhancement and integration of some features, diverges from samvera common practices.

the headline change introduced here is that we're storing extracted-text (at the index level) in the work's record. this has the potential to introduce some unforeseen problems, as it breaks from the common samvera practice of indexing (and not storing) extracted-text at the file-set level.

some background

when ingesting a file into fedora, the underlying samvera framework ensures that, if available, text content is extracted and stored as a file alongside the asset. this task is defined within the hydra-derivatives gem (see Hydra::Derivatives::PersistBasicContainedOutputFileService). in the hydra-works gem, this content is accessible as part of the file_set model, which delegates the text to the attached files (see Hydra::Works::ContainedFiles). at index time, the extracted-text is indexed, but not stored (presumably to prevent duplication, as the content is already stored in fedora) (see Hyrax::FileSetIndexer).

how do we ensure that full-text content is searched when using 'all_fields'? hyrax uses a solr join query to attach file-sets to their works. this is defined in Hyrax::CatalogSearchBuilder, which moves the user-provided query to its own url parameter and replaces the q: parameter with the solr join. so, a search with the parameters:

{
  q: 'dance',
  search_field: 'all_fields'
}

is transformed to:

{
  q: '{!lucene}_query_:"{!dismax v=$user_query}" _query_:"{!join from=id to=file_set_ids_ssim}{!dismax v=$user_query}"',
  user_query: 'dance',
  search_field: 'all_fields',
  defType: 'lucene'
}
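the rewrite above can be sketched in plain ruby. this is only an illustration under assumptions: `rewrite_all_fields_query` is a made-up name (not hyrax's actual method), and the `id` / `file_set_ids_ssim` field names are simply copied from the example above.

```ruby
# hypothetical helper, not hyrax's api: mimics the all_fields rewrite
# shown above using plain hashes.
def rewrite_all_fields_query(params)
  # leave anything other than a non-empty all_fields search untouched
  return params if params[:q].to_s.empty? || params[:search_field] != 'all_fields'

  dismax = '{!dismax v=$user_query}'
  join   = "{!join from=id to=file_set_ids_ssim}#{dismax}"

  params.merge(
    # match works directly OR works whose file-sets match
    q: %({!lucene}_query_:"#{dismax}" _query_:"#{join}"),
    user_query: params[:q],
    defType: 'lucene'
  )
end

rewritten = rewrite_all_fields_query({ q: 'dance', search_field: 'all_fields' })
puts rewritten[:q]
puts rewritten[:user_query]
```

the important bit is that the user's terms only survive in `user_query`; anything that later overwrites `q` wholesale also throws away the join.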

this worked perfectly fine until we tried integrating the blacklight_advanced_search gem, which rewrites a query with better boolean and parenthesis support. this meant that, when searching 'all_fields' in the advanced mode, the join query would be replaced with the generated advanced query and file_sets would then be excluded from the search results. in a collection like our newspapers and/or magazines, where the metadata is quite limited, this meant that a search that previously yielded results would now return nothing, which isn't useful at all!

at the same time, we've had an outstanding request from stakeholders to highlight search-result terms within documents, as it was viewed as redundant (and annoying) to search for a term, then go into a search result, and have to search for that term within the content as well. this was partially solved early on by injecting the search term into the PDF.js viewer (see #14, 6f9fe7b in particular). however, solr (which is used for full-text extraction during indexing) and PDF.js (which generates a search index each time the file is loaded) differ in how they parse text, so an item can qualify as a match in the search while the PDF viewer fails to show the match in-text. this is particularly prominent in phrase queries, where the PDF.js search may not show results because a single space character was interpreted as multiple spaces. we can help remedy this by displaying match-highlighting in the search results. this functionality is provided by solr and blacklight but not readily available in hyrax because a) the content is only indexed + not stored, and b) a solr join will not return the matched content (as an SQL JOIN would) but instead return only the parent document as a match. however, this is possible when storing the content at the work level.
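for reference, the request side of solr highlighting boils down to a handful of query parameters. a rough sketch, assuming the work's text ends up in a stored field named `all_text_tsimv` (the field name and the `with_highlighting` helper are assumptions for illustration, not code from this PR):

```ruby
# hypothetical helper: adds solr highlighting parameters to a query hash.
# `hl`, `hl.fl`, `hl.snippets`, and `hl.fragsize` are standard solr
# highlighting parameters; the default field name is an assumption.
def with_highlighting(solr_params, field: 'all_text_tsimv', snippets: 3)
  solr_params.merge(
    'hl' => true,
    'hl.fl' => field,          # needs to be a stored field for snippets to come back
    'hl.snippets' => snippets, # max number of fragments per document
    'hl.fragsize' => 100       # approximate fragment length, in characters
  )
end

params = with_highlighting({ 'q' => 'dance' })
puts params
```

this only works against a field that lives on the document being returned, which is why the join approach (where only the parent work comes back) can't provide snippets.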

[screenshot: Screen Shot 2019-10-08 at 5 00 33 PM]

so at this point, we have the following pros and cons for storing the extracted text alongside the work (rather than the file_set):

pros

  • better ability to integrate blacklight_advanced_search gem into a hyrax app
  • ability to display solr highlighting hits in search results

cons

  • more disk space needed for the index (since we're storing full-text content)
  • breaks from samvera conventions

in case of emergencies

it should be noted that, even though we would be breaking from the samvera convention, we're only doing so within the solr index, which is seen as dynamic and not the source-of-truth for the repository. reverting the changes would require a patch, updating the specs, and running ActiveFedora::Base.reindex_everything.

an updated hyrax codebase may affect this negatively; certainly the move to valkyrie will require some updates. but, as best i could find, the only hyrax piece really affected is the search builder. nothing else seems to need to poke into the indexed full-text content.

as far as a larger index size, the indexer behavior can be modified to add the text as _timv (indexed, not stored) instead of _tsimv (stored and indexed). this would prevent us from showing contextual search results, but would still allow us to use the blacklight_advanced_search gem.
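a minimal sketch of that switch, in plain ruby rather than the actual indexer: `work_full_text_fields` and the `all_text` base field name are assumptions for illustration; only the `_timv` / `_tsimv` suffixes come from the paragraph above.

```ruby
# hypothetical sketch: builds the full-text portion of a work's solr document
# from the extracted text of its file-sets.
def work_full_text_fields(file_set_texts, stored: true)
  # drop file-sets that had no extracted text
  text = file_set_texts.compact.reject(&:empty?)
  return {} if text.empty?

  # _tsimv: stored + indexed (highlighting possible, larger index)
  # _timv:  indexed only (searchable, no highlighting, smaller index)
  suffix = stored ? '_tsimv' : '_timv'
  { "all_text#{suffix}" => text }
end

p work_full_text_fields(['page one text', nil, 'page two text'])
p work_full_text_fields(['page one text'], stored: false)
```

either way a reindex would be needed after flipping the switch, since the field name itself changes.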


todo

  • add specs
  • expand this PR text

closes #145
closes #279

@rococodogs rococodogs changed the base branch from master to develop October 9, 2019 12:30
@rococodogs rococodogs changed the title from "display in-context highlighting on search results" to "store extracted-text at the work level" Oct 9, 2019
@codeclimate

codeclimate bot commented Oct 9, 2019

Code Climate has analyzed commit 8db4987 and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 100.0% (100% is the threshold).

This pull request will bring the total coverage in the repository to 98.3% (0.0% change).


@rococodogs rococodogs added the 👩‍🔬 experiment for PRs - messing around! label Oct 9, 2019
@rococodogs rococodogs force-pushed the experiment/full-text-at-work-level branch from cfefc4d to c6f20e2 Compare October 11, 2019 18:56
this'll allow us to test this branch w/o having to reingest
everything that's live on stage
@rococodogs rococodogs marked this pull request as ready for review October 15, 2019 21:05
@rococodogs rococodogs merged commit 02c7eec into develop Oct 16, 2019
@rococodogs rococodogs deleted the experiment/full-text-at-work-level branch October 16, 2019 12:27
@rococodogs rococodogs mentioned this pull request Oct 16, 2019