-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
store extracted-text at the work level #298
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
rococodogs
changed the title
display in-context highlighting on search results
store extracted-text at the work level
Oct 9, 2019
Code Climate has analyzed commit 8db4987 and detected 0 issues on this pull request. The test coverage on the diff in this pull request is 100.0% (100% is the threshold). This pull request will bring the total coverage in the repository to 98.3% (0.0% change). View more on Code Climate. |
Closed
rococodogs
force-pushed
the
experiment/full-text-at-work-level
branch
from
October 11, 2019 18:56
cfefc4d
to
c6f20e2
Compare
this'll allow us to test this branch w/o having to reingest everything that's live on stage
Merged
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
tl;dr
this moves where a work's extracted text is stored and, while opening up possibilities for enhancement and integration of some features, diverges from samvera common practices.
the headline change introduced here is that we're storing extracted-text (at the index level) in the work's record. this has the potential to introduce some unforseen problems, as it breaks from the common samvera practice of indexing (and not storing) extracted-text at the file-set
level.
some background
when ingesting a file into fedora, the underlying samvera framework ensures that, if available, text content is extracted and stored as a file alongside the asset. this task is defined within the
hydra-derivatives
gem (see Hydra::Derivatives::PersistBasicContainedOutputFileService). in the hydra-works gem, this content is accessible as part of the file_set model, which delegates the text to the attached files (see Hydra::Works::ContainedFiles). at index time, the extracted-text is indexed, but not stored (presumably to prevent duplication, as the content is already stored in fedora) (see Hyrax::FileSetIndexer).how do we ensure that full-text content is searched when using 'all_fields'? hyrax uses a solr join query to attach file-sets to their works. this is defined in the [Hyrax::CatalogSearchBuilder], which moves the user-provided query to its own url parameter and replaces the
q:
parameter with the solr join. so, a search with the parameters:is transformed to:
this worked perfectly fine until we tried integrating the blacklight_advanced_search gem, which rewrites a query with better boolean and parenthesis support. this meant that, when searching 'all_fields' in the advanced mode, the join query would be replaced with the generated advanced query and file_sets would then be excluded from the search results. in a collection like our newspapers and/or magazines, where the metadata is quite limited, this meant that a search that previously yielded results would now return nothing, which isn't useful at all!
at the same time, we've had an outstanding request from stakeholders to highlight search-result terms within documents, as it was viewed as redundant (and annoying) to search for a term, then go into a search result, and have to search for that term within the content as well. this was partially solved early-on by injecting the search term into the PDF.js viewer (see #14, 6f9fe7b in particular). however, solr (which is used for full-text extraction during indexing) and PDF.js (which generates a search index each time the file is loaded) differ in how they parse text, resulting in an item qualifying as a match in the search, but the PDF viewer not displaying the result in-text. this is particularly prominent in phrase queries, where the PDF.js search may not show results because a single space character was interpreted as multiple spaces. we can help remediate this by displaying match-highlighting in the search results. this functionality is provided by solr and blacklight but not readily available in hyrax because a) the content is only indexed + not stored, and b) a solr join will not return the matched content (as an SQL JOIN would) but instead return only the parent document as a match. however, this is possible when storing the content at the work level.
so at this point, we have the following points for storing the extracted text alongside the work (rather than the file_set):
pros
blacklight_advanced_search
gem into a hyrax appcons
in case of emergencies
it should be noted that, even though we would be breaking from the samvera convention, we're only doing so within the solr index, which is seen as dynamic and not the source-of-truth for the repository. reverting the changes would require a patch, updating the specs, and running
ActiveFedora::Base.reindex_everything
.an updated hyrax codebase may affect this negatively; certainly the move to valkyrie will require some updates. but, as best i could find, the only hyrax piece really affected is the search builder. nothing else seems to need to poke into the indexed full-text content.
as far as a larger index size, the indexer behavior can be modified to add the text as
_timv
(indexed, not stored) instead of_tsimv
(stored and indexed). this would prevent us from showing contextual search results, but would still allow us to use theblacklight_advanced_search
gem.todo
closes #145
closes #279