ESSI OCR #8
Comments
Blocked: see the note in the acceptance criteria. We need Randy to reindex a few works so we can test that this works on their old data.
Unblocked: Kirk will test this locally by importing works on main (to replicate old data), and then switching over to his changes to test that it still works.
Consider creating a ticket for the following issue:
EDIT: This has since been resolved.
Notes from client iteration meeting 6/2/2023 re: https://assaydepot.slack.com/archives/C04CRCV3QNT/p1685719193099139

As discussed in the meeting, @jlhardes please consider moving the manual complex object creation issue you discovered into phase 2, since not re-indexing the file set is default Hyrax behavior.

Reported Problem: When the parent/child relationships were created manually, a search for a file set term succeeded from the child record but not from the parent record. This is because the file set record did not know about its grandparent; it knew about its parent only. We discovered this by visiting the Solr dashboard with Daniel. Reindexing the file set by re-saving the file set record fixed this (it took about a minute to resolve after save).

Next Steps: Consider replicating this relationship via a Bulkrax import. See @kirkkwang for help formatting the CSV correctly. However, if this doesn't work, we will be upgrading Bulkrax in phase 2, so it wouldn't be worth digging into at this time.

Possible Phase 2 solutions: Create a callback so that the file set reindex happens on save.
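The reported problem can be illustrated with a small plain-Ruby sketch. It does not use Hyrax or Solr: `Record`, `ancestor_ids`, `index_document`, and the field names are hypothetical stand-ins. It assumes the file set's index document captures its ancestors at indexing time, so a document built before the child work is attached to the parent stays stale until the file set is reindexed.

```ruby
# Hypothetical in-memory stand-ins for works and file sets (not Hyrax classes).
Record = Struct.new(:id, :parent, keyword_init: true)

# Walk up the parent chain and collect every ancestor id, nearest first.
def ancestor_ids(record)
  ids = []
  current = record.parent
  while current
    ids << current.id
    current = current.parent
  end
  ids
end

# Build a Solr-like document for a file set.
# The field names "text" and "ancestor_ids" are made up for this sketch.
def index_document(file_set, text)
  {
    "id" => file_set.id,
    "text" => text,
    "ancestor_ids" => ancestor_ids(file_set)
  }
end

parent_work = Record.new(id: "parent-work")
child_work  = Record.new(id: "child-work")
file_set    = Record.new(id: "file-set", parent: child_work)

# Indexed before the child work is attached to the parent work:
stale = index_document(file_set, "sample OCR text")

# The relationship is created manually afterwards...
child_work.parent = parent_work

# ...and only a reindex (here, rebuilding the document) picks it up.
fresh = index_document(file_set, "sample OCR text")

puts stale["ancestor_ids"].inspect
puts fresh["ancestor_ids"].inspect
```

In this sketch the stale document lists only the child work, which mirrors why the search worked from the child record but not from the parent until the file set was re-saved.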
The work for this issue is complete: we cannot see the OCR derivative in the drop-down menu on the Item List of a Work, and we are able to use search within on both old works that were present before iiif_print was added and newly created works. Importing with parent/child relationships defined in the CSV works to create one work with child works, but file set reindexing is still required for search within to function on those hierarchical works.
Summary
ESSI already has OCR processing. It is also wired to a button on the file set show page that regenerates OCR on the fly.
After the dev review meeting, we've decided to keep ESSI's OCR implementation for now. ref: https://docs.google.com/document/d/16s4x6rWcxK5Npwf9OmjUK5EAYssit_TeAHiTYzlMAk0/edit#
Recommendation (kudos to Kirk/Jeremy): https://docs.google.com/document/d/1orq5ZIqKrJTr7Tt0EA1tYdb2vA1cnehkc7Z2cdCqkMI/edit#heading=h.9b0hmkp0u5zm
However, we may need to add or make changes to their indexing so that they can fully utilize the IiifPrint search functionalities.
Thoughts:
Acceptance Criteria
Testing Instructions
Create a work with a fileset that has text. sample file:
Find an old work that has OCR.
Review the acceptance criteria
Notes
Currently we've commented out their process: https://github.com/scientist-softserv/essi/blob/main/lib/extensions/extensions.rb#L102-L104
Let's turn that back on, remove iiif_print text extraction, and make sure their OCR indexes work with our features for UV/catalog search.