ESSI OCR #8

Closed
2 tasks
ShanaLMoore opened this issue Mar 24, 2023 · 6 comments

ShanaLMoore commented Mar 24, 2023

Summary

ESSI already has OCR processing. It is also wired to a button on the file set show page that regenerates OCR on the fly.

After the dev review meeting, we've decided to keep ESSI's OCR implementation for now. ref: https://docs.google.com/document/d/16s4x6rWcxK5Npwf9OmjUK5EAYssit_TeAHiTYzlMAk0/edit#

Recommendation (kudos to Kirk/Jeremy): https://docs.google.com/document/d/1orq5ZIqKrJTr7Tt0EA1tYdb2vA1cnehkc7Z2cdCqkMI/edit#heading=h.9b0hmkp0u5zm

However, we may need to add or make changes to their indexing so that they can fully utilize the IiifPrint search functionalities.

Thoughts:

  1. Should we keep both index fields (older works would have ocr_text_tesi AND all_text_tsimv)? Or is that unnecessary since they store the values elsewhere? We'd need some process for extracting the string from their ocr_text_tesi value to populate all_text_tsimv.
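
A minimal sketch of the extraction step mentioned above, assuming the index document can be treated as a plain dictionary. Only the field names ocr_text_tesi and all_text_tsimv come from this issue; the function name and the sample document are hypothetical, and this is not ESSI's or IiifPrint's actual indexer.

```python
from typing import Any, Dict, List

OCR_FIELD = "ocr_text_tesi"        # single-valued field named in this issue
ALL_TEXT_FIELD = "all_text_tsimv"  # multi-valued field named in this issue


def with_all_text(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of doc whose all_text_tsimv includes the ocr_text_tesi string.

    The original ocr_text_tesi value is left in place, so both fields coexist.
    """
    updated = dict(doc)
    ocr_text = doc.get(OCR_FIELD)
    if not ocr_text:
        # Nothing to copy; return the document unchanged.
        return updated
    values: List[str] = list(doc.get(ALL_TEXT_FIELD, []))
    if ocr_text not in values:
        values.append(ocr_text)
    updated[ALL_TEXT_FIELD] = values
    return updated


# Hypothetical "old" document that only carries ESSI's OCR field.
old_doc = {"id": "fileset-1", "ocr_text_tesi": "sample page text"}
new_doc = with_all_text(old_doc)
```

Keeping ocr_text_tesi untouched is a deliberate choice in this sketch: it leaves the question of whether to retain both fields open.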

Acceptance Criteria

  • IiifPrint's UV and catalog search functionality (ability to search for child metadata/ocr from the parent) should work with ESSI's new and old data. 🛑 Old data will require a reindex. Identify an old work that has ESSI ocr on it. Get a dev to ask Randy to reindex just that one for testing purposes.
  • IiifPrint should use ESSI's OCR processes.
    • A user will not be able to see OCR derivatives in the action drop down of a fileset (on a work's show page).

Testing Instructions

Create a work with a file set that has text. Sample file: (image attachment)

Find an old work that has OCR.

Review the acceptance criteria

Notes

Currently we've commented out their process: https://github.com/scientist-softserv/essi/blob/main/lib/extensions/extensions.rb#L102-L104

Let's turn that back on, remove iiif_print text extraction, and make sure their ocr indexes work with our features for uv/catalog search.

@ShanaLMoore ShanaLMoore added this to essi Apr 7, 2023
@jillpe jillpe moved this to Ready for Development in essi Apr 14, 2023
@ShanaLMoore ShanaLMoore changed the title Spike: ESSI OCR ESSI OCR Apr 21, 2023
@kirkkwang kirkkwang self-assigned this Apr 25, 2023
@jillpe jillpe moved this from Ready for Development to Deploy to Staging in essi May 3, 2023
@ShanaLMoore

blocked - see the note in the acceptance criteria. We need to get Randy to reindex a few works in order to test that this works on their old data.

ShanaLMoore commented May 10, 2023

unblocked - Kirk will test this locally by importing works on main (to replicate old data), and then switching over to his changes to test that it still works.

ShanaLMoore commented May 10, 2023

QA RESULTS

Kirk replicated old data by switching to main and creating works that run their OCR.

BEFORE (on main):

image
NOTE: No all_text_tsimv or is_page_of_ssim fields

image

NOTE: searching child OCR from the parent UV does not return a result
image

AFTER (on test-iiif_print with a reindex)

image
NOTE: The all_text_tsimv and is_page_of_ssim fields now exist
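
The before/after comparison above can be expressed as a small check. This is only a sketch, assuming the index documents are available as dictionaries (for example, copied from the Solr dashboard); the helper name and the sample values are hypothetical, and only the two field names come from the notes above.

```python
from typing import Any, Dict, List

# Fields the QA notes above look for after the reindex.
REQUIRED_FIELDS = ("all_text_tsimv", "is_page_of_ssim")


def missing_search_fields(doc: Dict[str, Any]) -> List[str]:
    """List the required fields that are absent or empty in an index document."""
    return [field for field in REQUIRED_FIELDS if not doc.get(field)]


# Hypothetical documents standing in for the BEFORE and AFTER screenshots.
before_doc = {"id": "fileset-1", "ocr_text_tesi": "sample page text"}
after_doc = {
    "id": "fileset-1",
    "ocr_text_tesi": "sample page text",
    "all_text_tsimv": ["sample page text"],
    "is_page_of_ssim": ["work-1"],
}
```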

ShanaLMoore commented May 10, 2023

Consider creating a ticket for the following issue:

  1. User is unable to search for multiple words. Was this working before, on their main/old data?

EDIT: This has since been resolved

@jillpe jillpe moved this from Deploy to Staging to SoftServ QA in essi May 10, 2023
@jillpe jillpe moved this from SoftServ QA to Client QA in essi May 22, 2023
ShanaLMoore commented Jun 2, 2023

Notes from client iteration meeting 6/2/2023

re: https://assaydepot.slack.com/archives/C04CRCV3QNT/p1685719193099139

As discussed in the meeting, @jlhardes please consider moving the manual complex object creation issue you discovered into phase 2, since not reindexing the file set is default Hyrax behavior.

Reported Problem:

When the parent/child relationships were created manually, the user was able to search for a file set term from the child record, but the same search from the parent record did not find the term. This is because the file set's index record did not know about its grandparent; it knew only about its parent. We discovered this by inspecting the Solr dashboard with Daniel.

Reindexing the file set by re-saving the file set record fixed this (it took about a minute to resolve after the save).
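
The problem can be illustrated with a small model. This is a sketch under stated assumptions, not the actual Hyrax or IiifPrint indexing code: it assumes that is_page_of_ssim holds the ids of the records a file set can be found from, and that a search from a work only matches file sets listing that work's id. The ids, the `parent_of` map, and both function names are hypothetical.

```python
from typing import Any, Dict, List


def ancestor_ids(record_id: str, parent_of: Dict[str, str]) -> List[str]:
    """Walk up the parent chain and return every ancestor id, nearest first."""
    ancestors: List[str] = []
    seen = {record_id}
    current = parent_of.get(record_id)
    while current is not None and current not in seen:  # guard against cycles
        ancestors.append(current)
        seen.add(current)
        current = parent_of.get(current)
    return ancestors


def found_from(work_id: str, file_set_doc: Dict[str, Any], term: str) -> bool:
    """True if searching from work_id would match term in this file set's text."""
    in_scope = work_id in file_set_doc.get("is_page_of_ssim", [])
    has_term = any(
        term.lower() in text.lower()
        for text in file_set_doc.get("all_text_tsimv", [])
    )
    return in_scope and has_term


# Hypothetical hierarchy: parent work -> child work -> file set.
parent_of = {"fileset-1": "child-1", "child-1": "parent-1"}

# Stale index record: it was written when only the direct parent was known.
stale_doc = {"is_page_of_ssim": ["child-1"], "all_text_tsimv": ["Sample OCR text"]}

# Reindexed record: the ancestor list now includes the grandparent.
fresh_doc = dict(stale_doc, is_page_of_ssim=ancestor_ids("fileset-1", parent_of))
```

With the stale record, `found_from("child-1", stale_doc, "ocr")` is true while `found_from("parent-1", stale_doc, "ocr")` is false, which mirrors the reported behavior; after the rebuild, the search from the parent matches as well.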

Next Steps:

Consider replicating this relationship via a Bulkrax import. See @kirkkwang for help formatting the CSV correctly. However, if this doesn't work, we will be upgrading Bulkrax in phase 2, so it wouldn't be worth digging into at this time.

Possible Phase 2 solutions:

Create a callback so that the file set is reindexed on save.
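
The callback idea could look roughly like the following. This is a language-neutral sketch written in Python, not Hyrax's callback API: the `Work` class, `after_save`, and `make_reindexer` are all hypothetical names, and persistence is omitted. It assumes the save in question is the save of the work whose relationships changed.

```python
from typing import Callable, List


class Work:
    """Hypothetical stand-in for a work record that owns file sets."""

    def __init__(self, work_id: str, file_set_ids: List[str]) -> None:
        self.id = work_id
        self.file_set_ids = list(file_set_ids)
        self._after_save: List[Callable[["Work"], None]] = []

    def after_save(self, callback: Callable[["Work"], None]) -> None:
        """Register a function to run every time the work is saved."""
        self._after_save.append(callback)

    def save(self) -> None:
        # Persistence is omitted; only the callback dispatch is modeled.
        for callback in self._after_save:
            callback(self)


def make_reindexer(queue: List[str]) -> Callable[[Work], None]:
    """Build a callback that queues each of the work's file sets for reindexing."""

    def reindex_file_sets(work: Work) -> None:
        for file_set_id in work.file_set_ids:
            if file_set_id not in queue:
                queue.append(file_set_id)

    return reindex_file_sets


reindex_queue: List[str] = []
work = Work("child-1", ["fileset-1", "fileset-2"])
work.after_save(make_reindexer(reindex_queue))
work.save()
```

Queuing ids rather than reindexing inline is one possible design choice here, since the manual re-save reportedly took about a minute to take effect.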

jlhardes commented Jun 6, 2023

The work for this issue is complete: we cannot see the OCR derivative in the drop-down menu on the Item List of a Work, and we are able to use search within both on old works that were present before iiif_print was added and on newly created works. Importing with parent/child relationships defined in the CSV does create one work with child works, but file set reindexing is still required for search within to function on those hierarchical works.

@jillpe jillpe closed this as completed Jun 6, 2023
@github-project-automation github-project-automation bot moved this from Client QA to Done in essi Jun 6, 2023