ESSI OCR #8

ShanaLMoore · 2023-03-24T17:45:45Z

Summary

ESSI already has OCR processing. It is also wired to a button on the file set show page that regenerates OCR on the fly.

After the dev review meeting, we've decided to keep ESSI's OCR implementation for now. ref: https://docs.google.com/document/d/16s4x6rWcxK5Npwf9OmjUK5EAYssit_TeAHiTYzlMAk0/edit#

recommendation (kudos to kirk/jeremy): https://docs.google.com/document/d/1orq5ZIqKrJTr7Tt0EA1tYdb2vA1cnehkc7Z2cdCqkMI/edit#heading=h.9b0hmkp0u5zm

However, we may need to add or make changes to their indexing so that they can fully utilize the IiifPrint search functionalities.

Thoughts:

Should we keep both indexes (so that older works have ocr_text_tesi AND all_text_tsimv)? Or is that unnecessary, since they store the values elsewhere? Either way, we'd need some process for extracting the string in their ocr_text_tesi value to populate all_text_tsimv.
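A minimal sketch of what that extraction step could look like, assuming the Solr document is available as a plain Ruby Hash. The helper name `backfill_all_text` is hypothetical and is not part of ESSI or IiifPrint; only the two field names come from this issue.

```ruby
# Hypothetical helper (not ESSI or IiifPrint code): copy the single-valued
# ocr_text_tesi string into the multi-valued all_text_tsimv field.
# The Solr document is modelled as a plain Hash for illustration.
def backfill_all_text(solr_doc)
  ocr = solr_doc['ocr_text_tesi'].to_s

  # Nothing to copy when the work has no OCR text.
  return solr_doc if ocr.strip.empty?

  # Leave documents alone when all_text_tsimv is already populated.
  return solr_doc if solr_doc.key?('all_text_tsimv')

  # Return a new Hash; the original document is not mutated.
  solr_doc.merge('all_text_tsimv' => [ocr])
end

old_doc = { 'id' => 'fileset-1', 'ocr_text_tesi' => 'sample page text' }
new_doc = backfill_all_text(old_doc)
```

Keeping ocr_text_tesi in place means both indexes coexist on older works, which is the first option raised above.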

Acceptance Criteria

IiifPrint's UV and catalog search functionality (ability to search for child metadata/ocr from the parent) should work with ESSI's new and old data. 🛑 Old data will require a reindex. Identify an old work that has ESSI ocr on it. Get a dev to ask Randy to reindex just that one for testing purposes.
IiifPrint should use ESSI's OCR processes.
- A user will not be able to see OCR derivatives in the action drop down of a fileset (on a work's show page).

Testing Instructions

Create a work with a fileset that has text. sample file:

Find an old work that has OCR.

Review the acceptance criteria

Notes

Currently we've commented out their process: https://github.com/scientist-softserv/essi/blob/main/lib/extensions/extensions.rb#L102-L104

Let's turn that back on, remove iiif_print text extraction, and make sure their ocr indexes work with our features for uv/catalog search.

ShanaLMoore · 2023-05-10T14:47:43Z

blocked - see the note in the acceptance criteria. We need to get Randy to reindex a few works in order to test that this works on their old data.

ShanaLMoore · 2023-05-10T16:31:32Z

unblocked - Kirk will test this locally by importing works on main (to replicate old data), and then switching over to his changes to test that it still works.

ShanaLMoore · 2023-05-10T17:54:21Z

QA RESULTS

Kirk replicated old data by switching to main and creating works that run their OCR.

BEFORE (on `main`):

NOTE: No all_text_tsimv and is_page_of_ssim fields

NOTE: searching child OCR from the parent UV does not return a result

AFTER (on `test-iiif_print` with a reindex)

NOTE: The `all_text_tsimv` and `is_page_of_ssim` fields now exist

ShanaLMoore · 2023-05-10T18:38:33Z

Consider creating a ticket for the following issue:

User is unable to search for multiple words. Was this working before, on their main/old data?

EDIT: This has since been resolved

ShanaLMoore · 2023-06-02T17:35:18Z

Notes from client iteration meeting 6/2/2023

re: https://assaydepot.slack.com/archives/C04CRCV3QNT/p1685719193099139

As discussed in the meeting, @jlhardes please consider moving the manual complex object creation issue you discovered into phase 2, since not reindexing the file set is default Hyrax behavior.

Reported Problem:

When the parent/child relationships were created manually, the user was able to search for a file set term from the child record, but the same search from the parent record did not find the term. This is because the file set record did not know about its grandparent; it knew about its parent only. We discovered this by visiting the Solr dashboard with Daniel.

Reindexing the file set by re-saving the file set record fixed this (it took about a minute to resolve after save).
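The behaviour above can be illustrated with a small in-memory sketch. It assumes the file set's `is_page_of_ssim` field is what the parent-level search filters on, and that a reindex fills it with every ancestor id; both are assumptions drawn from this thread, not a description of the actual indexer. The ids and the helpers `ancestor_ids` and `pages_of` are hypothetical.

```ruby
# Walk up the parent chain and return every ancestor id, nearest first.
# parent_of is a Hash of record id => direct parent id (hypothetical data).
def ancestor_ids(id, parent_of)
  ids = []
  seen = { id => true } # guards against accidental cycles
  current = parent_of[id]
  while current && !seen[current]
    ids << current
    seen[current] = true
    current = parent_of[current]
  end
  ids
end

# Stand-in for a Solr filter on is_page_of_ssim for a given work id.
def pages_of(work_id, docs)
  docs.select { |doc| Array(doc['is_page_of_ssim']).include?(work_id) }
end

parent_of = {
  'fileset-1' => 'child-work-1',
  'child-work-1' => 'parent-work-1'
}

# Before the re-save: the file set only knows about its direct parent.
stale = { 'id' => 'fileset-1', 'is_page_of_ssim' => ['child-work-1'] }

# After the re-save (assumed behaviour): the grandparent is indexed as well.
fresh = stale.merge('is_page_of_ssim' => ancestor_ids('fileset-1', parent_of))

puts pages_of('parent-work-1', [stale]).length
puts pages_of('parent-work-1', [fresh]).length
```

With the stale document, a search scoped to the parent work matches nothing; once the ancestor list includes the grandparent, the same search finds the file set.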

Next Steps:

Consider replicating this relationship via a Bulkrax import; see @kirkkwang for help formatting the CSV correctly. If this doesn't work, it isn't worth digging into at this time, since we will be upgrading Bulkrax in phase 2.

Possible Phase 2 solutions:

Create a callback so that the file set reindex happens on save.

jlhardes · 2023-06-06T19:12:27Z

The work for this issue is complete: we cannot see the OCR derivative in the drop-down menu on the Item List of a Work, and we are able to use search within on both old works that were present before iiif_print was added and newly created works. Importing with parent/child relationships defined in the CSV works to create one work with child works, but file set reindexing is still required for search within to function on those hierarchical works.

ShanaLMoore added this to essi Apr 7, 2023

jillpe moved this to Ready for Development in essi Apr 14, 2023

ShanaLMoore changed the title ~~Spike: ESSI OCR~~ ESSI OCR Apr 21, 2023

kirkkwang self-assigned this Apr 25, 2023

This was referenced Apr 27, 2023

Add configuration for all_text indexing notch8/iiif_print#228

Merged

Ocr implementation IU-Libraries-Joint-Development/essi#537

Merged

jillpe moved this from Ready for Development to Deploy to Staging in essi May 3, 2023

ShanaLMoore added the Blocked label May 10, 2023

ShanaLMoore removed the Blocked label May 10, 2023

jillpe moved this from Deploy to Staging to SoftServ QA in essi May 10, 2023

jillpe moved this from SoftServ QA to Client QA in essi May 22, 2023

jillpe closed this as completed Jun 6, 2023

github-project-automation bot moved this from Client QA to Done in essi Jun 6, 2023
