Spike: OCR search not working #428

Closed
3 tasks
Tracked by #566
ShanaLMoore opened this issue Mar 21, 2023 · 11 comments

@ShanaLMoore

ShanaLMoore commented Mar 21, 2023

Story

Per Katharine,

"I searched for a word in the UV and got no results despite seeing it on the page. This work has been split into pages. Can you determine if something is wrong with the OCR search in the viewer, or are we still waiting on the work to process?"

Acceptance Criteria

  • Can we duplicate the issue?
  • Can we define the resolution for the specific work that isn't correct?

Screenshots / Video

link to example: https://adl.s2.adventistdigitallibrary.org/concern/published_works/20214381_an_appeal_to_the_youth_funeral_address_of_henry_n_white?locale=en

generated ocr: 8b-txt.txt

Search for "Youth" returned no results:

image

When looking at this solr doc in rancher, I'm not seeing all_text_tsimv

image
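
One way to confirm whether the field was indexed is to ask Solr for only that field on the document in question. The following is a minimal sketch, not the project's tooling: the Solr base URL and document id are placeholders, and the only name taken from this thread is `all_text_tsimv`.

```python
from urllib.parse import urlencode

OCR_FIELD = "all_text_tsimv"  # field name taken from the comment above


def build_field_check_url(solr_base_url, doc_id):
    """Return a Solr select URL that fetches only the id and the OCR field."""
    params = {
        "q": f'id:"{doc_id}"',
        "fl": f"id,{OCR_FIELD}",
        "wt": "json",
    }
    return f"{solr_base_url.rstrip('/')}/select?{urlencode(params)}"


def has_ocr_text(solr_doc):
    """True when the Solr document carries at least one non-blank OCR value."""
    values = solr_doc.get(OCR_FIELD) or []
    if isinstance(values, str):
        values = [values]
    return any(value.strip() for value in values)
```

For example, `build_field_check_url("http://localhost:8983/solr/my-core", "abc123")` (both arguments are hypothetical) yields a URL to open in a browser or fetch with an HTTP client; passing a document from the response to `has_ocr_text` shows whether the OCR text reached the index.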

Confirmed that this work type has ocr turned on: https://github.com/scientist-softserv/adventist-dl/blob/main/app/models/published_work.rb#L31

Testing Instructions and Sample Files

  • Create a work via Bulkrax or manually.
  • Wait for Sidekiq to complete and verify that OCR gets generated.
    • You can see the words it found by clicking on the action dropdown and selecting "download as txt".

image

  • A user should be able to search the UV and the catalog for the generated OCR words.

Notes

This issue was reported before sidekiq completed its jobs, so it may be a timing issue.

image
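
To rule out timing during QA, one option is to poll until the OCR text is searchable instead of checking once. This is a generic sketch under that assumption; `wait_until` and its parameters are illustrative names, and the `check` callable stands in for whatever test is used (for example, a Solr lookup for the OCR field).

```python
import time


def wait_until(check, timeout_s=600.0, interval_s=10.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() repeatedly until it returns True or the timeout elapses.

    Returns True if check() succeeded, False if the deadline passed first.
    sleep and clock are injectable so the helper can be exercised without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

If the check still fails once the background jobs have finished, timing is unlikely to be the cause.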

@ShanaLMoore
Author

ShanaLMoore commented Apr 3, 2023

TODO:

validate if this is still an issue locally.

do we need to remediate the current data? (we should get guidance from Katharine on Tuesday) cc @jillpe hopefully we can confirm that it's valid by then.

@jillpe jillpe changed the title OCR search not working Spike: OCR search not working Apr 3, 2023
@ShanaLMoore ShanaLMoore assigned ShanaLMoore and laritakr and unassigned laritakr Apr 3, 2023
@kirkkwang
Contributor

OCR seems to be working, though I tried the sample data and it's very inconsistent.

Very poorly OCR'd example (text and image):
08-txt.txt
a85818dc-08ae-4b8a-bbab-8fcb43004255-page3

Good OCR example (text and image):
cc-txt.txt
276df1c8-8628-462f-8f16-1afce89851e7-page1

Also, searching from the parent doesn't seem to be working:
image

image

jeremyf referenced this issue in notch8/iiif_print Apr 5, 2023
Prior to this commit, we assumed that a work's `id` was the mechanism
for always identifying a work in regards to its ancestry/lineage.

However, for implementations that incorporate slug behavior, that reality
is not always the case.

With this commit, we expose a configurable mechanism for altering the
ancestry functionality.

Related to:

- notch8/adventist-dl#354
- https://github.com/scientist-softserv/adventist-dl/issues/319
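
The idea in the commit message above can be illustrated with a small, language-neutral sketch. It is not the gem's actual API: `Work`, `AncestryConfig`, `identifier_for`, and `ancestry_identifiers` are hypothetical names used only to show a default of `id` that an application with slugs can override.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Work:
    """Hypothetical stand-in for a work record."""
    id: str
    slug: Optional[str] = None


class AncestryConfig:
    """Holds the callable used to identify a work within its lineage."""

    def __init__(self, identifier_for=None):
        # Default behavior: identify a work by its id.
        self.identifier_for = identifier_for or (lambda work: work.id)


def ancestry_identifiers(ancestors, config):
    """Return the configured identifier for each ancestor, in order."""
    return [config.identifier_for(work) for work in ancestors]


# An application that uses slugs could prefer the slug and fall back to the id.
slug_config = AncestryConfig(identifier_for=lambda work: work.slug or work.id)
```

The point of the pattern is that callers never hard-code `id`; they ask the configuration, so slug-based applications and id-based applications share the same lookup code.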
@ShanaLMoore ShanaLMoore added blocked other work must be completed first and removed blocked other work must be completed first labels Apr 5, 2023
@ShanaLMoore
Author

blocked because we may need to reindex staging

@ShanaLMoore
Author

ShanaLMoore commented Apr 6, 2023

OLD DATA WILL NOT WORK until staging's reindexing succeeds. Until then, please only QA this using new data on the sdapi tenant.

cc @KatharineV @DiemBTran

@ShanaLMoore ShanaLMoore removed the blocked other work must be completed first label Apr 6, 2023
@ShanaLMoore
Author

ShanaLMoore commented Apr 6, 2023

QA: Pass ✅

On sdapi I searched for the word "thesis" from the parent work, and it returned matches:

OCR Search

searched for the word 'thesis'

UV

Image

CATALOG

only the parent returned

Image

METADATA SEARCH

search for identifier 20121820

UV

Image

CATALOG

only returned parent search

Image

@KatharineV
Collaborator

I'm testing this with SDAPI staging and a work I uploaded yesterday.

From the parent work, I don't see the option to download txt, but child works were generated, so I'm going to push forward on this test and see what happens.

Image

Child works do allow txt download.

Image

From the child work txt file, I see that the word "Christian" was recognized. I can search that in the parent work UI and receive results.

Image

I also searched for an unusual word that appears on the same child page--"discursive." The catalog search only returns the parent. So, all seems to be working as expected for the OCR search with this test work!

Image

@laritakr
Contributor

laritakr commented Apr 7, 2023

From the parent work, I don't see the option to download txt, but child works were generated, so I'm going to push forward on this test and see what happens.

We do not run OCR or generate other derivatives on the PDF itself, so this is to be expected. From what I remember, we only split the PDF and make a thumbnail.

@KatharineV
Collaborator

We do not run OCR or generate other derivatives on the PDF itself, so this is to be expected. From what I remember, we only split the PDF and make a thumbnail.

Thanks for that explanation. I was just following the testing instructions, which must be based on a previous configuration where the TXT derivative was created.

I also thought ticket 293 left some derivatives other than thumbnail for PDF. I'm rereading 293 right now, and it looks like other derivatives would be expected. I'm obviously not tracking the changes and what the final configuration should be, but I'll take my confusion to the other ticket.

@laritakr
Contributor

laritakr commented Apr 7, 2023

Yes, one problem with all of these overlapping tickets is that things change rapidly and it's hard to keep track of everything! A lot of the derivatives were removed because they take so much time to generate during ingest.

Derivatives ARE configurable and will become more so with upcoming work. If something different is needed, that configuration will need to be clarified. It's easier to do that when you can see what is there and ask questions based on actual use.

@ShanaLMoore ShanaLMoore added the blocked other work must be completed first label Apr 7, 2023
@ShanaLMoore
Author

ShanaLMoore commented Apr 7, 2023

Blocked from verifying: this requires a reindex.

unblocked
EDIT: best to test with new works. Would be nice to know if older imports work (those works must have ocr)

may I suggest: https://adl.b2.adventistdigitallibrary.org/concern/published_works/22253132_spectrum_winter_1969?locale=en

To find out what words you can search for, go to a child work's show page. In the action dropdown, select "download as txt" (ea-txt(2).txt) and open the file. Those words should be searchable via OCR. cc @KatharineV
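
If the downloaded txt file is long, a short script can suggest words worth trying. This is only a sketch: `candidate_search_terms` is a hypothetical helper, and the heuristic (longer, rarer words first) is an assumption about what makes a convenient test term, not part of the application.

```python
import re
from collections import Counter


def candidate_search_terms(ocr_text, min_length=6, limit=10):
    """Suggest distinctive words from OCR text to try in the UV or catalog search.

    Words are lowercased, words shorter than min_length are dropped, and the
    result lists the rarest words first (ties broken alphabetically).
    """
    words = re.findall(r"[A-Za-z]+", ocr_text.lower())
    counts = Counter(word for word in words if len(word) >= min_length)
    ranked = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return [word for word, _ in ranked[:limit]]
```

Reading the file with `open(path, encoding="utf-8").read()` and passing the text to the function gives a short list to paste into the search box.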

Although not related to this ticket, I'm aware that if you perform an empty catalog search, some of this work's children are returned. This is because that work was updated previously and did not have a child property in its metadata. We would need to run another mass rake task to update the child works so they are excluded from catalog search results.

Let me know if you have questions. That shouldn't be the focus of this ticket/QA, though, and isn't necessarily a bug. It's a to-do task for us to handle old data, which I'll take note of. Running the script would likely take too long to finish before your demo. (see second screenshot, below)

Image

Image
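
The catalog exclusion described above can be sketched as a Solr filter query plus a check for child documents that lack the flag. The field name `is_child_bsi` is an assumption and is not confirmed anywhere in this thread; the helper names are hypothetical as well.

```python
CHILD_FLAG_FIELD = "is_child_bsi"  # assumed field name, not confirmed by this thread


def exclude_children_fq(field=CHILD_FLAG_FIELD):
    """Return a Solr filter query that drops documents flagged as child works."""
    return f"-{field}:true"


def children_missing_flag(docs, child_ids, field=CHILD_FLAG_FIELD):
    """Return ids of known child works whose Solr docs lack the child flag.

    These are the documents that would still appear in a catalog search,
    because the filter query above only removes documents carrying the flag.
    """
    return [doc["id"] for doc in docs if doc["id"] in child_ids and not doc.get(field)]
```

Older child works that were indexed without the flag pass straight through such a filter, which matches the behavior described for the empty catalog search.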

@ShanaLMoore ShanaLMoore removed the blocked other work must be completed first label Apr 10, 2023
@KatharineV
Collaborator

Tested on production with the work that Shana suggested above. A search for OCR'd text did return a result as expected. Yay!

Image

@jillpe jillpe closed this as completed Apr 10, 2023
@kirkkwang kirkkwang transferred this issue from notch8/adventist-dl May 10, 2024