Spike: OCR search not working #428
Comments
TODO: validate whether this is still an issue locally. Do we need to remediate the current data? (We should get guidance from Katharine on Tuesday.) cc @jillpe Hopefully we can confirm that it's valid by then. |
OCR seems to be working, but the results on the sample data are very inconsistent. Very poorly OCR'd example (text and image): Good OCR example (text and image): |
Prior to this commit, we assumed that a work's `id` was always the mechanism for identifying a work with regard to its ancestry/lineage. However, for implementations that incorporate slug behavior, that is not always the case. With this commit, we expose a configurable mechanism for altering the ancestry functionality. Related to: - notch8/adventist-dl#354 - https://github.com/scientist-softserv/adventist-dl/issues/319
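The commit message above describes a configurable way to identify a work for ancestry purposes. A minimal sketch of that idea follows. It is not the code from the commit: `AncestryConfig`, `identifier_builder`, `identifier_for`, and the `Work` struct are all hypothetical names chosen for illustration.

```ruby
# Hypothetical sketch; none of these names come from the commit itself.
module AncestryConfig
  class << self
    # Applications may assign any callable that maps a work to an identifier.
    attr_writer :identifier_builder

    # Defaults to the work's id, matching the earlier assumption.
    def identifier_builder
      @identifier_builder ||= ->(work) { work.id }
    end

    # Returns the identifier used when recording or looking up ancestry.
    def identifier_for(work)
      identifier_builder.call(work)
    end
  end
end

# Stand-in for a real work model (assumption: works expose id and slug).
Work = Struct.new(:id, :slug)

work = Work.new("abc123", "20121820_thesis")
default_id = AncestryConfig.identifier_for(work)

# A slug-based application overrides the builder once, e.g. in an initializer.
AncestryConfig.identifier_builder = ->(w) { w.slug || w.id }
slug_id = AncestryConfig.identifier_for(work)
```

Keeping the default as `work.id` means applications without slugs would see no change in behavior, while slug-based ones opt in explicitly.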
Blocked because we may need to reindex staging. |
OLD DATA WILL NOT WORK until staging's reindexing succeeds. Until then, please only QA this using new data on the sdapi tenant. |
QA: Pass ✅
On sdapi I searched for the word "thesis" from the parent work, and it returned matches.
OCR SEARCH: searched for the word 'thesis'
UV
CATALOG: only the parent returned
METADATA SEARCH: searched for identifier 20121820
UV
CATALOG: only the parent returned |
I'm testing this with SDAPI staging and a work I uploaded yesterday. From the parent work, I don't see the option to download txt, but child works were generated, so I'm going to push forward on this test and see what happens. Child works do allow txt download. From the child work txt file, I see that the word "Christian" was recognized. I can search that in the parent work UI and receive results. I also searched for an unusual word that appears on the same child page--"discursive." The catalog search only returns the parent. So, all seems to be working as expected for the OCR search with this test work! |
We do not generate OCR or other derivatives for the PDF itself, so this is to be expected. From what I remember, we only split the PDF and make a thumbnail. |
Thanks for that explanation. I was just following the testing instructions, which must be based on a previous configuration where the TXT derivative was created. I also thought ticket 293 left some derivatives other than thumbnail for PDF. I'm rereading 293 right now, and it looks like other derivatives would be expected. I'm obviously not tracking the changes and what the final configuration should be, but I'll take my confusion to the other ticket. |
Yes, one problem with all of these overlapping tickets is that things change rapidly and it's hard to keep track of everything! A lot of the derivatives were removed because they take so much time to generate during ingest. Derivatives ARE configurable and will become more so with upcoming work. If something different is needed, that configuration will need to be clarified. It's easier to do that when you can see what is there and ask questions based on actual use. |
Blocked from verifying: this requires a reindex.
Unblocked. May I suggest: https://adl.b2.adventistdigitallibrary.org/concern/published_works/22253132_spectrum_winter_1969?locale=en
To find out what words you can search for, go to a child work's show page. In the action dropdown, select download as txt (ea-txt(2).txt) and open the file. Those words should be available for OCR search.
cc @KatharineV
Although not related to this ticket: if you perform an empty catalog search, I'm aware that some of this work's children are returned. This is because that work was updated previously and did not have a child property in its metadata. We would need to run another mass rake task to update the child works so they know they need to be excluded from the catalog search results. Let me know if you have questions. That shouldn't be the focus of this ticket/QA, though, and isn't necessarily a bug. It's a to-do task for us to handle old data, which I'll take note of. Running the script would likely take too long before your demo. (See second screenshot, below.) |
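The step of opening a child work's txt download to pick search words can be partly automated. The sketch below is only an illustration of that idea: `candidate_terms` is a hypothetical helper, and the sample text is made up rather than taken from any real OCR file. It prefers longer, rarer words, on the assumption that they make clearer test searches.

```ruby
# Hypothetical helper: suggests distinctive words from OCR text to try in the viewer search.
# Rarer words come first; ties are broken by longer length, then alphabetically.
def candidate_terms(ocr_text, min_length: 6, limit: 5)
  counts = ocr_text.downcase.scan(/[a-z]+/)
                   .select { |word| word.length >= min_length }
                   .tally
  counts.sort_by { |word, count| [count, -word.length, word] }
        .first(limit)
        .map(&:first)
end

# Made-up sample standing in for the contents of a downloaded txt file.
sample_text = "Christian thought and discursive reasoning; Christian thought"
terms = candidate_terms(sample_text, limit: 2)
```

In practice the text would come from something like `File.read` on the downloaded file; the filename depends on the work being tested.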
Tested on production with the work that Shana suggested above. A search for OCR'd text did return a result as expected. Yay! |
Story
Per Katharine,
"I searched for a word in the UV and got no results despite seeing it on the page. This work has split into pages. Can you determine if something is wrong with the OCR search in viewer, or are we still waiting on the work to process?"
Acceptance Criteria
Screenshots / Video
link to example: https://adl.s2.adventistdigitallibrary.org/concern/published_works/20214381_an_appeal_to_the_youth_funeral_address_of_henry_n_white?locale=en
generated ocr: 8b-txt.txt
Search for "Youth" returned no results:
When looking at this Solr doc in Rancher, I'm not seeing `all_text_tsimv`.
Confirmed that this work type has OCR turned on: https://github.com/scientist-softserv/adventist-dl/blob/main/app/models/published_work.rb#L31
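The Solr check described above (looking for `all_text_tsimv` on the document) can be expressed as a small helper. This is a sketch under assumptions: it treats the Solr document as a plain Ruby hash, as you might get from parsing a JSON response, and `ocr_indexed?` and `ocr_contains?` are hypothetical names, not methods from the application. The sample documents are made up.

```ruby
# Field name taken from the issue text; everything else here is illustrative.
OCR_FIELD = "all_text_tsimv"

# True when the document carries at least one non-blank OCR text value.
def ocr_indexed?(solr_doc, field: OCR_FIELD)
  Array(solr_doc[field]).any? { |value| !value.to_s.strip.empty? }
end

# True when any OCR text value contains the term, ignoring case.
# This is a plain substring check, not a reproduction of Solr's own matching.
def ocr_contains?(solr_doc, term, field: OCR_FIELD)
  Array(solr_doc[field]).any? { |value| value.to_s.downcase.include?(term.downcase) }
end

# Made-up documents: one without the OCR field, one with it populated.
doc_missing = { "id" => "example-child-1" }
doc_indexed = { "id" => "example-child-2", OCR_FIELD => ["An Appeal to the Youth"] }

missing_result = ocr_indexed?(doc_missing)
indexed_result = ocr_indexed?(doc_indexed)
youth_result   = ocr_contains?(doc_indexed, "Youth")
```

A document that fails `ocr_indexed?` would point to an indexing or job-timing problem rather than a viewer problem, which matches the note in this issue about Sidekiq jobs still running.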
Testing Instructions and Sample Files
Notes
This issue was reported before Sidekiq completed its jobs, so it may be a timing issue.