update Tesseract #35

Open
GerHobbelt opened this issue Aug 8, 2019 · 3 comments

@GerHobbelt
Collaborator

https://github.com/tesseract-ocr/tesseract

@GerHobbelt
Collaborator Author

Also consider offloading this to an external app entirely (as I have used different OCR applications in the past to cope with PDFs which the then-Tesseract/Qiqqa versions couldn't OCR properly).

See https://github.com/jbarlow83/OCRmyPDF for one example of this (which I encountered by way of https://tex.stackexchange.com/questions/11307/is-it-possible-to-produce-a-pdf-with-un-copyable-text while browsing around (La)TeX matters on a lazy afternoon).

IOW: see if we can get away with an entirely external OCR process which can deliver OCR'ed/textualized PDF files for Qiqqa to process, so that Qiqqa can still make mark & copy available as before (every word is indexed in Lucene with its box coordinates, i.e. position info, to help users find where in the PDF the sought phrase is located).
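
A minimal sketch of what such an external step could look like, assuming OCRmyPDF is installed and available on the PATH. The helper names `build_ocrmypdf_command` and `run_external_ocr` are hypothetical, not anything Qiqqa has today; `--skip-text` is the OCRmyPDF option that leaves pages which already carry a text layer untouched.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def build_ocrmypdf_command(src: Path, dst: Path) -> list[str]:
    """Build the command line for an external OCRmyPDF run (hypothetical helper)."""
    # --skip-text: do not OCR pages that already contain a text layer.
    return ["ocrmypdf", "--skip-text", str(src), str(dst)]


def run_external_ocr(src: Path, dst: Path) -> bool:
    """Run the external OCR step; return True when it produced the output PDF."""
    try:
        result = subprocess.run(build_ocrmypdf_command(src, dst), check=False)
    except FileNotFoundError:
        # ocrmypdf is not installed or not on the PATH.
        return False
    return result.returncode == 0 and dst.exists()
```

Qiqqa would then only have to read the text layer (words plus coordinates) from the resulting PDF, which is the part it already knows how to do.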

@GerHobbelt GerHobbelt added the 🦸‍♀️enhancement🦸‍♂️ New feature or request label Oct 4, 2019
@GerHobbelt GerHobbelt added this to the Our Glorious Future milestone Oct 9, 2019
@GerHobbelt
Collaborator Author

I'm learning something every day...

QiqqaOCR (at the time of this writing) already does something similar: Qiqqa attempts to use pdfdraw.exe -tt first to dump the text+coordinates per word from a given PDF, a.k.a. QiqqaOCR 'GROUP' mode.

When that doesn't fly, it uses Sorax PDF render library + custom region detection logic (#135; b0rk b0rk b0rk) + Tesseract v2 to perform an OCR action which also delivers words+coordinates for the given page, a.k.a. QiqqaOCR 'SINGLE' mode.
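
The two-stage approach described above can be sketched roughly as follows. This is an illustration only, simplified to one page at a time: the function name, the `Extractor` signature and the `"text-layer"` / `"ocr"` labels are assumptions made for the example and do not mirror the actual QiqqaOCR code.

```python
from typing import Callable, List, Optional, Tuple

# One word plus its bounding box: (text, left, top, width, height).
WordBox = Tuple[str, float, float, float, float]
# Hypothetical extractor signature: (pdf_path, page_number) -> words, or None on failure.
Extractor = Callable[[str, int], Optional[List[WordBox]]]


def extract_page_words(
    pdf_path: str,
    page: int,
    text_layer_extractor: Extractor,
    ocr_extractor: Extractor,
) -> Tuple[str, List[WordBox]]:
    """Try the embedded text layer first; fall back to OCR when it yields nothing."""
    words = text_layer_extractor(pdf_path, page)
    if words:
        return "text-layer", words
    # No usable text layer (e.g. a scanned page): fall back to OCR.
    return "ocr", ocr_extractor(pdf_path, page) or []
```

Passing the two extractors in as callables keeps the fallback logic independent of which tool does the work, so swapping Tesseract v2 for something newer would only touch the OCR extractor.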

There's a NuGet package for Tesseract and C#, which would be a migration/upgrade path away from the current antiquated Tesseract v2. However, that website states it's for Tesseract v3 only (though there's apparently a 4.0 beta too: charlesw/tesseract#428), and I'd rather ride the bleeding edge with Tesseract 5, so it's gonna be command-line work instead, I guess.
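
For the command-line route, recent Tesseract versions can write a TSV file with one row per layout element, including word-level boxes, e.g. via `tesseract page.png page tsv`. Below is a small sketch of turning that output into words plus coordinates. The `Word` type and `parse_tesseract_tsv` are hypothetical names; the column names follow the header row Tesseract writes, and word rows are assumed to be the ones with `level` 5.

```python
import csv
import io
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    left: int
    top: int
    width: int
    height: int
    conf: float


def parse_tesseract_tsv(tsv_text: str) -> List[Word]:
    """Extract word-level rows (level 5) from Tesseract TSV output."""
    # QUOTE_NONE: recognized text may contain quote characters.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Skip page/block/paragraph/line rows and empty words.
        if row.get("level") != "5" or not text:
            continue
        words.append(
            Word(
                text,
                int(row["left"]),
                int(row["top"]),
                int(row["width"]),
                int(row["height"]),
                float(row["conf"]),
            )
        )
    return words
```

The coordinates are in pixels of the rendered page image, so they would still need to be scaled to whatever coordinate system the Lucene index expects.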

And then, totally off topic of course, there is my intent to run PDFs through other OCR engines, as an alternative to Tesseract, such as ABBYY FineReader and ReadIris, as those are the ones I use on a more regular basis.

References / stuff I looked at while looking at Tesseract migration

@GerHobbelt
Collaborator Author

As written in #135: upgrading to latest Tesseract implies:

Such a migration would of course impact the installer: maybe we should add code there to download the Tesseract installer and install it alongside Qiqqa; that would be the approach that adds the least to the size of the installer.

@GerHobbelt GerHobbelt modified the milestones: Our Glorious Future, v82, v83 Nov 5, 2019
GerHobbelt added a commit that referenced this issue Apr 21, 2020
…new bits of technology to be integrated into Qiqqa as we upgrade the functional elements to modern standards (embedded browser, etc.): #2 #7 #34 #35
GerHobbelt added a commit to GerHobbelt/qiqqa-technology-tests that referenced this issue Sep 14, 2022
…new bits of technology to be integrated into Qiqqa as we upgrade the functional elements to modern standards (embedded browser, etc.): jimmejardine/qiqqa-open-source#2 jimmejardine/qiqqa-open-source#7 jimmejardine/qiqqa-open-source#34 jimmejardine/qiqqa-open-source#35