update Tesseract #35
Also consider offloading this to an external app entirely (I have used different OCR applications in the past to cope with PDFs which the then-Tesseract/Qiqqa versions couldn't OCR properly). See https://github.com/jbarlow83/OCRmyPDF for one example of this (which I encountered by way of https://tex.stackexchange.com/questions/11307/is-it-possible-to-produce-a-pdf-with-un-copyable-text while browsing around (La)TeX matters on a lazy afternoon). IOW: see if we can get away with an entirely external OCR process which delivers OCR'd/textualized PDF files for Qiqqa to process, so that Qiqqa can still make mark & copy available as before (every word is indexed with box coordinates, i.e. position info in Lucene, to help users find where in the PDF the sought phrase is located).
I'm learning something every day... QiqqaOCR (at the time of this writing) already does something similar: Qiqqa attempts to use When that doesn't fly, it uses the Sorax PDF render library + custom region detection logic (#135; b0rk b0rk b0rk) + Tesseract v2 to perform an OCR action which also delivers words+coordinates for the given page, a.k.a. QiqqaOCR 'SINGLE' mode.

There's a NuGet package for Tesseract and C#, which would be a migration/upgrade path away from the current antiquated Tesseract v2, but that website states it's for Tesseract v3 only (though there's apparently a 4.0 beta too: charlesw/tesseract#428). I'd rather ride the bleeding edge with Tesseract 5, so it's gonna be command-line work instead, I guess.
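For illustration only, here is a rough sketch of what the "command-line work" could look like: run the `tesseract` executable with its `tsv` output config and pull the words plus box coordinates out of the result. This is not Qiqqa code. It assumes a Tesseract 4/5 binary on the PATH and a page that has already been rendered to an image; the `WordBox`, `parse_tesseract_tsv` and `ocr_page` names are made up for this example.

```python
import csv
import io
import subprocess
from typing import List, NamedTuple


class WordBox(NamedTuple):
    """One recognized word with its bounding box in image pixels (hypothetical type)."""
    text: str
    left: int
    top: int
    width: int
    height: int
    conf: float


def parse_tesseract_tsv(tsv_text: str) -> List[WordBox]:
    """Extract word-level rows (level 5) from Tesseract's TSV output."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Levels 1-4 describe page/block/paragraph/line; only level 5 rows are words.
        if row.get("level") != "5" or not text:
            continue
        words.append(
            WordBox(
                text=text,
                left=int(row["left"]),
                top=int(row["top"]),
                width=int(row["width"]),
                height=int(row["height"]),
                conf=float(row["conf"]),
            )
        )
    return words


def ocr_page(image_path: str) -> List[WordBox]:
    """Run the tesseract CLI on one page image and return its word boxes.

    Assumes `tesseract` is installed and on the PATH.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "tsv"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_tesseract_tsv(result.stdout)
```

Keeping the parsing separate from the process call means the word+coordinate extraction can be checked without a Tesseract install; how these boxes would map onto what Qiqqa stores in Lucene is left open here.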
References / stuff I looked at while looking at Tesseract migration
As written in #135: upgrading to latest Tesseract implies:
…new bits of technology to be integrated into Qiqqa as we upgrade the functional elements to modern standards (embedded browser, etc.): jimmejardine/qiqqa-open-source#2 jimmejardine/qiqqa-open-source#7 jimmejardine/qiqqa-open-source#34 jimmejardine/qiqqa-open-source#35
https://github.com/tesseract-ocr/tesseract