update Tesseract #35

GerHobbelt · 2019-08-08T09:22:36Z

https://github.com/tesseract-ocr/tesseract

GerHobbelt · 2019-09-09T21:06:25Z

Also consider offloading this to an external app entirely (as I have used different OCR applications in the past to cope with PDFs which the then-Tesseract/Qiqqa versions couldn't OCR properly).

See https://github.com/jbarlow83/OCRmyPDF for one example of this (which I encountered by way of https://tex.stackexchange.com/questions/11307/is-it-possible-to-produce-a-pdf-with-un-copyable-text while browsing around (La)TeX matters on a lazy afternoon).

IOW: see if we can get away with an entirely external OCR process which can deliver OCR'ed/textualized PDF files for Qiqqa to process, so that Qiqqa can still make mark & copy available as before (every word is indexed in Lucene with its box coordinates, i.e. position info, to help users find where in the PDF the sought phrase is located).
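A minimal sketch of what such an external OCR step could look like, assuming OCRmyPDF is installed and on the PATH. The helper names and file paths are hypothetical and not part of Qiqqa; `--skip-text` is the OCRmyPDF option that leaves pages which already carry a text layer untouched.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ocrmypdf_command(src: Path, dst: Path, skip_text: bool = True) -> List[str]:
    """Build the command line for an external OCRmyPDF run (hypothetical helper)."""
    cmd = ["ocrmypdf"]
    if skip_text:
        # Do not re-OCR pages that already have a text layer.
        cmd.append("--skip-text")
    cmd += [str(src), str(dst)]
    return cmd


def ocr_pdf_externally(src: Path, dst: Path) -> bool:
    """Run OCRmyPDF as a separate process; True when it exits with code 0."""
    result = subprocess.run(
        build_ocrmypdf_command(src, dst), capture_output=True, text=True
    )
    return result.returncode == 0
```

Qiqqa would then run its regular text extraction on the output file, so the word and box-coordinate indexing stays as it is today.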

GerHobbelt · 2019-11-05T22:45:58Z

I'm learning something every day...

QiqqaOCR (at the time of this writing) already does something similar: Qiqqa attempts to use pdfdraw.exe -tt first to dump the text+coordinates per word from a given PDF, a.k.a. QiqqaOCR 'GROUP' mode.

When that doesn't fly, it uses Sorax PDF render library + custom region detection logic (#135; b0rk b0rk b0rk) + Tesseract v2 to perform an OCR action which also delivers words+coordinates for the given page, a.k.a. QiqqaOCR 'SINGLE' mode.
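The GROUP-then-SINGLE fallback described above can be sketched roughly as follows. This is not the actual QiqqaOCR code (which is C#); the `Word` type, the extractor callables and the rule that an empty result counts as failure are all assumptions made for illustration.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class Word(NamedTuple):
    """One recognized word plus its bounding box on the page."""
    text: str
    left: float
    top: float
    width: float
    height: float


# An extractor takes (pdf_path, page_number) and returns words, or None/[] on failure.
Extractor = Callable[[str, int], Optional[List[Word]]]


def extract_words(
    pdf_path: str,
    page: int,
    group_extractor: Extractor,
    single_extractor: Extractor,
) -> Tuple[str, List[Word]]:
    """Try direct text extraction (GROUP) first; fall back to OCR (SINGLE)."""
    try:
        words = group_extractor(pdf_path, page)
    except Exception:
        # Assumption: any error in the text dump means "go and OCR it instead".
        words = None
    if words:
        return "GROUP", words
    return "SINGLE", single_extractor(pdf_path, page) or []
```

Keeping both extractors behind the same signature means the SINGLE step could later be swapped for a newer Tesseract or another OCR engine without touching the caller.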

There's a NuGet package for Tesseract and C#, which would be a migration/upgrade path away from the current antiquated Tesseract v2. However, its website states it's for Tesseract v3 only (though there's apparently a 4.0 beta too: charlesw/tesseract#428), and I'd rather ride the bleeding edge with Tesseract 5, so it's gonna be commandline work instead, I guess.
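For the commandline route, here is a rough sketch of how words plus coordinates could be pulled out of Tesseract's TSV output. It assumes a `tesseract` binary on the PATH that supports the `tsv` output config and takes a rendered page image as input; the `WordBox` type and function names are made up for this example.

```python
import csv
import io
import subprocess
from typing import List, NamedTuple


class WordBox(NamedTuple):
    text: str
    left: int
    top: int
    width: int
    height: int
    conf: float


def parse_tesseract_tsv(tsv_text: str) -> List[WordBox]:
    """Extract word rows (level 5) from Tesseract TSV output."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Levels 1-4 describe page, block, paragraph and line; 5 is a word.
        if row.get("level") != "5" or not text:
            continue
        words.append(WordBox(
            text,
            int(row["left"]), int(row["top"]),
            int(row["width"]), int(row["height"]),
            float(row["conf"]),
        ))
    return words


def ocr_image(image_path: str) -> List[WordBox]:
    """Run the tesseract CLI on one page image and return its word boxes."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "tsv"],
        capture_output=True, text=True, encoding="utf-8", check=True,
    )
    return parse_tesseract_tsv(result.stdout)
```

The box coordinates are in image pixels, so they would still need scaling to PDF page coordinates before they go into the index.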

And then, totally off topic of course, there is my intent to run PDFs through other OCR engines as an alternative to Tesseract, such as ABBYY FineReader and ReadIris, as those are the ones I use on a more regular basis.

References / stuff I looked at while looking into the Tesseract migration

https://github.com/charlesw/tesseract/#user-content-tesseract-language-data
https://github.com/tesseract-ocr/tesseract#user-content-brief-history
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
https://github.com/UB-Mannheim/tesseract/wiki
https://groups.google.com/forum/#!msg/tesseract-ocr/Wdh_JJwnw94/24JHDYQbBQAJ
Publish Tesseract 4.0 to nuget.org charlesw/tesseract#428
http://www.mythoughtspot.com/2015/01/06/pdf-to-tiff-to-txt-bash-script-automation/ (TIFF can be multipage, hence a single run is all it needs to produce an A/PDF using Tesseract)
https://github.com/LeoFCardoso/pdf2pdfocr
https://github.com/tesseract-ocr/tesseract/wiki/User-Projects-%E2%80%93-3rdParty
https://github.com/itext/itext7-dotnet (via https://www.codingame.com/playgrounds/10058/scanned-pdf-to-ocr-textsearchable-pdf-using-c )
https://www.codeproject.com/Articles/1303061/Convert-all-files-to-searchable-PDFs (nice! Script also converts Office docs to PDF)
http://guides.library.illinois.edu/c.php?g=347520&p=4121426
https://dantonnoriega.github.io/ultinomics.org/post/2016-03-29-pdf-text-convert-ocr-tesseract.html + https://en.wikipedia.org/wiki/List_of_precomposed_Latin_characters_in_Unicode (these describe the woes around searchable text which contains ligatures: since Tesseract recognizes ligatures, this info is handy to have for preprocessing both the OCR text and the user searches hitting our Lucene index!)
http://www.fmwconcepts.com/imagemagick/textcleaner/index.php
https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage
https://csharp.hotexamples.com/examples/Tesseract/TesseractEngine/-/php-tesseractengine-class-examples.html
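
Regarding the ligature note in the references above: one possible way to keep OCR text and user queries comparable is to run both through the same Unicode normalization before they reach the index. A minimal sketch, assuming NFKC compatibility normalization is acceptable for the index (it also folds other compatibility characters, such as full-width forms); the function name is hypothetical.

```python
import unicodedata


def normalize_for_index(text: str) -> str:
    """Expand presentation-form ligatures so 'ﬁ' and 'fi' match in searches."""
    # NFKC maps compatibility ligatures such as U+FB01 (ﬁ) and U+FB02 (ﬂ)
    # to their plain letter sequences. Letters like 'æ' and 'œ' have no
    # decomposition and are left untouched.
    return unicodedata.normalize("NFKC", text)
```

The same function would have to be applied both when indexing the OCR output and when parsing the user's search phrase, otherwise the two sides will not agree.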

GerHobbelt · 2019-11-05T22:54:27Z

As written in #135: upgrading to latest Tesseract implies:

Such a migration would of course impact the installer: maybe we should add code there to download the Tesseract installer and install it alongside Qiqqa. At least that would be the approach that increases the installer size the least.

…new bits of technology to be integrated into Qiqqa as we upgrade the functional elements to modern standards (embedded browser, etc.): #2 #7 #34 #35


GerHobbelt mentioned this issue Aug 13, 2019

Guaranteed Backwards Compatibility #43

Open

GerHobbelt added the 🦸‍♀️enhancement🦸‍♂️ New feature or request label Oct 4, 2019

GerHobbelt added this to the Our Glorious Future milestone Oct 9, 2019

GerHobbelt mentioned this issue Nov 5, 2019

QiqqaOCR: internal region selector logic is still b0rked #135

Closed

GerHobbelt modified the milestones: Our Glorious Future, v82, v83 Nov 5, 2019

GerHobbelt mentioned this issue Mar 27, 2020

Critical bug: Qiqqa does not report major failures in texifying a file; blames it on OCR! #193

Open

GerHobbelt mentioned this issue Aug 9, 2020

GUI fixes and refactoring example #211

Closed

GerHobbelt mentioned this issue Dec 29, 2020

"Unexpected problem in Qiqqa" (randomly) during a quick scrolling or zooming a pdf #280

Open

This was referenced Feb 27, 2021

several PDFs caused Qiqqa to run indefinitely after closing it #305

Open

Qiqqa error pops up "unexpected problem in qiqqa" v83.0.7656.6401 - I sent you zipped logs to email #304

Open
