Output hocr and pdf #20

robszabo · 2018-01-20T16:13:45Z

First of all, thanks for this great package: it simplifies using tesseract from R and makes it possible to download the language files with a single line of R code. This is powerful!

It would also be really nice if the package could output an OCR'ed document in hOCR format or as a searchable PDF directly. This would make the package even simpler to use for people (like me) who don't have the skills to configure or override the settings in tesseract.
With an additional parameter "output" to the ocr function, taking one of "text", "hocr", or "pdf", it could look like this:
out <- ocr("test.tif", engine = tesseract("swe"), output = "hocr")

I think this would make this R package very strong in terms of how widely it could be used.

Again, thanks for the great work!

jeroen · 2018-01-26T10:04:36Z

Thank you for the suggestion. I have added a parameter to get the HOCR output. Can you test this?

devtools::install_github("ropensci/tesseract")
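For anyone trying this out, a minimal sketch of a test call could look like the following. It assumes the new argument is named `HOCR` (the name is not given in this thread), that the Swedish training data is already installed, and it reuses the `test.tif` file name from the request above; the output file name `test.hocr` is only illustrative.

```r
# Sketch only: the argument name `HOCR` is an assumption, not confirmed in this thread.
# Assumes the Swedish language data is installed, e.g. via
# tesseract::tesseract_download("swe").
image_path <- "test.tif"   # file name taken from the original request
hocr_path <- "test.hocr"   # hypothetical output file name

# Only run when the package and the image are actually available.
if (requireNamespace("tesseract", quietly = TRUE) && file.exists(image_path)) {
  engine <- tesseract::tesseract("swe")
  # HOCR = TRUE requests hOCR markup (text plus bounding boxes) instead of plain text.
  hocr <- tesseract::ocr(image_path, engine = engine, HOCR = TRUE)
  writeLines(hocr, hocr_path)
}
```

The guard keeps the snippet harmless on machines where the package or the image is missing; drop it for interactive use.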

robszabo · 2018-01-26T11:34:00Z

Works like a charm! Thanks a lot! :-)

jeroen · 2018-01-26T14:19:26Z

On CRAN now.

jeroen added a commit that referenced this issue Jan 26, 2018

Add support for HOCR output (#20)

3644eb8

jeroen closed this as completed Jan 26, 2018
