Output hocr and pdf #20

Closed
robszabo opened this issue Jan 20, 2018 · 3 comments

Comments

@robszabo

First of all, thanks for this great package: it simplifies using tesseract from R and makes it possible to download the language files with a single line of R code. This is powerful!

It would also be really nice if it were possible to output an OCR'ed document in hOCR format or as a searchable PDF directly. This would make the package even simpler to use for people (like me) who don't have the skills to configure or override the settings in tesseract.
With an additional parameter "output" to the ocr function that could be one of {"text", "hocr", "pdf"}, it could look like this:
out <- ocr("test.tif", engine = tesseract("swe"), output = "hocr")

I think this would make this R package very strong in terms of how widely it could be used.

Again, thanks for the great work!

jeroen added a commit that referenced this issue Jan 26, 2018
@jeroen
Member

jeroen commented Jan 26, 2018

Thank you for the suggestion. I have added a parameter to get the HOCR output. Can you test this?

devtools::install_github("ropensci/tesseract")

@robszabo
Author

Works like a charm! Thanks a lot! :-)

jeroen closed this as completed Jan 26, 2018
@jeroen
Member

jeroen commented Jan 26, 2018

On CRAN now.
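
For readers arriving at this thread later, here is a minimal sketch of how the hOCR output can be requested. The thread itself does not name the new parameter; the sketch assumes it is the `HOCR` argument of `ocr()` found in released versions of the package, and it reuses the file name `test.tif` and the language `swe` from the original request. Searchable PDF output is not covered here.

```r
# Sketch only: assumes the tesseract package is installed and that
# "test.tif" (the file name from the request above) exists locally.
library(tesseract)

# Download the Swedish training data once; skip it if already available.
if (!"swe" %in% tesseract_info()$available) {
  tesseract_download("swe")
}

swe <- tesseract("swe")

# Plain text output (the default behaviour).
text <- ocr("test.tif", engine = swe)

# hOCR output: markup that carries the recognised words and their
# bounding boxes. HOCR = TRUE is assumed to be the parameter added here.
hocr <- ocr("test.tif", engine = swe, HOCR = TRUE)

# Save the markup so other tools can consume it.
writeLines(hocr, "test.hocr")
```

Compared with the `output = "hocr"` form suggested in the request, this uses a logical switch, so the call returns either plain text or hOCR markup as a character string.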
