update documentation #73

Closed
wants to merge 2 commits into from
4 changes: 1 addition & 3 deletions .Rbuildignore
@@ -2,10 +2,8 @@
^\.Rproj\.user$
^src/Makevars$
^windows
\.pdf$
\.png$
^vignettes/.*\.png$
\.webp$
\.jpeg$
\.o$
\.dll$
^\.travis\.yml$
4 changes: 1 addition & 3 deletions .gitignore
@@ -2,11 +2,9 @@
*.so
*.dll
*.a
*.txt
*.pdf
*.png
*.webp
*.jpeg
vignettes/*.png
.Rproj.user
.Rhistory
inst/tessdata
10 changes: 8 additions & 2 deletions DESCRIPTION
@@ -2,8 +2,14 @@ Package: tesseract
Type: Package
Title: Open Source OCR Engine
Version: 5.2.1
Authors@R: person("Jeroen", "Ooms", role = c("aut", "cre"), email = "[email protected]",
comment = c(ORCID = "0000-0002-4035-0289"))
Authors@R: c(person("Jeroen", "Ooms",
role = c("aut", "cre"),
email = "[email protected]",
comment = c(ORCID = "0000-0002-4035-0289")),
person("Mauricio", "Vargas Sepulveda",
role = "ctb",
email = "[email protected]",
comment = c(ORCID = "0000-0003-1017-7574")))
Description: Bindings to 'Tesseract':
a powerful optical character recognition (OCR) engine that supports over 100 languages.
The engine is highly configurable in order to tune the detection algorithms and
1 change: 1 addition & 0 deletions NAMESPACE
@@ -9,3 +9,4 @@ export(tesseract_info)
export(tesseract_params)
importFrom(Rcpp,sourceCpp)
useDynLib(tesseract)
useDynLib(tesseract, .registration = TRUE)
22 changes: 11 additions & 11 deletions R/RcppExports.R
@@ -2,46 +2,46 @@
# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393

tesseract_config <- function() {
.Call('_tesseract_tesseract_config', PACKAGE = 'tesseract')
.Call(`_tesseract_tesseract_config`)
}

tesseract_engine_internal <- function(datapath, language, confpaths, opt_names, opt_values) {
.Call('_tesseract_tesseract_engine_internal', PACKAGE = 'tesseract', datapath, language, confpaths, opt_names, opt_values)
.Call(`_tesseract_tesseract_engine_internal`, datapath, language, confpaths, opt_names, opt_values)
}

tesseract_engine_set_variable <- function(ptr, name, value) {
.Call('_tesseract_tesseract_engine_set_variable', PACKAGE = 'tesseract', ptr, name, value)
.Call(`_tesseract_tesseract_engine_set_variable`, ptr, name, value)
}

validate_params <- function(params) {
.Call('_tesseract_validate_params', PACKAGE = 'tesseract', params)
.Call(`_tesseract_validate_params`, params)
}

engine_info_internal <- function(ptr) {
.Call('_tesseract_engine_info_internal', PACKAGE = 'tesseract', ptr)
.Call(`_tesseract_engine_info_internal`, ptr)
}

print_params <- function(filename) {
.Call('_tesseract_print_params', PACKAGE = 'tesseract', filename)
.Call(`_tesseract_print_params`, filename)
}

get_param_values <- function(ptr, params) {
.Call('_tesseract_get_param_values', PACKAGE = 'tesseract', ptr, params)
.Call(`_tesseract_get_param_values`, ptr, params)
}

ocr_raw <- function(input, ptr, HOCR = FALSE) {
.Call('_tesseract_ocr_raw', PACKAGE = 'tesseract', input, ptr, HOCR)
.Call(`_tesseract_ocr_raw`, input, ptr, HOCR)
}

ocr_file <- function(file, ptr, HOCR = FALSE) {
.Call('_tesseract_ocr_file', PACKAGE = 'tesseract', file, ptr, HOCR)
.Call(`_tesseract_ocr_file`, file, ptr, HOCR)
}

ocr_raw_data <- function(input, ptr) {
.Call('_tesseract_ocr_raw_data', PACKAGE = 'tesseract', input, ptr)
.Call(`_tesseract_ocr_raw_data`, input, ptr)
}

ocr_file_data <- function(file, ptr) {
.Call('_tesseract_ocr_file_data', PACKAGE = 'tesseract', file, ptr)
.Call(`_tesseract_ocr_file_data`, file, ptr)
}

12 changes: 6 additions & 6 deletions R/ocr.R
@@ -2,7 +2,7 @@
#'
#' Extract text from an image. Requires that you have training data for the language you
#' are reading. Works best for images with high contrast, little noise and horizontal text.
#' See [tesseract wiki](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality) and
#' See [tesseract wiki](https://github.com/tesseract-ocr/tessdoc) and
#' our package vignette for image preprocessing tips.
#'
#' The `ocr()` function returns plain text by default, or hOCR text if hOCR is set to `TRUE`.
@@ -18,15 +18,15 @@
#' @param HOCR if `TRUE` return results as HOCR xml instead of plain text
#' @rdname ocr
#' @references [Tesseract: Improving Quality](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality)
#' @importFrom Rcpp sourceCpp
#' @examples # Simple example
#' text <- ocr("https://jeroen.github.io/images/testocr.png")
#' file <- system.file("examples", "testocr.png", package = "tesseract")
#' text <- ocr(file)
#' cat(text)
#'
#' xml <- ocr("https://jeroen.github.io/images/testocr.png", HOCR = TRUE)
#' xml <- ocr(file, HOCR = TRUE)
#' cat(xml)
#'
#' df <- ocr_data("https://jeroen.github.io/images/testocr.png")
#' df <- ocr_data(file)
#' print(df)
#'
#' \donttest{
@@ -35,7 +35,7 @@
#' orig <- pdftools::pdf_text("R-intro.pdf")[1]
#'
#' # Render pdf to png image
#' img_file <- pdftools::pdf_convert("R-intro.pdf", format = 'tiff', pages = 1, dpi = 400)
#' img_file <- pdftools::pdf_convert("R-intro.pdf", format = "tiff", pages = 1, dpi = 400)
#' unlink("R-intro.pdf")
#'
#' # Extract text from png image
2 changes: 1 addition & 1 deletion R/onload.R
@@ -60,7 +60,7 @@ check_training_data <- function(){
tryCatch(tesseract(), error = function(e){
warning("Unable to find English training data", call. = FALSE)
os <- utils::sessionInfo()$running
if(isTRUE(grepl("ubuntu|debian", os, TRUE))){
if (isTRUE(grepl("ubuntu|debian|pop", os, TRUE))) {
stop("DEBIAN / UBUNTU: Please run: apt-get install tesseract-ocr-eng")
}
})
3 changes: 2 additions & 1 deletion R/tessdata.R
@@ -30,7 +30,8 @@
#' if(is.na(match("fra", tesseract_info()$available)))
#' tesseract_download("fra", model = 'best')
#' french <- tesseract("fra")
#' text <- ocr("https://jeroen.github.io/images/french_text.png", engine = french)
#' file <- system.file("examples", "french.png", package = "tesseract")
#' text <- ocr(file, engine = french)
#' cat(text)
#' }
tesseract_download <- function(lang, datapath = NULL, model = c("fast", "best"), progress = interactive()) {
12 changes: 12 additions & 0 deletions R/tesseract-package.R
@@ -0,0 +1,12 @@
#' @title Open Source OCR Engine
#'
#' @description
#' Bindings to 'Tesseract':
#' a powerful optical character recognition (OCR) engine that supports over 100
#' languages. The engine is highly configurable in order to tune the detection
#' algorithms and obtain the best possible results.
#'
#' @name tesseract-package
#' @importFrom Rcpp sourceCpp
#' @useDynLib tesseract, .registration = TRUE
"_PACKAGE"
6 changes: 3 additions & 3 deletions R/tesseract.R
@@ -27,11 +27,11 @@ tesseract <- local({
language <- as.character(language)
configs <- as.character(configs)
options <- as.list(options)
if(isTRUE(cache)){
if(isTRUE(cache)) {
key <- digest::digest(list(language, datapath, configs, options))
if(is.null(store[[key]])){
ptr <- tesseract_engine(datapath, language, configs, options)
assign(key, ptr, store);
assign(key, ptr, store)
}
store[[key]]
} else {
@@ -43,7 +43,7 @@
#' @export
#' @rdname tesseract
#' @param filter only list parameters containing a particular string
#' @examples tesseract_params('debug')
#' @examples tesseract_params("debug")
tesseract_params <- function(filter = ""){
tmp <- print_params(tempfile())
on.exit(unlink(tmp))
Binary file added inst/examples/bowers.jpg
Binary file added inst/examples/chinese.png
1 change: 1 addition & 0 deletions inst/examples/chinese.txt
@@ -0,0 +1 @@
奧林匹克運動會(希臘語:Ολυμπιακοί Αγώνες;法語:Jeux olympiques;英語:Olympic Games),簡稱奧運會、奧運,是世界最高等級的國際綜合體育賽事,由國際奧林匹克委員會主辦,每4年舉行一次。冬季競技項目創立冬季奧林匹克運動會後,之前的奧林匹克運動會則是又稱為「夏季奧林匹克運動會」以示區分。從1994年起,冬季奧運會和夏季奧運會分開,相隔2年交替舉行。奥林匹克運動會最早起源於古希腊,是當時各城邦之間的公開較量,因為舉辦地在奧林匹亚而得名。信奉基督教的羅馬皇帝狄奧多西一世以奧林匹克運動會崇拜耶穌以外神衹為由,禁止奧運競技,於是奧運在舉辦超過1,000年後於4世紀末停辦,奧運這次停辦持續了1,503年,直到19世纪末才由後人發現遺蹟。之後,法國的顾拜旦男爵皮耶·德·古柏坦創立了有真正奧運精神的現代奧林匹克運動會,自1896年開始每4年舉辦一次,更確立了會期不超過18日的傳統。現代奧運會只在兩次世界大戰期間合共中斷過5次(分別是1916年夏季奧運會、1940年夏季奧運會[1]、1940年冬季奧運會[1]、1944年夏季奧運會和1944年冬季奧運會)[註 1],以及在2020年因全球防疫延期過一次(2020年夏季奧運會[2][註 2])。
Binary file added inst/examples/french.png
Binary file added inst/examples/ocrscan.pdf
Binary file not shown.
Binary file added inst/examples/polytonicgreek.png
Binary file added inst/examples/receipt.png
Binary file added inst/examples/tealbook.png
Binary file added inst/examples/testocr.png
Binary file added man/figures/bowers.jpg
Binary file added man/figures/chinese.png
Binary file added man/figures/polytonicgreek.png
Binary file added man/figures/receipt.png
Binary file added man/figures/tealbook.png
Binary file added man/figures/testocr.png
11 changes: 6 additions & 5 deletions man/ocr.Rd

3 changes: 2 additions & 1 deletion man/tessdata.Rd

29 changes: 29 additions & 0 deletions man/tesseract-package.Rd

2 changes: 1 addition & 1 deletion man/tesseract.Rd
