Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tabulizer: Bindings for Tabula PDF Table Extractor Library #42

Closed
9 tasks done
leeper opened this issue May 7, 2016 · 39 comments
Closed
9 tasks done

tabulizer: Bindings for Tabula PDF Table Extractor Library #42

leeper opened this issue May 7, 2016 · 39 comments

Comments

@leeper
Copy link
Member

leeper commented May 7, 2016

    1. What does this package do? This page wraps the Tabula Java library, which can (very accurately) extract tables from PDF documents. It also implements some lower level utilities for working with PDF documents (metadata and text extraction, image conversion, split/merge). It should be useful for extracting scientific data, especially tabular data, from PDFs, such as from scientific articles or agency reports.
    1. Paste the full DESCRIPTION file inside a code block below.
Package: tabulizer
Type: Package
Title: Bindings for Tabula PDF Table Extractor Library
Version: 0.1.11
Date: 2016-05-07
Authors@R: c(person("Thomas J.", "Leeper", role = c("aut", "cre"),
                    email = "[email protected]"))
Maintainer: Thomas J. Leeper <[email protected]>
Description: Bindings for the Tabula <http://tabula.technology/> java library, which can extract tables from PDF documents.
License: MIT + file LICENSE
URL: https://github.com/leeper/tabulizer
BugReports: https://github.com/leeper/tabulizer/issues
Imports:
    graphics,
    grDevices,
    utils,
    tools,
    tabulizerjars,
    rJava,
    png
Suggests:
    testthat,
    knitr
RoxygenNote: 5.0.1

    1. URL for the package (the development repository, not a stylized html page) https://github.com/leeper/tabulizer
    1. What data source(s) does it work with (if applicable)? n/a
    1. Who is the target audience? Data scientists stuck with other peoples' data
    1. Are there other R packages that accomplish the same thing? If so, what is different about yours? The closest thing will be pdftools, which is a libpoppler wrapper. tabulizer has some overlap but the core functionality - table extraction - is not supported by pdftools.
    1. Check the box next to each policy below, confirming that you agree. These are mandatory.
    • This package does not violate the Terms of Service of any service it interacts with.
    • The repository has continuous integration with Travis CI and/or another service
    • The package contains a vignette
    • The package contains a reasonably complete README with devtools install instructions
    • The package contains unit tests
    • The package only exports functions to the NAMESPACE that are intended for end users
    1. Do you agree to follow the rOpenSci packaging guidelines? These aren't mandatory, but we strongly suggest you follow them. If you disagree with anything, please explain.
    • [] Are there any package dependencies not on CRAN? The package has some files in a dependent package, which can be put on CRAN once everything is ready to release.
    • Do you intend for this package to go on CRAN?
    • Does the package have a CRAN accepted license?
    • Did devtools::check() produce any errors or warnings? If so paste them below.
    1. Please add explanations below for any exceptions to the above:
    1. If this is a resubmission following rejection, please explain the change in circumstances.
@sckott
Copy link
Contributor

sckott commented May 9, 2016

Thanks for your submission @leeper Seeking reviewers now

@sckott
Copy link
Contributor

sckott commented May 10, 2016

Reviewers: @lmullen, @davidgohel
Due: 2016-06-01

@davidgohel
Copy link

davidgohel commented May 24, 2016

This package is a very useful new package. It is also well written ; code is simple and clear, standards are respected. It was easy and fun to dive into the code. (Also I realized I could simplify some of my old
codes while reading yours!). 👏

This work is a very good reference for package developpement with rJava.

I have tested it on:

  • mac OS (R version 3.2.4)
  • Windows (R version 3.2.3 32 bits) - locale is french Cp1252

Everything worked as expected. Installation is simple and the vignette is clear. I enjoy the data.frame method, it worked well on relevant documents.

There is nothing to complain about in the code. However, I made efforts to find things to say ;)


Headless Mode

I found out one issue that can be easily solved. Headless Mode: it should be set to true by default (this can be an issue with RStudio server for example). If not, pdfbox might not work on some servers. See this article.

In .onLoad function, this can easily setup with

J('java.lang.System')$setProperty('java.awt.headless', 'true')

Encoding issue with Windows

I met one annoying issue. I made the french test, trying to use tabulizer with a document full of terrible accents. Everything went fine on mac, but I met encoding issues with Windows.

Note : I tried to set encoding parameters but it did not solve the issue.

J('java.lang.System')$setProperty('file.encoding', 'UTF-8')

It seems the problem come from pdfbox and the way encodings are managed (not sure about that).

To demo the issue:

  1. Save manually the document from http://www.sfds.asso.fr/ressource.php?fct=ddoc&i=2090 to ZOOM_METIERS_MATHS_WEB_144dpi3.pdf in your working directory.
  2. Run
f <- "ZOOM_METIERS_MATHS_WEB_144dpi3.pdf"
extract_text(f, pages = 2)

Below a partial copy of the result :

[1] "ALAURÉAT\r\nANNÉE\r\n1\r\nANNÉE\r\n2\r\nANNÉE\r\n6\r\n3e année\r\nANNÉE\r\n8\r\nANNÉE\r\n5\r\nÉCOLE\r\nD'INGÉNIEURS\r\n2e année\r\nANNÉE\r\n7\r\nÉCOLE\r\nD'INGÉNIEURS\r\nANNÉE\r\n3\r\nANNÉE\r\n4\r\nLICENCE PRO\r\nÉCOLE\r\nD'INGÉNIEURS\r\nClasse prépa2e année\r\nClasse prépa1re année\r\nCycle \r\nintégré\r\n1re année\r\nCycle \r\nintégré\r\n2e année\r\nL3\r\nL2\r\nL1\r\n1re année\r\nM2\r\nM1\r\nLICENCE\r\nMASTER\r\nDOCTORAT\r\nDIPLÔME\r\nD'ING

Below a partial copy of the result on my mac :

[1] "ALAURÉAT\nANNÉE\n1\nANNÉE\n2\nANNÉE\n6\n3e année\nANNÉE\n8\nANNÉE\n5\nÉCOLE\nD'INGÉNIEURS\n2e année\nANNÉE\n7\nÉCOLE\nD'INGÉNIEURS\nANNÉE\n3\nANNÉE\n4\nLICENCE PRO\nÉCOLE\nD'INGÉNIEURS\nClasse prépa2e année\nClasse prépa1re année\nCycle \nintégré\n1re année\nCycle \nintégré\n2e année\nL3\nL2\nL1\n1re année\nM2\nM1\nLICENCE\nMASTER\nDOCTORAT\nDIPLÔME\nD'ING

Minor issues / suggestions

DESCRIPTION file

I think R-Core team could ask you to add authors of the java packages as contributor and copyright holder (that was my case for package ReporteRsjars) in package tabulizerjars. To prevent this issue when submitting to CRAN, I would add Manuel Aristarán in field Authors@R of the DESCRIPTION file as:

person("Manuel", "Aristarán", role = c("ctb", "cph"), comment = "tabula java library")

file path with ~

It seems file path containing ~ are not being expanded (occurred on mac os)

Verbose logging

As a user, I would prefer not having the logger messages. If it can help, it can be switch off :

Solution 1 : Desable ALL loggers

J("java.util.logging.LogManager")$getLogManager()$reset()

Solution 2 : Desable only pdfbox loggers

BUT it would require org.apache.log4j in the classpath.

logManager <- rJava::J("java.util.logging.LogManager")$getLogManager()

loggers <- c("org.apache.pdfbox.util.PDFStreamEngine",
  "org.apache.pdfbox.pdmodel.font.PDSimpleFont",
  "org.apache.pdfbox.pdmodel.font.PDFont",
  "org.apache.pdfbox.pdmodel.font.FontManager",
  "org.apache.pdfbox.pdfparser.PDFObjectStreamParser")

for( logger_id in loggers ){
  logger <- logManager$getLogger(logger_id)
  logger$setLevel(J("org.apache.log4j.Level.OFF")) 
}

@leeper
Copy link
Member Author

leeper commented May 25, 2016

Thank you, @davidgohel! I have now implemented most of this, but am still working on the non-latin character issue and finalizing the logging functionality.

@sckott
Copy link
Contributor

sckott commented May 31, 2016

@lmullen @davidgohel
Due: 2016-06-01 - hey there, it's been 21 days, please get your review in soon, thanks 😺 (ropensci-bot)

@leeper
Copy link
Member Author

leeper commented May 31, 2016

Following up on the first review, I believe I have now addressed all issues:

  1. rJava is configured to run in headless mode on package load.
  2. I have added an encoding argument to extract_text() and extract_tables() that allows the specification of text encodings. I'm not 100% confident in this, but it seems to work.
  3. I've updated authorship in DESCRIPTION of tabulizerjars to reflect the java library's authorship. I've also added @davidgohel's contributions as a reviewer.
  4. I use path.expand() for local file paths.
  5. I have added a stop_logging() function to turn off all Java logging. This seemed easier than adding a log4j dependency, but may not be ideal, so logging is left on by default.

Thank you again for the feedback!

@lmullen
Copy link
Member

lmullen commented Jun 1, 2016

This is a very well done package. The code is a pleasure to read, and the interface is well thought out. The core functionality works as expected.

This package is so useful that I would really love it if this package worked on PDFs of 19th-century documents: alas, it's not to be. (Hardly the fault of the package, of course.)

@davidgohel has already offered a number of suggestions which have been dealt with, so I'll deal mostly with unexpected errors when using the interface.

Tests

I've run R CMD check on this package with R 3.3.0 on Mac OS X 10.11.5 and R 3.3.0 on Ubuntu 16.04. (The rest of my review was on the Mac. I don't have a Windows machine to test that installation, but the instructions in the README seemed clear.)

  • The Travis integration on the master branch is failing.
  • I get two test failures when building and running R CMD check (on both Mac and Ubuntu):
  OK: 73 SKIPPED: 0 FAILED: 2
  1. Failure: Read French language PDF w/correct encoding (@test_non-latin.R#17) 
  2. Failure: Read French language PDF w/correct encoding (@test_non-latin.R#18) 

I gather that this is part of the issue which @davidgohel mentioned which is being worked out.

Interface

For testing I used a PDF form the U.S. Census's Statistical Abstract

Some minor questions:

  • Is there are reason why get_n_pages() takes an option doc = parameter first instead of the file = parameter, like most other functions in this package?
  • Can split_pdf() take an argument like outdir = to specify where the split files should go? Also, it would be nice if the numbers in the new file names were zero padded. More importantly, split_pdf() overwrites files in the directory without warning. The function needs some way of requiring the user's permission before overwriting files, especially since the function doesn't let the user specify the file name or path.
  • When using merge_pdfs() I get a warning message:
Warning message:
In if (grepl("^http.*://", path)) { :
  the condition has length > 1 and only the first element will be used

This comes from tabulizer:::localize_file(). The if() condition there probably needs to be if(any(CONDITION)). I suppose a user could pass the function a mix of remote and local PDFs so that the any() function would be insufficient, but that seems far-fetched.

  • I get an error with a traceback when using the option spreadsheet = TRUE to extract_tables().
 Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
  java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 

If this isn't a problem with the underlying Java code, then this error message should let the user know why the spreadsheet option shouldn't be used.

  • Same comments for make_thumbnails() as for split_pdf() above.

After playing with it for a while, I was able to figure out the interface to extract_areas(). Would it be possible to make two improvements to that user interface? First, instead of clicking and the upper left and lower right of the area, can the user draw a bounding box? Second, can the bounding box be adjusted before the function returns?

Miscellaneous

  • The coding style and documentation are very well done.
  • I found it a little odd that the README had the installation instructions near the bottom, since it goes against the convention. Can you move the installation instructions up (maybe leaving the Windows Java stuff at the bottom)? See the README section of the packaging guide.
  • The package has clear contributing guidelines. It would be good to also have the code of conduct as mentioned in the rOpenSci policies.
  • The package has a good NEWS file. But it also needs to use tags for all the point releases. I don't see any point in going back to tag the previous versions, but it should be tagged starting with acceptance at rOpenSci.
  • My two cents: I'd much prefer if the Java logging was off by default. I don't think it tells most users anything that they can actually use.

Nice work, and I look forward to using the package myself.

@leeper
Copy link
Member Author

leeper commented Jun 1, 2016

Thank you, @lmullen! I will work to address these points and post an update once I have finished.

@leeper
Copy link
Member Author

leeper commented Jun 1, 2016

I have attempted to address all of your comments and have made the following revisions:

  1. I have stopped the failing encoding tests (but have reopened an issue to get this to work better cross-platform).
  2. I have reordered the file and doc arguments to get_n_pages() for consistency and given get_page_dims() a doc argument, as well. I figure these may be useful for debugging.
  3. split_pdf() and make_thumbnails() gain an outdir argument. In both cases, output files are numbered using zero padding, as requested.
  4. I have fixed the merge_pdfs() warnings by avoiding passing a vector of filenames through localize_file().
  5. The error related to the spreadsheet = TRUE argument has been corrected and was quite critical, actually, as it related to the attempt to coerce an empty Java array to an R matrix.
  6. I have turned Java logging off by default and added some notes to the documentation to describe how to restore logging if that were, for some reason, preferred.
  7. Regarding the behavior of extract_areas(), I agree this is less than ideal but the current implementation isn't actually as strict as the documentation implies (you can tag upper-left/lower-right or upper-right/lower-left). I have an open issue on improving this, but I think there are some fundamental limitations in locator() and graphics devices more generally for creating an intuitive interface since it is not really possible to dynamically highlight/select an area of a plot (as far as I know). I am open to suggestions on this.

On smaller points:

  1. Regarding installation instructions, I think we're getting to the point where GitHub-based installation is sufficiently common that it doesn't need to be at the top of the README (since that seems to be a legacy of earlier times). I would actually suggest the general style guide be changed, accordingly. But I don't feel strongly about this, and can update if preferred.
  2. I read the code of conduct provision as only a recommendation, so I have not added this.
  3. My release workflow is to only tag "minor" not "patch" versions as GitHub releases and generally to only release to CRAN when minor ticks up, but I can adapt this to a different style if required.

Thank you, again, @davidgohel and @lmullen, for your feedback! It is very much appreciated.

@lmullen
Copy link
Member

lmullen commented Jun 1, 2016

This all sounds good to me, @leeper. My only suggestion is that tags are cheap so it doesn't hurt to add them.

@sckott and @karthik and @cboettig: I recommend accepting this package to rOpenSci.

@sckott
Copy link
Contributor

sckott commented Jun 1, 2016

thanks so so much for your reviews @davidgohel and @lmullen !

having a quick look myself ...

@sckott
Copy link
Contributor

sckott commented Jun 1, 2016

a few things:

  • before everything's on CRAN...seems like using Remotes in DESCRIPTION for tabulizerjars would make it so that would automatically install for users
  • After installing ``tabulizerjars\- Installation failed still, but after reinstalling rJava` from source from CRAN, then it worked
  • Getting errors on extract_areas and locate_areas, same error in each case:
#> Error in (function (title, width, height, pointsize, family, antialias,  :
#>   unable to create quartz() device target, given type may not be supported
#> In addition: Warning message:
#> In (function (title, width, height, pointsize, family, antialias,  :
#>   No displays are available

I guess lincoln and david didn't get this error though, so maybe i'm just an edge case

@lmullen
Copy link
Member

lmullen commented Jun 2, 2016

@sckott Are you using the version of R downloadable from CRAN? I'm using the version from Homebrew. I wonder if that makes the difference about the Quartz error. I know at one point or another I've installed XQuartz because of the Homebrew version.

@sckott
Copy link
Contributor

sckott commented Jun 2, 2016

yeah, from CRAN - I imagine we'd want this to work regardless of where R was installed from?

@leeper
Copy link
Member Author

leeper commented Jun 2, 2016

Thanks, @sckott. I'm working on these issues now.

@leeper
Copy link
Member Author

leeper commented Jun 3, 2016

I've just pushed an update to the extract_areas()/locate_areas() functions, which I hope might work better on Mac (not tested) and that offers much more interactive functionality on Windows (tested) and probably Linux (not tested yet) with the caveat that RStudio's graphic device apparently doesn't play well with others. Would any of you be willing to give this a quick run and see how it works for you?

@sckott
Copy link
Contributor

sckott commented Jun 3, 2016

will try soon

@davidgohel
Copy link

davidgohel commented Jun 3, 2016

Hi

I've just tested on mac. There is a first issue in extract_areas (line 96) . It shoud be grDevices::X11(type = "Xlib") and not grDevices::X11(type = "xlib").

But then I got a

Error in try_area(file = paths[i], dims = dims[[i]], area = areas[[i]],  : 
  Graphics device does not support rasterImage() plotting

Here are my dev.cap:

> grDevices::dev.capabilities()
$semiTransparency
[1] FALSE

$transparentBackground
[1] "fully"

$rasterImage
[1] "non-missing"

$capture
[1] TRUE

$locator
[1] TRUE

$events
[1] "MouseDown" "MouseMove" "MouseUp"   "Keybd"    

and my session info:

> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tabulizer_0.1.18

loaded via a namespace (and not attached):
[1] tabulizerjars_0.1.2 tools_3.3.0         rJava_0.9-8         png_0.1-7          

@davidgohel
Copy link

And all is ok when changing line 98 to

if (grDevices::dev.capabilities()[["rasterImage"]] == "no") {

@sckott
Copy link
Contributor

sckott commented Jun 3, 2016

Same problem, getting an error with try_area

@leeper
Copy link
Member Author

leeper commented Jun 8, 2016

Thank you everyone for the help so far. I've made some pretty radical changes to the extract_areas() functionality to try to make sure this works across platforms. The biggest change is that when working in RStudio a shiny widget is provided to handle the extraction directly within RStudio. In other cases, it falls back on the graphics device-based functionality described previously (and hopefully fixed due to @davidgohel's testing) and, if graphics events are unavailable on the device, falls back on using locator(), and in the (hopefully rare case that even that is unavailable), fails entirely.

@sckott
Copy link
Contributor

sckott commented Jun 8, 2016

I still get the same quartz error as above - haven't been able to google any solutions for this, weird

@leeper
Copy link
Member Author

leeper commented Jun 8, 2016

Ah, that's frustrating. You're on OS X, right? What does Sys.info()["sysname"] return?

@sckott
Copy link
Contributor

sckott commented Jun 8, 2016

"Darwin"

@lmullen
Copy link
Member

lmullen commented Jun 9, 2016

👍 👍 👍 for the new Shiny interface.

Any particular reason why you can't use the Shiny interface outside of RStudio and just open the gadget in a browser instead of the Viewer?

@sckott
Copy link
Contributor

sckott commented Jun 9, 2016

use the Shiny interface outside of RStudio and just open the gadget in a browser instead of the Viewer?

yeah, should be possible

@leeper
Copy link
Member Author

leeper commented Jun 9, 2016

Yes, I can do that. I'd like to get the current version working, too, though because I think the graphics device interface is actually even better because it is responsive to keystrokes, so you can navigate through the pages of the file and make changes to area selections. As far as I can tell, the Shiny version can't be configured in that way.

@leeper
Copy link
Member Author

leeper commented Jun 19, 2016

I've now made the shiny interface the default, and added an option widget to control which interface is used for locate_areas() (and thus to extract_areas()). Sorry the delay in getting to it.

@sckott
Copy link
Contributor

sckott commented Jun 21, 2016

weird, now I'm getting the java error error: unable to load shared object '/Library/Frameworks/R.framework/Versions/3.3/Resources/library/rJava/libs/rJava.so': - will try to track that down

@sckott
Copy link
Contributor

sckott commented Aug 10, 2016

@leeper any progress on this?

@leeper
Copy link
Member Author

leeper commented Aug 16, 2016

Well, I think it's basically ready, with the exception of the installation issue on OS X. (I don't have a mac, so I'm not sure what I can really do about it, unfortunately.) I'm experimenting with travis OS X builds https://travis-ci.org/leeper/tabulizer/builds/152634437, which seem mostly to be working?

@sckott
Copy link
Contributor

sckott commented Aug 17, 2016

approved!

ran goopractice on it, not mandatory, but things to consider

It is good practice to
  ✖ omit "Date" in DESCRIPTION. It is not required and it
    gets invalid quite often. A build date will be added to the package
    when you perform `R CMD build` on it.

  ✖ avoid long code lines, it is bad for readability. Also,
    many people prefer editor windows that are about 80 characters
    wide. Try make your lines shorter than 80 characters

    R/extract_tables.R:78:1
    R/extract_tables.R:79:1
    R/extract_tables.R:98:1
    R/locate_area.R:45:1
    R/locate_area.R:56:1
    ... and 52 more lines

  ✖ avoid sapply(), it is not type safe. It might return a
    vector, or a list, depending on the input data. Consider using
    vapply() instead.

    R/split_merge.R:59:13
    R/utils.R:38:5

  ✖ avoid 1:length(...), 1:nrow(...), 1:ncol(...),
    1:NROW(...) and 1:NCOL(...) expressions. They are error prone and
    result 1:0 if the expression on the right hand side is zero. Use
    seq_len() or seq_along() instead.

    R/output.R:65:19
    R/output.R:66:23

@sckott
Copy link
Contributor

sckott commented Aug 17, 2016

just a note about installation on my OSX 10.11.3 - I had to install a legacy version of Java from https://support.apple.com/kb/DL1572?locale=en_US instead of the version from Oracle that I had installed - after that installation works great, anyway ... now google know about this

@sckott
Copy link
Contributor

sckott commented Aug 17, 2016

If you can transfer to ropenscilabs when you get a chance that'd be great - and change dev install instructions of course - If you can't transfer let me know and I'll change permissions

@leeper
Copy link
Member Author

leeper commented Aug 18, 2016

Great! I've transferred the repositories, along with making all of the changes suggested in your goopractice note (except line length, because that's a bit of a pain to fix at the moment but I'll get to it). I also added a note about Java on Mac OS.

@sckott
Copy link
Contributor

sckott commented Aug 18, 2016

Nice, thanks, will tweet about it today

@sckott sckott closed this as completed Aug 18, 2016
@jazzido
Copy link

jazzido commented Aug 18, 2016

Thanks everyone from the @tabulapdf team!

@ajduncanson
Copy link

Sounds great, but I can't get it to install:

In response to ghit::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"))

I get this:

Warning in as.POSIXlt.POSIXct(x, tz) :
unknown timezone 'default/Australia/Sydney'

Warning in as.POSIXlt.POSIXct(x, tz) :
unknown timezone 'default/Australia/Sydney'

ropensci/tabulizerjars ropensci/tabulizer
NA NA
Warning messages:
1: In utils::install.packages(to_install, type = type, repos = repos, :
installation of package ‘tabulizerjars’ had non-zero exit status
2: In utils::install.packages(to_install, type = type, repos = repos, :
installation of package ‘tabulizer’ had non-zero exit status

Why?

@lmullen
Copy link
Member

lmullen commented Dec 23, 2017

@ajduncanson This is not the right place for support. Try the tabulizer repo itself.

That said, the problem is probably not with tabulizer. The time zone warning sounds like the one with a recent version of R on MacOS, and you may have warnings set to errors. I suggest that you search for a solution to the time zone problem, check whether you have warnings set to errors, and if neither of those work, try an tabulizer repo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants