DRILL-8028: Add PDF Format Plugin #2359
Conversation
@cgivre Thanks for the new format plugin. Is it ready for review?
This should be ready for review :-)
Cool contribution. I'm not entirely convinced this is something that Drill should handle. There are too many variables for the very limited controls which Drill provides. It is likely that this will work for one or two limited use cases, but not the vast majority of PDF files. Using JSON plugin config files to specify the mapping is awkward. Probably each file will need its own config, which is not scalable.

Drill's fundamental design is to run at scale. It is highly unlikely that someone will use PDF files to store GB of data. If they do, they have problems bigger than Drill can help them solve. Thus, this kind of plugin works only at the small scale: one or two files in, say, an embedded Drillbit with JDBC or SQLLine.

A better choice would be to wrap this thing in a script: tinker with the PDF extraction, using whatever tools are available, to get the right mapping. Then, wrap that in a script that produces, say, a CSV format to stdout. Drill can then read that input. Such an approach enables all manner of ad-hoc, small-scale data extraction.

Or, maybe Drill should offer a "desktop edition" that is designed for small, ad-hoc projects based on local files, with some way to handle all the tinkering needed when reading PDF files, images, Word files, spreadsheets, Twitter feeds, Slack posts, and other formats popular with data scientists. Such features would not normally be part of the massive-scale deployments for which Drill is designed. Thoughts?
@paul-rogers Thanks for the feedback. I was on the fence about submitting this as well. I was kind of doing a "branch purge" and wanted to contribute some experiments I had been poking at. Let me first give you the backstory...

Several years ago, I worked on an analytics project which, unfortunately, we had to scrap because the data we received was in PDF format. While there are Python libraries (and probably others) that can parse PDFs, many have C dependencies which we could not build in our environment. This project weighed on me because, had we been able to actually read the data, we could have solved a mission-critical issue.

Fast forward a few years: I got into Drill, learned how to write format plugins, and stumbled on the Tabula library, which is built for the purpose of extracting and parsing tables from PDF files. I wanted to see if Drill could read these and, if so, how well it did at extracting the tables and getting usable data from them. The actual result was that it worked surprisingly well. The extractors here do a very good job of identifying tables and extracting them into a tabular structure. OK... see inline for the rest.
I wasn't so sure either. However, we do already support file types which are not used for big data processing, such as ESRI shape files and a few other random formats. If you are viewing Drill solely as a query engine for big data, then yes, this makes no sense. If you are viewing Drill as a Swiss Army knife for data, then it makes more sense.
Let me clarify that this is only for PDF files which contain tables (either images that look like tables or actual text tables). By design, this will not work with all PDF files.
I'm not sure what you're getting at here with the JSON config. Could you be mixing this up with another recent PR which did add some JSON config options to the HTTP/REST storage plugin? In any event, this format reader only has two config variables: one which merges all the tables in the document together (in the case of a table which spans multiple pages), and the other which specifies the index of the table to parse in the event of a document with multiple tables that are not related.
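For concreteness, a format config using those two options might look roughly like this (option names as they appear later in this PR; the values shown are illustrative, not authoritative defaults):

```json
"pdf": {
  "type": "pdf",
  "extensions": ["pdf"],
  "combinePages": false,
  "defaultTableIndex": 0
}
```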
True... The issue I ran into was that it was surprisingly difficult to get the libraries that can parse PDFs to run.
I actually really do like the idea of a mini-Drill for smaller use cases, but without all the overhead and complexity. With that said, I don't really see any harm in including this. Clearly this is not Drill's primary use case, but it can be really valuable. I was actually helping my wife the other day with some data for her work. In that situation, a state government agency sent her data in PDF format which we had to enter into a website. Using Drill, we were able to query the file and transform it into something actually useful.
@cgivre @paul-rogers, my 2c. I guess some partial precedents for a format plugin like this are ones like format-image and format-esri (as noted), though those do only go after the explicitly structured content. It would not surprise me if there are quite often cases of unfortunate people needing to scrape data out of 10^0 - 10^5 PDFs.

The purist in me agrees with Paul's thought: a Groovy script over whatever Java lib is used here could be employed instead. That would not automatically be parallelised like a Drill query is, so a user with many PDFs might be pushed all the way through GNU parallel to Spark. All that is followed by this recurring thought: if Drill is disciplined and focussed only on SQL against standard big data formats, then it finds itself trying to compete in an uncomfortably crowded space. Probably fatally crowded.

I do also note that the "big" and "small" data worlds are not disjoint. I have in practice joined big data in Parquet with small reference data in Excel (actually EtherCalc). Even in the big data regime, reference data remains small and is maintained by humans in human formats rather than pumped out by machines in machine formats. This is getting long again. Last thoughts.
@cgivre, @dzamo raise good points. So, what is Drill today? Is it still primarily used for distributed queries at scale? Or, as a handy desktop tool for data scientists? Probably both. The problem is, a tool that tries to be both a dessert topping and a floor wax (let's see how old the readers are with this one) ends up being good at neither.

One approach, if we had resources, would be to create a Drill Desktop that is optimized for that case and encourages all kinds of specialized data connectors. Create an easy way to define those connectors (YAML files created by specialized web apps?). Ensure Drill has good integration with Jupyter and the other usual suspects.

Another approach, if we had resources, is the oft-discussed idea of separating the less-common plugins from the Drill core. Work started on this: to create an extension mechanism that made this possible. (Today, most plugins need quite a bit of Drill internal code.)

So, no harm in adding the PDF reader, but I expect usage will be pretty limited just because, for the folks that need it, configuration will be too hard. Better would be a Python or Spark job that extracts the data into a CSV file, then querying the CSV file with Drill. Each step could be debugged easily. I can't imagine anyone will want to debug their PDF extraction using Drill's overly generous Java stack traces...
@paul-rogers you got me with this idiom, but I like it! The broader topic is super interesting. If SQLite started adding the features needed to compete with Oracle Database 21c, it would quickly fail at being SQLite. If Linux tried to be an OS kernel for both TVs and supercomputers it would... continue to dominate both extremes! There are some twists here!

Pigeonholing formats into small and large scales is also a tricky business. For example, we naturally want to declare PDF a desktop format, but I can easily imagine a conversation like the following.

"Hey Bob, remember that we sent decades of paper archives from the basement out to that big scanning centre for digitisation? They've come back as millions of pages of PDFs. Someone just asked me if we can help them find all invoices containing a particular SKU, and pull out the price on that line. The ERP system only has the last 10 years loaded into it and they want to go back further."

"Chuck 'em in HDFS, we'll run a Drill query."

"But PDF is a desktop publishing format, not a big data format! Surely our big data cluster will want nothing to do with it!?"

"Drill's got a plugin architecture which led to people adding support for all sorts of weird and wonderful formats. Querying PDFs is a dubious business, but we'll know after ~10 lines of SQL if we can do this with Drill or not. If not, miserable days or weeks of programming with a PDF library await one of our interns."
@dzamo, the reference was to an early Saturday Night Live skit. Just to refocus the discussion, my question is really around configuration. When querying CSV, Parquet and the like, it is very clear where the data lies. When querying Excel, PDF, HTML, Word, "the web", it is less clear: there is some amount of data mining needed to say that it is THIS table and not THAT one, that the table spans three pages, etc. The question is, how does the user specify this?

If it were me, I would not want to be tinkering with a JSON storage plugin config, watching my query fail with a Drill stack trace, or wondering why I got no data. Instead, I'd want a tool better suited for the task. Once I had that, if I then wanted to run at scale, I'd want to say, "Drill, just use X and consume the data." So, the question here is: is the JSON storage plugin config an effective way for a data scientist to mine PDF, Excel, HTML and other messy sources? I don't have an answer, I'm just asking the question.

Again, if we had the ability to have true external plugins, then this could easily be an add-on project. Those who know PDF could go hog wild creating a good solution. But, we don't, so every specialized plugin has to be part of core Drill. Is that good? That's also a question for debate.
@paul-rogers, right, okay, it's an expressiveness thing here rather than a scale thing. The expressiveness of Drill SQL ∪ Drill format config JSON falls well short of that of a general-purpose scripting language, and for reading fiddly unstructured data that shortfall might rapidly become uncomfortable. The format config for this particular plugin looks quite succinct: the plugin will either automagically get your data out, or it won't, and then you need to pack up and go and open the interpreter of your favourite scripting language. Making your resulting script scale to millions of pages, if that's needed, is left to the student. I quite like the Ray project for Python myself.

This thread has triggered some thoughts. If we find ourselves starting to write long essays of JSON in format configs, then we should probably be concerned. If we find ourselves trying to embed a miniature data processing DSL into format config JSON, then we need to stop moving immediately and pray to the ancestors that we might be shown a path that will return us from the wilderness.

I want to revisit the draft fixed-width format plugin with these ideas in mind. Its config allows setting names and types for columns, but for other formats we must do this in SQL. I think we should only ever do this in SQL.

I think we can do something on the packaging front. These format plugins live under contrib/ in the source tree and are compiled to their own jar files. If we simply change the final tarball-building stage of our Maven build to give us something like the following on our download page, would we not be in reasonable shape?
P.S. We'd be persisting with a monolithic Git repo containing multiple "projects" here, but I personally don't mind monorepos.
@cgivre Instead of generating
It's possible to do that, but a few things:
@dzamo @paul-rogers, these are all interesting points. I'd like to focus on the config variables for a moment. The PDF reader has 2 config variables (`combinePages` and `defaultTableIndex`, described above). IMHO, this flexibility is actually very good for the user, because it allows an administrator to configure global default values that make sense, but also allows a user to make query-time changes so they can access their data. With respect to the fixed-width plugin (#2282), I actually have a different vision of how this can be used, and will post comments there.
This consideration will only apply for
The columns array does allow text files to contain jagged arrays, which is perhaps valuable? I don't see a huge saving in typing
My fear, and I don't know if this is real or not, is that if we proliferate plugin-specific quirks in how the schema is represented to the user, then the promise of a standard SQL interface to the data gets tainted, and developers and users will be put off. "It's standard except every plugin presents its data the way that author prefers." I don't feel all that strongly about
The bigger implication of having a
@cgivre Oh right, that makes sense. So should we put in support for both?

@paul-rogers with apologies to this PR for saddling it with so much broader design chat, I wanted to share with you a last set of findings from talking with others, and finally ask if we might go over a couple of questions, away from this PR.
I would love to consult with you on 2 and 3 in a sort of "Very well, if you must do a distribution of Drill with this long tail of formats, storage systems and UDFs in it, then at least equip yourselves with the following practices" chat. Perhaps in the upcoming community meetup, otherwise outside it (if it's of any interest on your end, of course). Thanks
@cgivre This is a good solution for querying the table data in PDF files. Please take note of my comments.
```java
@Override
public boolean open(FileScanFramework.FileSchemaNegotiator negotiator) {
  System.setProperty("java.awt.headless", "true");
```
This property has been set many times; maybe it can be done like this:

```java
if (System.getProperty("java.awt.headless") == null) {
  System.setProperty("java.awt.headless", "true");
}
```

And remove the duplicate calls in `Utils.java`, line 60, and `PdfBatchReader.java`, line 164?
Oh, my. Drill can run embedded. Suppose I'm running Drill embedded in my AWT application. Will this break things?
Drill is multi-threaded: we can have a dozen readers all setting this same system property concurrently with other code reading it. Is this a safe thing to do?
More fundamentally, should we be mucking about with shared system properties in a reader? Probably not.
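One hedged alternative, sketched below (an illustration, not a claim about what the plugin should do): set the property at most once per JVM, under a class-initialization guard, so concurrent readers do not race:

```java
// Sketch: set the property once per JVM, and only if the embedding
// application has not already chosen a value of its own.
public final class HeadlessGuard {
  private HeadlessGuard() {}

  static {
    // Runs once when the class is loaded; class loading is thread-safe.
    if (System.getProperty("java.awt.headless") == null) {
      System.setProperty("java.awt.headless", "true");
    }
  }

  // Touch this class to trigger the static initializer.
  public static void ensureHeadless() {}
}
```

This still mutates shared JVM state, which is the deeper objection above; a JVM flag (`-Djava.awt.headless=true`) avoids that entirely.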
```java
// Check to see if the limit has been reached
if (rowWriter.limitReached(maxRecords)) {
  return false;
} else if (currentRowIndex >= currentTable.getRows().size() &&
```
Is this the logic to handle a table that spans multiple pages?
If so, I'm a bit confused by the comparison of row count and table count. Imagine each table has 100K rows. It would have to split across Drill batches. And, if each table has 3 rows, all tables should be combined in a single Drill batch.
I cleaned this up by wrapping this in iterator classes. This logic was here to support the page merge feature. Basically... it works like this... a `Page` can contain 0 or more `Tables`. If you take a look at the included file `schools.pdf`, you'll see it is a 5-page spreadsheet. The problem is that if you wanted to actually query this data, you'd want the whole thing, not an individual page. So, what Drill can do is merge all the Pages (each containing 1 table) together.
You would only want to use this feature in the case of multi-page spreadsheet PDFs. The logic therefore is (a code sketch follows the list):
- Read the current table until there are no more rows.
- Get the next table (if there is one).
- Read it until there are no more rows.
- Go back to step 2.
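A rough sketch of that loop as an iterator, using a hypothetical stand-in `Table` type rather than the plugin's actual classes (the real reader drives Tabula's table objects through EVF row writers):

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Stand-in for a parsed table; hypothetical, not the plugin's real type.
interface Table {
  List<List<String>> getRows();
}

// Walks the rows of every table in document order, so a table that was
// split across pages reads back as one logical table.
class MergedRowIterator implements Iterator<List<String>> {
  private final Iterator<Table> tables;
  private Iterator<List<String>> currentRows = Collections.emptyIterator();

  MergedRowIterator(List<Table> documentTables) {
    this.tables = documentTables.iterator();
  }

  @Override
  public boolean hasNext() {
    // Drain the current table; when it is empty, advance to the next one.
    while (!currentRows.hasNext() && tables.hasNext()) {
      currentRows = tables.next().getRows().iterator();
    }
    return currentRows.hasNext();
  }

  @Override
  public List<String> next() {
    return currentRows.next();
  }
}
```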
@dzamo, @cgivre: we should certainly encourage people to add plugins (connectors) to everything under the sun. The question really is: does the rather limited Drill core team want to support all of them at the same level as the main ones that Just Gotta Work? For huge files. On local, S3, HDFS. With splits. With metadata. With excellent error reporting. And on and on.

The PDF thing seems like the prototypical "not in the core, but a great add-on" example. Yes, someone might want to scan zillions of PDF files on an HDFS system. But the core Drill team can't spend its resources on that when there are still basics to address. Let's do this: let's get this into Drill for now. Then:

This way, even without a marketplace, folks can build their own connectors separate from the Drill repo. A next step might be to split Drill's monorepo into multiple parts, as do many other projects. The current repo would be the core. Then, the contrib stuff could move to its own repo(s) under the Drill project. And, users can create their own extensions in their own repos.
Multiple comments: some architectural, some nits.
There are many example PDF files in the test directory. It is not clear from their names what they are used for. Maybe add a "description.txt" or some such to explain how we use each for testing. That is:
- china.pdf: Verifies that Unicode (Chinese) characters are read correctly.
- ...
### Merging Pages
The PDF reader reads tables from PDF files on each page. If your PDF file has tables that span multiple pages, you can set the `combinePages` parameter to `true` and Drill will merge all the tables in the PDF file. You can also do this at query time with the `table()` function.
Suppose my file has three, mutually incompatible tables, and I only want to read the first. Can I? If so, how?
Or, do I read all of them (sales of apples by state, shelf life of various apple kinds, list of largest apple growers) into a big, messy, combined table, then use a WHERE clause to try to keep just the shelf life info?
The way I envisioned this working was for a user to use the `table()` function and specify the table index at query time. That way they can query as many incompatible tables as they want and not have to read the ones they don't.
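For example, something like the following (the file name is hypothetical; `defaultTableIndex` is the format option discussed above):

```sql
-- Query only the second table in the document (indexes start at 0).
SELECT *
FROM table(dfs.`pdf/report.pdf` (type => 'pdf', defaultTableIndex => 1))
```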
contrib/format-pdf/README.md
The available options are:
* `extractHeaders`: Extracts the first row of any tables as the header row. If set to false, Drill will assign column names of `field_0`, `field_1` to each column. |
Explain a bit more. I have two tables with titles "a | b", and "c | d | e". Do I get a union row with "a, b, c, d, e" as fields? Or, just "a, b" as fields? When I read the second table, do I then get "a, b, e"?
This is on a per-table basis. Really it works the same way the CSV reader would work if you tried to query multiple files in a directory.
In the example you cited, you'd get a table with columns a | b | c | d | e. The rows from table 1 would include `null` entries for the columns c | d | e, and the rows from table 2 would contain nulls for columns a | b.
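A hypothetical illustration of that behaviour (file name and values invented):

```sql
-- two_tables.pdf contains table 1 with headers a | b and
-- table 2 with headers c | d | e; combinePages merges them.
SELECT * FROM table(dfs.`pdf/two_tables.pdf` (type => 'pdf', combinePages => true));
-- a    | b    | c    | d    | e
-- x1   | y1   | null | null | null   <- row from table 1
-- null | null | p1   | q1   | r1     <- row from table 2
```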
```java
List<String> values = new ArrayList<>();

if (firstRow != null) {
  for (int i =0; i < firstRow.size(); i++) {
```
Nit: space after `=`.
What happens if the rows have different sizes? Is that possible if we guessed wrong about the table? If the table merges cells? How should we handle such issues? Do we have tests for them?
We didn't test for all those use cases, but the underlying library (Tabula) actually has a pretty robust collection of tests that do cover a lot of what you're asking about.
In my manual testing, what I found was that if you had inconsistent tables, Tabula would basically produce some sort of mapping to columns and rows that Drill could read. Not ideal, but neither is trying to extract data from PDFs.
Copy that @paul-rogers, I have incorporated your advice above into the Drill 2.0 wiki page. I am hopeful that after 1.20 is out the door we can split out the new branches, and repos, for 2.0 (but it is of course the project elders that decide such things). In my youthful exuberance I imagine it starting early next year! I would think that getting the source code reorganised in a way that will carry us into the future should probably be the very first Drill 2.0 activity to take place. Surely we would want our most experienced contributors to define such a task.
Dear PR author and reviewers. This is a generic message to say that we would like to merge this PR in time for the 1.20 release. Currently we're targeting a master branch freeze date of 2021-12-10 (10 Dec). Please strive to complete development and review by this time, or indicate that the PR will need more time (and how much). Thank you.
Thanks @pjfanning! Unfortunately, that didn't solve the issue with opening up the Java window. I emailed the PDFBox user list and hopefully they can give me some suggestions.
@paul-rogers When Drill runs the unit tests, we actually have the `java.awt.headless` property set. I believe that I've addressed all your other comments as well. Thank you as always for the thorough review! It's ready for another look.
Still not convinced that this should be part of the core Drill code: reading PDF files pushes Drill in the direction of a small-scale, query-anything tool rather than a large-scale distributed query tool. Folks won't, at the same time, run 100 Drill nodes and muck with PDF files: PDF is a horrible format to use for data at scale.
Further, I'm not convinced that SQL and JSON are the best formats for working out how to parse a PDF file. If the user is naive enough to use PDF as their input format, they're not going to be savvy enough to figure out the ad-hoc, clunky way they have to configure stuff using SQL and JSON.
That said, that conversation was already had, and we're proceeding anyway. Given that, the code looks good. So, +1.
```java
while(!rowWriter.isFull()) {
  // Check to see if the limit has been reached
  if (rowWriter.limitReached(maxRecords)) {
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd forgotten this comment. Limit was added to the EVF framework in the pending EVF V2 PR.
DRILL-8028: Add PDF Format Plugin
Description
This PR adds the capability to query data contained in PDF tables.
Documentation
One of the most annoying tasks is when you are working on a data science project and you get data that is in a PDF file. This plugin endeavours to enable you to query data in
PDF tables using Drill's SQL interface.
Data Model
Since PDF files generally are not intended to be queried or read by machines, mapping the data to tables and rows is not a perfect process. The PDF reader does support a provided schema.
Merging Pages
The PDF reader reads tables from PDF files on each page. If your PDF file has tables that span multiple pages, you can set the `combinePages` parameter to `true` and Drill will merge all the tables in the PDF file. You can also do this at query time with the `table()` function.
Configuration
To configure the PDF reader, simply add the information below to the `formats` section of a file-based storage plugin (a sample configuration is sketched after the option list). The available options are:
* `extractHeaders`: Extracts the first row of any tables as the header row. If set to false, Drill will assign column names of `field_0`, `field_1`, etc. to each column.
* `combinePages`: Merges multi-page tables together.
* `defaultTableIndex`: Allows you to query different tables within the PDF file. Index starts at `0`.
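As a sketch, the configuration might look like the following (the values shown are illustrative, not necessarily the shipped defaults):

```json
"pdf": {
  "type": "pdf",
  "extensions": ["pdf"],
  "extractHeaders": true,
  "combinePages": false,
  "defaultTableIndex": 0
}
```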
Accessing Document Metadata Fields
PDF files have a considerable amount of metadata which can be useful for analysis. Drill will extract the following fields from every PDF file. Note that these fields are not projected in star queries and must be selected explicitly. The document's creator populates these fields, and some or all may be empty. With the exception of `_page_count`, which is an `INT`, and the two date fields, all the other fields are `VARCHAR` fields.
The fields are:
* `_page_count`
* `_author`
* `_title`
* `_keywords`
* `_creator`
* `_producer`
* `_creation_date`
* `_modification_date`
* `_trapped`
* `_table_count`
The query below will access a document's metadata:
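For example (the file path is illustrative):

```sql
SELECT _page_count, _author, _title, _creation_date, _table_count
FROM dfs.`pdf/sample.pdf`
```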
The query below demonstrates how to define a schema at query time:
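For example (the column names and path are hypothetical), using Drill's inline schema syntax together with the format options:

```sql
SELECT * FROM table(dfs.`pdf/sample.pdf`
    (type => 'pdf',
     combinePages => true,
     schema => 'inline=(`col1` VARCHAR, `col2` INT)'))
```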
Testing
Added unit tests