
DRILL-8028: Add PDF Format Plugin #2359

Merged
merged 28 commits into apache:master from format-pdf on Jan 9, 2022
Conversation

cgivre
Contributor

@cgivre cgivre commented Nov 2, 2021

DRILL-8028: Add PDF Format Plugin

Description

This PR adds the capability to query data contained in PDF tables.

Documentation

One of the most annoying tasks in a data science project is receiving data in a PDF file. This plugin endeavours to enable you to query data in PDF tables using Drill's SQL interface.

Data Model

Since PDF files are generally not intended to be queried or read by machines, mapping the data to tables and rows is not a perfect process. The PDF reader does support a provided schema.

Merging Pages

The PDF reader extracts tables from each page of a PDF file. If your PDF file has tables that span multiple pages, you can set the combinePages parameter to true and Drill will merge all the tables in the PDF file. You can also do this at query time with the table() function, as shown below.
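
For example, a minimal sketch of enabling the merge at query time (reusing the schools.pdf sample file referenced later in this PR):

```sql
SELECT *
FROM table(cp.`pdf/schools.pdf` (type => 'pdf', combinePages => true))
```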

Configuration

To configure the PDF reader, simply add the information below to the formats section of a file-based storage plugin.

"pdf": {
  "type": "pdf",
  "extensions": [
    "pdf"
  ],
  "extractHeaders": true,
  "combinePages": false
}

The available options are:

  • extractHeaders: Extracts the first row of any table as the header row. If set to false, Drill will assign column names of field_0, field_1, etc. to each column.
  • combinePages: Merges multi-page tables together.
  • defaultTableIndex: Allows you to query different tables within the PDF file. The index starts at 0; see the example below.
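
For instance, a hedged sketch of reading only the third table in a document (the file name here is illustrative):

```sql
SELECT *
FROM table(dfs.`pdf/report.pdf` (type => 'pdf', defaultTableIndex => 2))
```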

Accessing Document Metadata Fields

PDF files have a considerable amount of metadata which can be useful for analysis. Drill will extract the following fields from every PDF file. Note that these fields are not
projected in star queries and must be selected explicitly. The document's creator populates these fields, and some or all may be empty. With the exception of _page_count, which is an INT, and the two date fields, all the other fields are VARCHAR.

The fields are:

  • _page_count
  • _author
  • _title
  • _subject
  • _keywords
  • _creator
  • _producer
  • _creation_date
  • _modification_date
  • _trapped
  • _table_count

The query below will access a document's metadata:

SELECT _page_count, _title, _author, _subject, 
_keywords, _creator, _producer, _creation_date, 
_modification_date, _trapped 
FROM dfs.`pdf/20.pdf`

The query below demonstrates how to define a schema at query time:

SELECT * FROM table(cp.`pdf/schools.pdf` (type => 'pdf', combinePages => true, 
schema => 'inline=(`Last Name` VARCHAR, `First Name Address` VARCHAR, 
`field_0` VARCHAR, `City` VARCHAR, `State` VARCHAR, `Zip` VARCHAR, 
`field_1` VARCHAR, `Occupation Employer` VARCHAR, 
`Date` VARCHAR, `field_2` DATE properties {`drill.format` = `M/d/yyyy`}, 
`Amount` DOUBLE)')) 
LIMIT 5

Testing

Added unit tests

@cgivre cgivre added enhancement PRs that add a new functionality to Drill new-format New Format Plugin doc-impacting PRs that affect the documentation labels Nov 2, 2021
@cgivre cgivre self-assigned this Nov 2, 2021
@luocooong
Member

@cgivre Thanks for the new format plugin. Is it ready for review?

@cgivre
Contributor Author

cgivre commented Nov 3, 2021

@cgivre Thanks for the new format plugin. Is it ready for review?

This should be ready for review :-)

@paul-rogers
Contributor

Cool contribution. I'm not entirely convinced this is something that Drill should handle. There are too many variables for the very limited controls which Drill provides. It is likely that this will work for one or two limited use cases, but not the vast majority of PDF files. Using JSON plugin config files to specify the mapping is awkward. Probably each file will need its own config, which is not scalable.

Drill's fundamental design is to run at scale. It is highly unlikely that someone will use PDF files to store GB of data. If they do, they have problems bigger than Drill can help them solve. Thus, this kind of plugin works only at the small scale: one or two files in, say, an embedded Drillbit with JDBC or SQLLine.

A better choice would be to wrap this thing in a script: tinker with the PDF extraction, using whatever tools are available, to get the right mapping. Then, wrap that in a script that produces, say, a CSV format to stdout. Drill can then read that input.

Such an approach enables all manner of ad-hoc, small scale data extraction.

Or, maybe Drill should offer a "desktop edition" that is designed for small, ad-hoc projects based on local files, with some way to handle all the tinkering needed when reading PDF files, images, Word files, spreadsheets, Twitter feeds, Slack posts and other formats popular with data scientists. Such features would not normally be part of the massive-scale deployments for which Drill is designed.

Thoughts?

@cgivre
Contributor Author

cgivre commented Nov 4, 2021

@paul-rogers Thanks for the feedback. I was on the fence about submitting this as well. I was kind of doing a "branch purge" and wanted to contribute some experiments I had been poking at. Let me first give you the backstory... Several years ago, I worked on an analytics project which, unfortunately, we had to scrap because the data we received was in PDF format. While there are Python libraries (and probably others) that can parse PDFs, many have C dependencies which we could not build in our environment. This project weighed on me because, had we been able to actually read the data, we could have solved a mission-critical issue.

Fast forward a few years, I got into Drill, learned how to write format plugins, and stumbled on the Tabula library which is built for the purpose of extracting and parsing tables from PDF files. I wanted to see if Drill could read these and if so, how well it did at extracting the tables and getting usable data from them. The actual result was that it worked surprisingly well. The extractors here do a very good job of identifying tables and extracting them into a tabular structure.

OK... see inline for the rest.

Cool contribution. I'm not entirely convinced this is something that Drill should handle.

I wasn't so sure either. However, we do already support file types which are not used for big data processing such as ESRI shape files, and a few other random formats. If you are viewing Drill solely as a query engine for big data, then yes this makes no sense. If you are viewing Drill as a Swiss Army knife for data, then it makes more sense.

There are too many variables for the very limited controls which Drill provides. It is likely that this will work for one or two limited use cases, but not the vast majority of PDF files.

Let me clarify that this is only for PDF files which contain tables. (Either images that look like tables or actual text tables). By design this will not work with all PDF files.

Using JSON plugin config files to specify the mapping is awkward. Probably each file will need its own config, which is not scalable.

I'm not sure what you're getting at here with the JSON config. Could you be mixing this up with another recent PR which did add some JSON config options to the HTTP/REST storage plugin? In any event, this format reader only has two config variables. One which merges all the tables in the document together (in the case of a table which spans multiple pages), and the other which specifies the index of the table to parse in the event of a document with multiple tables that are not related.

Drill's fundamental design is to run at scale. It is highly unlikely that someone will use PDF files to store GB of data. If they do, they have problems bigger than Drill can help them solve.
Indeed!

Thus, this kind of plugin works only at the small scale: one or two files in, say, an embedded Drillbit with JDBC or SQLLine.

A better choice would be to wrap this thing in a script: tinker with the PDF extraction, using whatever tools are available, to get the right mapping. Then, wrap that in a script that produces, say, a CSV format to stdout. Drill can then read that input.

True... The issue I ran into was that it was surprisingly difficult to get the PDF-parsing libraries to run.

Such an approach enables all manner of ad-hoc, small scale data extraction.

Or, maybe Drill should offer a "desktop edition" that is designed for small, ad-hoc projects based on local files, with some way to handle all the tinkering needed when reading PDF files, images, Word files, spreadsheets, Twitter feeds, Slack posts and other formats popular with data scientists. Such features would not normally be part of the massive-scale deployments for which Drill is designed.

Thoughts?

I actually really do like the idea of a mini-Drill for smaller use cases, but without all the overhead and complexity.

With that said, I don't really see any harm in including this. Clearly this is not Drill's primary use case, but it can be really valuable. I was actually helping my wife the other day with some data for her work. In that situation, a state government agency sent her data in PDF format which we had to enter into a website. Using Drill we were able to query the file and transform it into something actually useful.

@jnturton
Contributor

jnturton commented Nov 4, 2021

@cgivre @paul-rogers, my 2c. I guess some partial precedents for a format plugin like this are ones like format-image and format-esri (as noted), though those do only go after the explicitly structured content. It would not surprise me if there are quite often cases of unfortunate people needing to scrape data out of 10^0 - 10^5 PDFs. The purist in me agrees with Paul's thought: a Groovy script over whatever Java lib is used here could be employed instead. That would not automatically be parallelised like a Drill query is, so a user with many PDFs might be pushed all the way through GNU parallel to Spark. All that is followed by this recurring thought: if Drill is disciplined and focussed only on SQL against standard big data formats then it finds itself trying to compete in an uncomfortably crowded space. Probably fatally crowded.

I do also note that the "big" and "small" data worlds are not disjoint. I have in practice joined big data in Parquet with small reference data in Excel (actually EtherCalc). Even in the big data regime, reference data remains small and is maintained by humans in human formats rather than pumped out by machines in machine formats.

This is getting long again. Last thoughts.

  • I feel this really should be good at finding and parsing tables (work often, rather than seldom) to merit inclusion.
  • We should consider a subproject to contain our long tail of "non-standard" data formats. This would separate away plugins that should not be expected to run with the speed or reliability of the core data formats and keep the core distributable size down as we add format after format. We could then start to distribute tarballs for drill-core and drill-extra.
  • This looks like a plugin that will benefit from Drill's optional schema. That means we might have some unique ability to compete here.

@paul-rogers
Contributor

@cgivre, @dzamo raise good points. So, what is Drill today? Is it still primarily used for distributed queries at scale? Or, as a handy desktop tool for data scientists? Probably both. The problem is, a tool that tries to be both a dessert topping and a floor wax (let's see how old the readers are with this one) ends up being good at neither.

One approach, if we had resources, would be to create a Drill Desktop that is optimized for that case and encourages all kinds of specialized data connectors. Create an easy way to define those connectors (YAML files created by specialized web apps?) Ensure Drill has good integration with Jupyter and the other usual suspects.

Another approach, if we had resources, is the oft-discussed idea of separating the less-common plugins from the Drill core. Work started on this: to create an extension mechanism that made this possible. (Today, most plugins need quite a bit of Drill internal code.)

So, no harm in adding the PDF reader, but I expect usage will be pretty limited just because, for the folks that need it, configuration will be too hard. Better would be a Python or Spark job that extracts the data into a CSV file, then query the CSV file with Drill. Each step could be debugged easily. I can't imagine anyone will want to debug their PDF extraction using Drill's overly generous Java stack traces...

@jnturton
Contributor

jnturton commented Nov 7, 2021

The problem is, a tool that tries to be both a dessert topping and a floor wax (let's see how old the readers are with this one) ends up being good at neither.

@paul-rogers you got me with this idiom, but I like it! The broader topic is super interesting. If SQLite started adding the features needed to compete with Oracle Database 21c it would quickly fail at being SQLite. If Linux tried to be an OS kernel for both TVs and supercomputers it would... continue to dominate both extremes! There are some twists here!

Pigeonholing formats into small and large scales is also a tricky business. For example, we naturally want to declare PDF a desktop format, but I can easily imagine a conversation like the following.

"Hey Bob, remember that we sent decades of paper archives from the basement out to that big scanning centre for digitisation? They've come back as millions of pages of PDFs. Someone just asked me if we can help them find all invoices containing a particular SKU, and pull out the price on that line. The ERP system only has the last 10 years loaded into it and they want to go back further".

"Chuck 'em in HDFS, we'll run a Drill query"

"But PDF is a desktop publishing format, not a big data format! Surely our big data cluster will want nothing to do with it!?"

"Drill's got a plugin architecture which led to people adding support for all sorts of weird and wonderful formats. Querying PDFs is a dubious business but we'll know after ~10 lines of SQL if we can do this with Drill or not. If not, miserable days or weeks of programming with a PDF library await one of our interns."

@paul-rogers
Contributor

@dzamo, the reference was to an early Saturday Night Live skit.

Just to refocus the discussion, my question is really around configuration. When querying CSV, Parquet and the like, it is very clear where the data lies. When querying Excel, PDF, HTML, Word, "the web", it is less clear: there is some amount of data mining needed to say that it is THIS table and not THAT one, that the table spans three pages, etc.

The question is, how does the user specify this? If it were me, I would not want to be tinkering with a JSON storage plugin config, watching my query fail with a Drill stack trace, or wondering why I got no data. Instead, I'd want a tool better suited for the task. Once I had that, if I then wanted to run at scale, I'd want to say, "Drill, just use X and consume the data."

So, the question here is: is the JSON storage plugin config an effective way for a data scientist to mine PDF, Excel, HTML and other messy sources? I don't have an answer, I'm just asking the question.

Again, if we had the ability to have true external plugins, then this could easily be an add-on project. Those who know PDF could go hog wild creating a good solution. But, we don't, so every specialized plugin has to be part of core Drill. Is that good? That's also a question for debate.

@jnturton
Contributor

jnturton commented Nov 8, 2021

@paul-rogers, right, okay, it's an expressiveness thing here rather than a scale thing. The expressiveness of Drill SQL ∪ Drill format config JSON falls well short of that of a general-purpose scripting language, and for reading fiddly unstructured data that shortfall might rapidly become uncomfortable. The format config for this particular plugin looks quite succinct: the plugin will either automagically get your data out, or it won't, and then you need to pack up and go and open the interpreter of your favourite scripting language. Making your resulting script scale to millions of pages, if that's needed, is left to the student. I quite like the Ray project for Python myself.

This thread has triggered some thoughts. If we find ourselves starting to write long essays of JSON in format configs then we should probably be concerned. If we find ourselves trying to embed a miniature data processing DSL into format config JSON then we need to stop moving immediately and pray to the ancestors that we might be shown a path that will return us from the wilderness. I want to revisit the draft fixed-width format plugin with these ideas in mind. Its config allows setting names and types for columns, but for other formats we must do this in SQL. I think we should only ever do this in SQL.

I think we can do something on the packaging front. These format plugins live under contrib/ in the source tree and are compiled to their own jar files. If we simply change the final tarball-building stage of our Maven build to give us something like the following on our download page, would we not be in reasonable shape?

| Package | Size | Description |
| --- | --- | --- |
| drill-core | 300MB | Drill with core storage layer libs only. Use this in a focussed big data environment to query standard formats like Parquet, CSV and JSON in HDFS or object storage with predictable results and performance. Supplement this with individual plugins listed below as needed. |
| drill-kitchen-sink | 1.5GB | Drill core plus all 100+ storage and format plugins. Use this for maximum compatibility. Results and performance may vary across plugins. |
| drill-storage-jdbc | 130KB | Plugin to query systems that provide a JDBC driver using a generic SQL dialect. |
| drill-format-pdf | 90KB | Plugin to query tables scraped from PDF files. |
| ... | | |

P.S. We'd be persisting with a monolithic Git repo containing multiple "projects" here, but I personally don't mind mono repos.

@jnturton
Contributor

jnturton commented Nov 8, 2021

@cgivre Instead of generating field_0, field_1 when no column names can be determined, is it possible to generate a columns array to be accessed using columns[0], columns[1], ... to remain consistent with what we do for CSV?

@cgivre
Contributor Author

cgivre commented Nov 8, 2021

@cgivre Instead of generating field_0, field_1 when no column names can be determined, is it possible to generate a columns array to be accessed using columns[0], columns[1], ... to remain consistent with what we do for CSV?

It's possible to do that, but a few things:

  1. Sometimes the PDF reader does not read tables perfectly and you get a mix of found headers and not found headers, so that's one reason I took that approach.
  2. I actually dislike the columns approach from the CSV readers because it increases the level of complexity of queries. In theory, if someone is querying a table (doesn't matter from where) they will want that broken into columns and rows. The columns array approach (IMHO) makes this a lot harder than it needs to be.
  3. This actually follows the model used in the Excel reader.
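
To make the contrast concrete, here is a rough sketch of the two access styles (the file paths are illustrative):

```sql
-- CSV reader without extracted headers: one `columns` array per row
SELECT columns[0] AS last_name, columns[1] AS first_name
FROM dfs.`data/people.csv`;

-- PDF (and Excel) readers: generated scalar columns
SELECT field_0 AS last_name, field_1 AS first_name
FROM dfs.`data/people.pdf`;
```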

@cgivre
Contributor Author

cgivre commented Nov 8, 2021

@dzamo, the reference was to an early Saturday Night Live skit.

Just to refocus the discussion, my question is really around configuration. When querying CSV, Parquet and the like, it is very clear where the data lies. When querying Excel, PDF, HTML, Word, "the web", it is less clear: there is some amount of data mining needed to say that it is THIS table and not THAT one, that the table spans three pages, etc.

The question is, how does the user specify this? If it were me, I would not want to be tinkering with a JSON storage plugin config, watching my query fail with a Drill stack trace, or wondering why I got no data. Instead, I'd want a tool better suited for the task. Once I had that, if I then wanted to run at scale, I'd want to say, "Drill, just use X and consume the data."

So, the question here is: is the JSON storage plugin config an effective way for a data scientist to mine PDF, Excel, HTML and other messy sources? I don't have an answer, I'm just asking the question.

Again, if we had the ability to have true external plugins, then this could easily be an add-on project. Those who know PDF could go hog wild creating a good solution. But, we don't, so every specialized plugin has to be part of core Drill. Is that good? That's also a question for debate.

@dzamo @paul-rogers, these are all interesting points. I'd like to focus on the config variables for a moment. The PDF reader has two table-related config variables: combinePages and defaultTableIndex. While a user could theoretically set these globally, they really are intended to be set via the table() function at query time. For reference, the Excel reader has similar configs which allow a user to select different sheets within an Excel file, or define regions in a file where their data lives, etc.

IMHO, this flexibility is actually very good for the user, because it allows an administrator to configure global default values that make sense but also allow a user to make query-time changes so they can access their data.

With respect to the fixed-width plugin (#2282) I actually have a different vision of how this can be used, and will post comments there.

@jnturton
Contributor

jnturton commented Nov 8, 2021

  • Sometimes the PDF reader does not read tables perfectly and you get a mix of found headers and not found headers, so that's one reason I took that approach.

This consideration will only apply for extractHeaders = true, right? In this case all of the readers do split out columns so I think we're good.

  • I actually dislike the columns approach from the CSV readers because it increases the level of complexity of queries. In theory, if someone is querying a table (doesn't matter from where) they will want that broken into columns and rows. The columns array approach (IMHO) makes this a lot harder than it needs to be.

The columns array does allow text files to contain jagged arrays, which is perhaps valuable? I don't see a huge saving in typing select field_0, field_1 over select columns[0], columns[1], am I missing some other complication?

  • This actually follows the model used in the Excel reader.

My fear, and I don't know if this is real or not, is that if we proliferate plugin-specific quirks in how the schema is represented to the user then the promise of a standard SQL interface to the data gets tainted and developers and users will be put off.

"It's standard except every plugin presents its data the way that author prefers"
"That fragment of code I sent you for the columns array won't work here because this plugin does it differently"
"This SQL script makes for confusing reading because the plugins involved name their generated columns differently, sometimes 'column', sometimes 'field', sometimes 'var'"
etc.

I don't feel all that strongly about field_0 vs columns[0] (I mean, maybe we deprecate the columns array?) but I am finding myself thinking more and more about consistency.

@cgivre
Contributor Author

cgivre commented Nov 10, 2021

  • Sometimes the PDF reader does not read tables perfectly and you get a mix of found headers and not found headers, so that's one reason I took that approach.

This consideration will only apply for extractHeaders = true, right? In this case all of the readers do split out columns so I think we're good.

  • I actually dislike the columns approach from the CSV readers because it increases the level of complexity of queries. In theory, if someone is querying a table (doesn't matter from where) they will want that broken into columns and rows. The columns array approach (IMHO) makes this a lot harder than it needs to be.

The columns array does allow text files to contain jagged arrays, which is perhaps valuable? I don't see a huge saving in typing select field_0, field_1 over select columns[0], columns[1], am I missing some other complication?

  • This actually follows the model used in the Excel reader.

My fear, and I don't know if this is real or not, is that if we proliferate plugin-specific quirks in how the schema is represented to the user then the promise of a standard SQL interface to the data gets tainted and developers and users will be put off.

"It's standard except every plugin presents its data the way that author prefers" "That fragment of code I sent you for the columns array won't work here because this plugin does it differently" "This SQL script makes for confusing reading because the plugins involved name their generated columns differently, sometimes 'column', sometimes 'field', sometimes 'var'" etc.

I don't feel all that strongly about field_0 vs columns[0] (I mean, maybe we deprecate the columns array?) but I am finding myself thinking more and more about consistency.

The bigger implication of a columns array vs field_n is when a user starts with SELECT * queries. It makes it harder for BI tools to gather schema metadata and it is also non-standard SQL. Now... querying PDF is also non-standard... so maybe that's less important. But it makes the discovery a little harder IMHO.

@jnturton
Contributor

The bigger implication of a columns array vs field_n is when a user starts with SELECT * queries. It makes it harder for BI tools to gather schema metadata and it is also non-standard SQL. Now... querying PDF is also non-standard... so maybe that's less important. But it makes the discovery a little harder IMHO.

@cgivre Oh right, that makes sense. So should we put in support for both columns[n] and field_n as widely as possible, with a standardised option which lets users switch between each mode? Maybe standardising on naming of columns[n] and column_n is a small saving on cognitive load for users here?

@paul-rogers with apologies to this PR for saddling it with so much broader design chat, I wanted to share a last set of findings from talking with others, and finally to ask if we might go over a couple of questions with you, away from this PR.

  1. I've polled some Drill devs and going after the "long tail" of formats and storage systems is of interest to most of them. @vvysotskyi even has an intriguing idea of a marketplace for these plugins, I guess something like the Eclipse plugin marketplace.
  2. I have developed a conviction that to go after the "long tail" and not produce a sprawling mess that neither developers nor users want to touch, we need to try to get strict (to the extent possible) about consistency in how plugins behave and how they are configured. Today we already are not all that consistent (e.g. see remarks on columns[n] vs field_n above, on column name and type in fixed width format).
  3. Those I've spoken with do also like the idea of splitting our distributed packages into "core" and "kitchen sink", or something like that, to put us in a better position to go after the "long tail". It sounds like we're okay with our existing mono repo containing many plugins but end users should not have to download the kitchen sink to query e.g. just JSON or Parquet. Drill startup times will probably be slow for the kitchen sink because the Java class loader will have a huge amount to scan. And developer testing could get onerous if we cannot compile only a subset.
  4. By chance I saw that BigQuery, which for some reason I've designated in my mind as kind of the Rolls-Royce of Dremel-family engines even though I know little about it, can query Google Sheets. So even they entertain some "small data" formats, although nothing like what we're imagining. Just an anecdote.

I would love to consult with you on 2 and 3 in a sort of "Very well, if you must do a distribution of Drill with this long tail of formats, storage systems and UDFs in it then at least equip yourselves with the following practices" chat. Perhaps in the upcoming community meetup, otherwise outside (if it's of any interest on your end of course).

Thanks
James

Member

@luocooong luocooong left a comment


@cgivre It is a good solution for querying the table data in PDFs with Drill. Please take note of my comments.


@Override
public boolean open(FileScanFramework.FileSchemaNegotiator negotiator) {
  System.setProperty("java.awt.headless", "true");
Member


This property has been set many times; maybe it can be done like this:

if (System.getProperty("java.awt.headless") == null) {
  System.setProperty("java.awt.headless", "true");
}

And then remove the duplicate calls in Utils.java, line 60 and PdfBatchReader.java, line 164?

Contributor


Oh, my. Drill can run embedded. Suppose I'm running Drill embedded in my AWT application. Will this break things?

Drill is multi-threaded: we can have a dozen readers all setting this same system property concurrently with other code reading it. Is this a safe thing to do?

More fundamentally, should we be mucking about with shared system properties in a reader? Probably not.

// Check to see if the limit has been reached
if (rowWriter.limitReached(maxRecords)) {
  return false;
} else if (currentRowIndex >= currentTable.getRows().size() &&
Member


Is this the logic to handle a table that spans multiple pages?

Contributor


If so, I'm a bit confused by the comparison of row count and table count. Imagine each table has 100K rows. It would have to split across Drill batches. And, if each table has 3 rows, all tables should be combined in a single Drill batch.

Contributor Author


I cleaned this up by wrapping this in iterator classes. This logic was here to support the page merge feature. Basically, it works like this: a Page can contain 0 or more Tables. If you take a look at the included file schools.pdf you'll see it is a five-page spreadsheet. The problem is that if you wanted to actually query this data, you'd want the whole thing, not an individual page. So, what Drill can do is merge all the Pages (each containing one table) together.

You would only want to use this feature in the case of multi-page spreadsheet PDFs. The logic therefore is:

  1. Read the current table until there are no more rows.
  2. Get the next table (if there is one)
  3. Read it until there are no more rows
  4. Go back to step 2.

@paul-rogers
Contributor

@dzamo, @cgivre: we should certainly encourage people to add plugins (connectors) to everything under the sun. The question really is: does the rather limited Drill core team want to support all of them at the same level as the main ones that Just Gotta Work? For huge files. On local, S3, HDFS. With splits. With metadata. With excellent error reporting. And on and on.

The PDF thing seems like the prototypical "not in the core, but a great add-on" example. Yes, some user might want to scan zillions of PDF files on an HDFS system. But the core Drill team can't spend its resources on that when there are still basics to address.

Let's do this: let's get this into Drill for now. Then:

  • Ensure we have a good way to build plugins separate from the Drill code. (Last time I tried, I couldn't get it to work.)
  • Explain how to create a plugin in a user's own repo, built against Drill.
  • Explain how to drop the plugin into a running Drill to add that functionality.

This way, even without a marketplace, folks can build their own connectors separate from the Drill repo.

A next step might be to split Drill's monorepo into multiple parts, as do many other projects. The current repo would be the core. Then, the contrib stuff could move to its own repo(s) under the Drill project. And, users can create their own extensions in their own repos.

Contributor

@paul-rogers paul-rogers left a comment


Multiple comments: some architectural, some nits.

There are many example PDF files in the test directory. Not clear from their names what they are used for. Maybe add a "description.txt" or some such to explain how we use each for testing. That is:

  • china.pdf: Verifies that Unicode (Chinese) characters are read correctly.
  • ...


### Merging Pages
The PDF reader reads tables from PDF files on each page. If your PDF file has tables that span multiple pages, you can set the `combinePages` parameter to `true` and Drill
will merge all the tables in the PDF file. You can also do this at query time with the `table()` function.
Contributor


Suppose my file has three, mutually incompatible tables, and I only want to read the first. Can I? If so, how?

Or, do I read all of them (sales of apples by state, shelf life of various apple kinds, list of largest apple growers) into a big messy, combined table, then use a WHERE clause to try to keep just the shelf life info?

Contributor Author


The way I envisioned this working was for a user to use the table() function and specify the table index at query time. That way they can query as many incompatible tables as they want and not have to read ones they don't.

}
```
The avaialable options are:
* `extractHeaders`: Extracts the first row of any tables as the header row. If set to false, Drill will assign column names of `field_0`, `field_1` to each column.
Contributor


Explain a bit more. I have two tables with titles "a | b", and "c | d | e". Do I get a union row with "a, b, c, d, e" as fields? Or, just "a, b" as fields? When I read the second table, do I then get "a, b, e"?

Contributor Author


This is on a per-table basis. Really it works the same way the CSV reader would work if you tried to query multiple files in a directory.
In the example you cited, you'd get a table with columns a | b | c | d | e. The rows from table 1 would include null entries for columns c | d | e and the rows from table 2 would contain nulls for columns a | b.
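
A hypothetical sketch of that merged result, with made-up values:

| a | b | c | d | e |
| --- | --- | --- | --- | --- |
| 1 | 2 | null | null | null |
| null | null | 3 | 4 | 5 |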

List<String> values = new ArrayList<>();

if (firstRow != null) {
  for (int i =0; i < firstRow.size(); i++) {
Contributor


Nit: space after =.

What happens if the rows have different sizes? Is that possible if we guessed wrong about the table? If the table merges cells? How should we handle such issues? Do we have tests for them?

Contributor Author


We didn't test for all those use cases, but the underlying library (Tabula) actually has a pretty robust collection of tests that do cover a lot of what you're asking about.

In my manual testing, what I found was that if you had inconsistent tables, Tabula would basically produce some sort of mapping to columns and rows that Drill could read. Not ideal, but neither is trying to extract data from PDFs.

@jnturton
Contributor

Copy that @paul-rogers, I have incorporated your advice above into the Drill 2.0 wiki page. I am hopeful that after 1.20 is out the door we can split out the new branches, and repos, for 2.0 (but it is of course the project elders that decide such things). In my youthful exuberance I imagine it starting early next year!

I would think that getting the source code reorganised in a way that will carry us into the future should probably be the very first Drill 2.0 activity to take place. Surely we would want our most experienced contributors to define such a task.

@jnturton
Contributor

jnturton commented Dec 1, 2021

Dear PR author and reviewers.

This is a generic message to say that we would like to merge this PR in time for the 1.20 release. Currently we're targeting a master branch freeze date of 2021-12-10 (10 Dec). Please strive to complete development and review by this time, or indicate that the PR will need more time (and how much).

Thank you.

@cgivre cgivre force-pushed the format-pdf branch 2 times, most recently from b62648a to 80bb4fd on December 10, 2021 01:03
@cgivre
Contributor Author

cgivre commented Dec 21, 2021

@cgivre new pdfbox release: https://downloads.apache.org/pdfbox/2.0.25/RELEASE-NOTES.txt

Thanks @pjfanning! Unfortunately, that didn't solve the issue with the Java window opening. I emailed the PDFBox user list and hopefully they can give me some suggestions.

@cgivre
Contributor Author

cgivre commented Dec 22, 2021

@paul-rogers
This PR is now ready for review. I did some research into the headless option and why I was not getting this issue when I ran unit tests, only when I ran manual queries. The issue is that when PDFBox opens a document, it calls some AWT libraries that it needs for deciphering PDF files. (Also probably for writing PDF files.) In any event, this is what was causing the window to open.

When Drill runs the unit tests, we actually have the headless option specified as a Java option. So, I moved this setting to drill-config.sh and included a comment about why it is there. If someone wants to remove it for some reason, they can, and they can disable the PDF functionality simply by not including it in the configs for the storage plugin. This seems like the right place for this to live.

I believe that I've addressed all your other comments as well. Thank you as always for the thorough review! It's ready for another look.

Contributor

@paul-rogers paul-rogers left a comment


Still not convinced that this should be part of the core Drill code: reading PDF files pushes Drill in the direction of a small-scale, query-anything tool rather than a large-scale distributed query tool. Folks won't, at the same time, run 100 Drill nodes and muck with PDF files: PDF is a horrible format to use for data at scale.

Further, I'm not convinced that SQL and JSON are the best formats for working out how to parse a PDF file. If the user is naive enough to use PDF as their input format, they're not going to be savvy enough to figure out the ad-hoc, clunky way they have to configure stuff using SQL and JSON.

That said, that conversation was already had, and we're proceeding anyway. Given that, the code looks good. So, +1.


while (!rowWriter.isFull()) {
  // Check to see if the limit has been reached
  if (rowWriter.limitReached(maxRecords)) {
Contributor


I'd forgotten this comment. Limit was added to the EVF framework in the pending EVF V2 PR.

@cgivre cgivre merged commit e3150e3 into apache:master Jan 9, 2022
@cgivre cgivre deleted the format-pdf branch January 9, 2022 03:33