DRILL-8028: Add PDF Format Plugin #2359
Merged
28 commits
47963d8
Initial commit
cgivre cfbe020
WIP
cgivre 7482188
Regular queries working
cgivre 14a9318
Metadata fields working
cgivre ef6f303
Minor fixes
cgivre c7a6b78
Fixed unit test
cgivre e451aa3
Added additional closing functions.
cgivre 68f474e
WIP
cgivre db4baec
Fixed Headless Issue
cgivre c7ab0b4
Updated to Drill 1.20
cgivre e189d96
Added option to merge pages
cgivre d807179
Ready for PR
cgivre 6fe5689
Removed struts
cgivre 5f0c8ac
WIP
cgivre 87d896d
Progress..
cgivre f2d9242
UTs all passing
cgivre 19c66da
Fix Duplicate Page Issue
cgivre 4b1cd16
Fixed extract headers
cgivre d2a06c7
Refactored Tables and Added Metadata class
cgivre fe0a86b
Added UT
cgivre 511b5a0
Code cleanup
cgivre 3848ede
New UTs
cgivre 0082640
Added UTs
cgivre 86ad1dd
Added UT and removed extra test files
cgivre 0db9374
Removed comment
cgivre 4f2d44f
Removed comment
cgivre 4cf4d41
Bump pdfbox to latest version
cgivre b3c66a1
Moved Java config to drill-config.sh
cgivre
@@ -0,0 +1,75 @@
# Format Plugin for PDF Table Reader
One of the most annoying tasks in a data science project is dealing with data that arrives in a PDF file. This plugin endeavours to enable you to query data in PDF tables using Drill's SQL interface.

## Data Model
Since PDF files generally are not intended to be queried or read by machines, mapping the data to tables and rows is not a perfect process. To help with this, the PDF reader supports a provided schema. You can read about Drill's [provided schema functionality here](https://drill.apache.org/docs/plugin-configuration-basics/#specifying-the-schema-as-table-function-parameter).

### Merging Pages
The PDF reader reads the tables on each page of a PDF file. If your PDF file has tables that span multiple pages, you can set the `combinePages` parameter to `true` and Drill will merge all the tables in the PDF file. You can also do this at query time with the `table()` function.

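
As a minimal sketch of the query-time form, the statement below passes `combinePages => true` through the `table()` function. The file path `pdf/multi_page_report.pdf` is a hypothetical placeholder; substitute a PDF from your own workspace.

```sql
-- Hypothetical file path: replace with a PDF in your own workspace.
-- combinePages => true asks Drill to merge the tables found on each page.
SELECT *
FROM table(dfs.`pdf/multi_page_report.pdf` (type => 'pdf', combinePages => true))
```

Setting the option per query leaves the storage plugin default untouched, which may be preferable when only some files contain multi-page tables.
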
## Configuration
To configure the PDF reader, add the information below to the `formats` section of a file-based storage plugin, such as `dfs`, `hdfs` or `s3`.

```json
"pdf": {
  "type": "pdf",
  "extensions": [
    "pdf"
  ],
  "extractionAlgorithm": "spreadsheet",
  "extractHeaders": true,
  "combinePages": false
}
```
The available options are:
* `extractHeaders`: Extracts the first row of any table as the header row. If set to `false`, Drill will name the columns `field_0`, `field_1`, and so on.
* `combinePages`: Merges multi-page tables together.
* `defaultTableIndex`: Allows you to query different tables within the PDF file. Index starts at `1`.
* `extractionAlgorithm`: Allows you to choose the algorithm used for extracting data from the PDF file. Choices are `spreadsheet` and `basic`. Depending on your data, one may work better than the other.

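
The options above can also be sketched at query time. The example below assumes the `table()` function accepts the same option names as the format configuration, which is the pattern this document uses for `combinePages` and `password`. The file path `pdf/apples.pdf` is a hypothetical placeholder.

```sql
-- Hypothetical file path. Read only the second table in the file.
-- Assumes defaultTableIndex can be passed through table(); the index starts at 1.
SELECT *
FROM table(dfs.`pdf/apples.pdf` (type => 'pdf', defaultTableIndex => 2));

-- With extractHeaders => false, columns are named field_0, field_1, and so on.
SELECT field_0, field_1
FROM table(dfs.`pdf/apples.pdf` (type => 'pdf', extractHeaders => false));
```

Selecting a table by index lets you query a file with several unrelated tables without reading them all into one combined result.
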
## Accessing Document Metadata Fields
PDF files have a considerable amount of metadata which can be useful for analysis. Drill will extract the following fields from every PDF file. Note that these fields are not projected in star queries and must be selected explicitly. The document's creator populates these fields, and some or all may be empty. With the exception of `_page_count`, which is an `INT`, and the two date fields, all fields are `VARCHAR`.

The fields are:
* `_page_count`
* `_author`
* `_title`
* `_subject`
* `_keywords`
* `_creator`
* `_producer`
* `_creation_date`
* `_modification_date`
* `_trapped`
* `_table_count`

The query below will access a document's metadata:

```sql
SELECT _page_count, _title, _author, _subject,
  _keywords, _creator, _producer, _creation_date,
  _modification_date, _trapped
FROM dfs.`pdf/20.pdf`
```
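
As a smaller sketch, `_table_count` from the list above can be selected on its own to check how many tables Drill found before choosing a table index. This assumes the metadata values are identical on every row, so `LIMIT 1` is enough; the file path reuses the one from the query above.

```sql
-- Metadata fields must be named explicitly; SELECT * does not return them.
-- Assumption: metadata is repeated per row, so one row is sufficient.
SELECT _page_count, _table_count
FROM dfs.`pdf/20.pdf`
LIMIT 1
```
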
The query below demonstrates how to define a schema at query time:

```sql
SELECT * FROM table(cp.`pdf/schools.pdf` (type => 'pdf', combinePages => true,
schema => 'inline=(`Last Name` VARCHAR, `First Name Address` VARCHAR,
`field_0` VARCHAR, `City` VARCHAR, `State` VARCHAR, `Zip` VARCHAR,
`field_1` VARCHAR, `Occupation Employer` VARCHAR,
`Date` VARCHAR, `field_2` DATE properties {`drill.format` = `M/d/yyyy`},
`Amount` DOUBLE)'))
LIMIT 5
```

### Encrypted Files
If a PDF file is encrypted, you can supply the file's password via the `table()` function as shown below. Note that the password will be recorded in any query logs that may exist.

```sql
SELECT *
FROM table(dfs.`encrypted_pdf.pdf` (type => 'pdf', password => 'your_password'))
```
@@ -0,0 +1,105 @@
<?xml version="1.0"?>
<!--

    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements. See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership. The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.

-->
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <parent>
    <artifactId>drill-contrib-parent</artifactId>
    <groupId>org.apache.drill.contrib</groupId>
    <version>1.20.0-SNAPSHOT</version>
  </parent>

  <artifactId>drill-format-pdf</artifactId>
  <name>Drill : Contrib : Format : PDF</name>

  <dependencies>
    <dependency>
      <groupId>org.apache.drill.exec</groupId>
      <artifactId>drill-java-exec</artifactId>
      <version>${project.version}</version>
    </dependency>
    <dependency>
      <groupId>technology.tabula</groupId>
      <artifactId>tabula</artifactId>
      <version>1.0.5</version>
      <exclusions>
        <exclusion>
          <artifactId>slf4j-simple</artifactId>
          <groupId>org.slf4j</groupId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>org.apache.pdfbox</groupId>
      <artifactId>pdfbox</artifactId>
      <version>2.0.25</version>
      <exclusions>
        <exclusion>
          <groupId>commons-logging</groupId>
          <artifactId>commons-logging</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <!-- Test dependencies -->
    <dependency>
      <groupId>org.apache.drill.exec</groupId>
      <artifactId>drill-java-exec</artifactId>
      <classifier>tests</classifier>
      <version>${project.version}</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.drill</groupId>
      <artifactId>drill-common</artifactId>
      <classifier>tests</classifier>
      <version>${project.version}</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <artifactId>maven-resources-plugin</artifactId>
        <executions>
          <execution>
            <id>copy-java-sources</id>
            <phase>process-sources</phase>
            <goals>
              <goal>copy-resources</goal>
            </goals>
            <configuration>
              <outputDirectory>${basedir}/target/classes/org/apache/drill/exec/store/pdf
              </outputDirectory>
              <resources>
                <resource>
                  <directory>src/main/java/org/apache/drill/exec/store/pdf</directory>
                  <filtering>true</filtering>
                </resource>
              </resources>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
Suppose my file has three, mutually incompatible tables, and I only want to read the first. Can I? If so, how?
Or, do I read all of them (sales of apples by state, shelf life of various apple kinds, list of largest apple growers) into a big messy, combined table, then use a `WHERE` clause to try to keep just the shelf life info?

The way I envisioned this working was for a user to use the `table()` function and specify the table index at query time. That way they can query as many incompatible tables as they want and not have to read ones they don't.