DRILL-8028: Add PDF Format Plugin #2359

Merged
merged 28 commits into from
Jan 9, 2022
75 changes: 75 additions & 0 deletions contrib/format-pdf/README.md
@@ -0,0 +1,75 @@
# Format Plugin for PDF Table Reader
Receiving data locked inside a PDF file is one of the most annoying parts of a data science project. This plugin endeavours to let you query data in PDF tables using Drill's SQL interface.

## Data Model
Because PDF files are generally not intended to be queried or read by machines, mapping their contents to tables and rows is an imperfect process. The PDF reader supports
provided schemas. You can read about Drill's [provided schema functionality here](https://drill.apache.org/docs/plugin-configuration-basics/#specifying-the-schema-as-table-function-parameter).


### Merging Pages
The PDF reader reads tables from PDF files on each page. If your PDF file has tables that span multiple pages, you can set the `combinePages` parameter to `true` and Drill
will merge all the tables in the PDF file. You can also do this at query time with the `table()` function.
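
As a rough sketch of the query-time form mentioned above, the statement below passes `combinePages` through the `table()` function. The file path is hypothetical, and the parameter syntax mirrors the other `table()` examples in this README.

```sql
-- Hypothetical file: a report whose single table continues across several pages.
SELECT *
FROM table(dfs.`pdf/multi_page_report.pdf` (type => 'pdf', combinePages => true))
LIMIT 10
```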
Contributor

Suppose my file has three, mutually incompatible tables, and I only want to read the first. Can I? If so, how?

Or, do I read all of them (sales of apples by state, shelf life of various apple kinds, list of largest apple growers) into a big messy, combined table, then use a WHERE clause to try to keep just the shelf life info?

Contributor Author

The way I envisioned this working was for a user to use the table() function and specify the table index at query time. That way they can query as many incompatible tables as they want and not have to read ones they don't.
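
A minimal sketch of the approach described in this reply, assuming `defaultTableIndex` is accepted as a `table()` parameter in the same way as `combinePages` and `password`. The file name is hypothetical, and the index values follow the README's statement that numbering starts at `1`.

```sql
-- Read only the first table in the file (hypothetical file name).
SELECT *
FROM table(dfs.`pdf/apples.pdf` (type => 'pdf', defaultTableIndex => 1));

-- Read only the second table, without touching the others.
SELECT *
FROM table(dfs.`pdf/apples.pdf` (type => 'pdf', defaultTableIndex => 2));
```

Each query targets one table, so incompatible tables never have to be merged and filtered with a `WHERE` clause.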


## Configuration
To configure the PDF reader, add the information below to the `formats` section of a file-based storage plugin, such as `dfs`, `hdfs` or `s3`.

```json
"pdf": {
  "type": "pdf",
  "extensions": [
    "pdf"
  ],
  "extractionAlgorithm": "spreadsheet",
  "extractHeaders": true,
  "combinePages": false
}
```
The available options are:
* `extractHeaders`: Treats the first row of each table as the header row. If set to `false`, Drill assigns the column names `field_0`, `field_1`, and so on.
* `combinePages`: Merges tables that span multiple pages.
* `defaultTableIndex`: Selects which table within the PDF file to query. The index starts at `1`.
* `extractionAlgorithm`: Selects the algorithm used to extract data from the PDF file. Choices are `spreadsheet` and `basic`. Depending on your data, one may work better than the other.
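
The options above can also be tried per query rather than in the plugin configuration. The sketch below assumes `extractionAlgorithm` and `extractHeaders` are accepted as `table()` parameters like the other options shown in this README; the file name is hypothetical.

```sql
-- Assumption: format options can be overridden at query time via table().
-- With extractHeaders => false, columns are named field_0, field_1, ...
SELECT field_0, field_1
FROM table(dfs.`pdf/report.pdf` (type => 'pdf',
    extractionAlgorithm => 'basic',
    extractHeaders => false))
LIMIT 5
```

Running the same query with `'spreadsheet'` and comparing the output is one way to judge which algorithm suits a given file.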

## Accessing Document Metadata Fields
PDF files carry a considerable amount of metadata that can be useful for analysis. Drill extracts the following fields from every PDF file. These fields are not projected in star queries and must be selected explicitly. The document's creator populates them, so some or all may be empty. Except for `_page_count`, which is an `INT`, and the two date fields, all fields are `VARCHAR`.

The fields are:
* `_page_count`
* `_author`
* `_title`
* `_subject`
* `_keywords`
* `_creator`
* `_producer`
* `_creation_date`
* `_modification_date`
* `_trapped`
* `_table_count`
cgivre marked this conversation as resolved.

The query below will access a document's metadata:

```sql
SELECT _page_count, _title, _author, _subject,
_keywords, _creator, _producer, _creation_date,
_modification_date, _trapped
FROM dfs.`pdf/20.pdf`
```
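
As a small illustrative sketch, `_table_count` from the list above could be checked before choosing a table index. The file path is reused from the query above; whether the value repeats on every row is an assumption, hence the `LIMIT 1`.

```sql
-- Inspect how many tables and pages the reader found in the file.
SELECT _table_count, _page_count
FROM dfs.`pdf/20.pdf`
LIMIT 1
```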
The query below demonstrates how to define a schema at query time:

```sql
SELECT * FROM table(cp.`pdf/schools.pdf` (type => 'pdf', combinePages => true,
schema => 'inline=(`Last Name` VARCHAR, `First Name Address` VARCHAR,
`field_0` VARCHAR, `City` VARCHAR, `State` VARCHAR, `Zip` VARCHAR,
`field_1` VARCHAR, `Occupation Employer` VARCHAR,
`Date` VARCHAR, `field_2` DATE properties {`drill.format` = `M/d/yyyy`},
`Amount` DOUBLE)'))
LIMIT 5
```

### Encrypted Files
If a PDF file is encrypted, you can supply the password to the file via the `table()` function as shown below. Note that the password will be recorded in any query logs that
may exist.

```sql
SELECT *
FROM table(dfs.`encrypted_pdf.pdf` (type => 'pdf', password => 'your_password'))
```
105 changes: 105 additions & 0 deletions contrib/format-pdf/pom.xml
@@ -0,0 +1,105 @@
<?xml version="1.0"?>
<!--

Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

-->
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<parent>
<artifactId>drill-contrib-parent</artifactId>
<groupId>org.apache.drill.contrib</groupId>
<version>1.20.0-SNAPSHOT</version>
</parent>

<artifactId>drill-format-pdf</artifactId>
<name>Drill : Contrib : Format : PDF</name>

<dependencies>
<dependency>
<groupId>org.apache.drill.exec</groupId>
<artifactId>drill-java-exec</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>technology.tabula</groupId>
<artifactId>tabula</artifactId>
<version>1.0.5</version>
<exclusions>
<exclusion>
<artifactId>slf4j-simple</artifactId>
<groupId>org.slf4j</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.25</version>
<exclusions>
<exclusion>
<groupId>commons-logging</groupId>
<artifactId>commons-logging</artifactId>
</exclusion>
</exclusions>
</dependency>
<!-- Test dependencies -->
<dependency>
<groupId>org.apache.drill.exec</groupId>
<artifactId>drill-java-exec</artifactId>
<classifier>tests</classifier>
<version>${project.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.drill</groupId>
<artifactId>drill-common</artifactId>
<classifier>tests</classifier>
<version>${project.version}</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<artifactId>maven-resources-plugin</artifactId>
<executions>
<execution>
<id>copy-java-sources</id>
<phase>process-sources</phase>
<goals>
<goal>copy-resources</goal>
</goals>
<configuration>
<outputDirectory>${basedir}/target/classes/org/apache/drill/exec/store/pdf
</outputDirectory>
<resources>
<resource>
<directory>src/main/java/org/apache/drill/exec/store/pdf</directory>
<filtering>true</filtering>
</resource>
</resources>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>