
Apache Parquet

A popular columnar data storage format, widely used in Big Data and analytics.
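"Columnar" means values are stored per column rather than per row, so a query that touches only a few columns reads far less data. A minimal stdlib sketch of the idea (illustrative only — this is not the actual Parquet on-disk layout):

```python
# Row-oriented: each record is stored together;
# scanning one field still touches every record.
rows = [
    {"id": 1, "name": "alice", "score": 9.5},
    {"id": 2, "name": "bob",   "score": 7.1},
]

# Column-oriented (Parquet-style): each column is stored contiguously;
# reading 'score' never touches 'id' or 'name'.
columns = {
    "id":    [1, 2],
    "name":  ["alice", "bob"],
    "score": [9.5, 7.1],
}

scores = columns["score"]   # one contiguous read
print(scores)               # → [9.5, 7.1]
```

Contiguous per-column storage is also what lets Parquet apply column-specific encodings and compression.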

Key Points

Parquet Tools

Download

Download and install Parquet Tools from here.

Or use the script from DevOps-Bash-tools, which automatically determines the latest version if no version is given as the first argument:

download_parquet_tools.sh  # optionally pass "$version" as the first arg

Run

Run Parquet Tools jar, downloading it if not already present:

parquet_tools.sh <command>

You can also run the jar directly; it's just a longer command:

PARQUET_VERSION="1.11.2"
java -jar "parquet-tools-$PARQUET_VERSION.jar" <command>

Commands

parquet_tools.sh cat
parquet_tools.sh dump
parquet_tools.sh head
parquet_tools.sh meta
parquet_tools.sh schema

Hive Parquet Output

Hive creates Parquet files with names like:

/apps/hive/warehouse/bicore.db/auditlogs_parquet/000000_0

rather than:

blah.parquet
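Because Hive output files carry no .parquet extension, you cannot identify them by filename. Parquet files begin and end with the 4-byte magic bytes `PAR1`, so a cheap identity check is possible (a sketch; the example path is from the Hive layout above):

```python
import os

PARQUET_MAGIC = b"PAR1"

def is_parquet(path):
    """Cheap check: Parquet files start and end with the magic bytes b'PAR1'."""
    if os.path.getsize(path) < 8:   # magic appears at both ends, 4 bytes each
        return False
    with open(path, "rb") as f:
        if f.read(4) != PARQUET_MAGIC:
            return False
        f.seek(-4, os.SEEK_END)
        return f.read(4) == PARQUET_MAGIC

# e.g. is_parquet('/apps/hive/warehouse/bicore.db/auditlogs_parquet/000000_0')
```

This only checks the framing, not the internal metadata — for real validation use validate_parquet.py below.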

Python Library - PyArrow

import pyarrow.parquet as pq
pq.read_table('file.parquet')  # use_threads=True by default (older releases took nthreads=4)
pq.ParquetDataset('myDir/')    # reads a whole directory of Parquet files

It is not tolerant of errors though. See validate_parquet.py below.

validate_parquet.py

Get it from DevOps-Python-tools:

validate_parquet.py "$parquet_file" "$parquet_file2" ...

Recursively find and validate all parquet files in all directories under the current directory:

validate_parquet.py .
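A rough stdlib approximation of the same recursive idea — walk a directory tree and flag files whose Parquet framing looks broken — might look like this (a sketch only; the real validate_parquet.py actually parses the files):

```python
import os
import struct

def parquet_footer_ok(path):
    """Sanity-check Parquet framing: b'PAR1' header, then at the end a
    4-byte little-endian footer-metadata length followed by b'PAR1'."""
    size = os.path.getsize(path)
    if size < 12:  # 4-byte header magic + 4-byte footer length + 4-byte footer magic
        return False
    with open(path, "rb") as f:
        if f.read(4) != b"PAR1":
            return False
        f.seek(-8, os.SEEK_END)
        (footer_len,) = struct.unpack("<I", f.read(4))
        if f.read(4) != b"PAR1":
            return False
        # footer metadata must fit between the header magic and the trailer
        return footer_len + 12 <= size

def find_bad_parquet(root):
    """Recursively yield files under root that fail the framing check."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if not parquet_footer_ok(path):
                yield path
```

This catches truncated files (a common failure mode for interrupted jobs) but not corrupt column data, which needs a real read.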

Ported from private Knowledge Base page 2014+