Popular columnar data storage format, widely used in Big Data and Analytics.
- columnar
- efficient for column-specific queries
- optimized for reads, more write overhead due to buffering (RAM+CPU)
- each data file contains the values for a set of rows
- schema evolution - limited - can only add columns at the end
- generally faster than ORC for reads (workload-dependent)
- compression - different codecs can be used for different columns, e.g. one for string columns, another for numeric columns (see the PyArrow sketch after this list)
- compression ratio is not as good as ORC's, but it is slightly faster
- widely used by many systems:
- Database-like systems, MPP and distributed SQL engines:
- Hive
- Impala
- Presto / Trino
- Drill
- Data processing frameworks:
- Spark
- Flink
- Pandas (via PyArrow or FastParquet)
- TensorFlow
- Cloud data warehousing platforms:
- BigQuery
- Snowflake
- AWS Athena / Redshift Spectrum
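A minimal sketch of the per-column compression point above, assuming PyArrow is installed - the filename, table, column names and codec choices are made up for illustration:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# hypothetical example table - column names and values are made up
table = pa.table({
    "name":  ["alice", "bob", "carol"],
    "score": [1.5, 2.5, 3.5],
})

# different compression codecs per column:
# snappy for the string column, gzip for the numeric column
pq.write_table(
    table,
    "example.parquet",
    compression={"name": "snappy", "score": "gzip"},
)
```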
Download and install Parquet Tools (the parquet-tools jar).
Or use the script from DevOps-Bash-tools, which automatically determines the latest version if no version is specified as the first arg:
download_parquet_tools.sh # "$version"
Run the Parquet Tools jar, downloading it first if it is not already present:
parquet_tools.sh <command>
You can run the jar directly; it's just a longer command:
PARQUET_VERSION="1.11.2"
java -jar "parquet-tools-$PARQUET_VERSION.jar" <command>
Then run the commands, each of which takes a parquet file as its argument:
parquet_tools.sh cat     # print the file's records
parquet_tools.sh dump    # dump both the records and the metadata
parquet_tools.sh head    # print the first few records
parquet_tools.sh meta    # print the file metadata
parquet_tools.sh schema  # print the schema
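The same kind of inspection can also be done from Python with PyArrow - a minimal sketch, roughly matching the meta and schema commands, assuming a local file named example.parquet (hypothetical filename):

```python
import pyarrow.parquet as pq

print(pq.read_metadata("example.parquet"))  # row count, row groups, format version, size
print(pq.read_schema("example.parquet"))    # column names and types
```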
Hive creates parquet files as:
/apps/hive/warehouse/bicore.db/auditlogs_parquet/000000_0
rather than:
blah.parquet
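One way to confirm that such an extension-less file really is Parquet is to check for the 4-byte magic number "PAR1" that Parquet files both start and end with - a minimal sketch, using the example Hive path above:

```python
def is_parquet(path):
    # Parquet files begin and end with the 4-byte magic number b"PAR1"
    with open(path, "rb") as f:
        if f.read(4) != b"PAR1":
            return False
        f.seek(-4, 2)  # seek to 4 bytes before the end of the file
        return f.read(4) == b"PAR1"

print(is_parquet("/apps/hive/warehouse/bicore.db/auditlogs_parquet/000000_0"))
```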
import pyarrow.parquet as pq

pq.read_table('file.parquet')   # optionally pass use_threads=True (nthreads in older PyArrow versions)
pq.ParquetDataset('myDir/')     # represents a directory of parquet files as a single dataset (call .read() to load)
It does not tolerate errors in the files though - see validate_parquet.py below.
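Pandas (mentioned above) wraps the same readers and writers - a minimal round-trip sketch, assuming pandas and PyArrow are installed and with made-up filenames and data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob"], "score": [1.5, 2.5]})

# uses the PyArrow (or fastparquet) engine under the hood
df.to_parquet("example.parquet")
df2 = pd.read_parquet("example.parquet")
print(df2)
```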
Get it from DevOps-Python-tools:
validate_parquet.py "$parquet_file" "$parquet_file2" ...
Recursively find and validate all parquet files in all directories under the current directory:
validate_parquet.py .
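A rough idea of what such validation involves - this is not the actual DevOps-Python-tools script, just a minimal sketch using PyArrow, with recursive directory walking omitted:

```python
import sys
import pyarrow.parquet as pq

def validate(path):
    # attempt a full read - corrupt or non-parquet files raise an exception
    try:
        pq.read_table(path)
        print(f"OK:     {path}")
        return True
    except Exception as e:
        print(f"FAILED: {path}: {e}")
        return False

if __name__ == "__main__":
    results = [validate(p) for p in sys.argv[1:]]
    sys.exit(0 if all(results) else 1)
```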
Ported from private Knowledge Base page 2014+