Popular columnar data storage format, widely used in Big Data and Analytics.
- columnar
- efficient for column-specific queries
- optimized for reads, more write overhead due to buffering (RAM+CPU)
- each data file contains the values for a set of rows
- schema evolution - limited - can only add columns at the end
- generally faster than ORC for reads (workload-dependent)
- compression - different codecs can be used for different columns, e.g. one for string columns, another for numeric columns (see the PyArrow sketch after this list)
- compression ratio is not as good as ORC's, but it is slightly faster
- widely used by many systems:
- Database-like systems, MPP and distributed SQL engines:
- Hive
- Impala
- Presto / Trino
- Drill
- Data processing frameworks:
- Spark
- Flink
- Pandas (via PyArrow or FastParquet)
- TensorFlow
- Cloud data warehousing platforms:
- BigQuery
- Snowflake
- AWS Athena / Redshift Spectrum
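A minimal sketch of the per-column compression point above, assuming PyArrow is installed - the filename, table, column names and codec choices are made up for illustration:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# hypothetical example table - column names and values are made up
table = pa.table({
    "name":  ["alice", "bob", "carol"],
    "score": [1.5, 2.5, 3.5],
})

# different compression codecs per column:
# snappy for the string column, gzip for the numeric column
pq.write_table(
    table,
    "example.parquet",
    compression={"name": "snappy", "score": "gzip"},
)
```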
Download and install Parquet Tools (the parquet-tools jar).
Or use the script from DevOps-Bash-tools, which automatically determines the latest version if no version is specified as the first arg:
download_parquet_tools.sh # "$version"
Run the Parquet Tools jar, downloading it first if it is not already present:
parquet_tools.sh <command>
You can run the jar directly; it's just a longer command:
PARQUET_VERSION="1.11.2"
java -jar "parquet-tools-$PARQUET_VERSION.jar" <command>
Then run the commands, each of which takes a parquet file as its argument:
parquet_tools.sh cat     # print the file's records
parquet_tools.sh dump    # dump both the records and the metadata
parquet_tools.sh head    # print the first few records
parquet_tools.sh meta    # print the file metadata
parquet_tools.sh schema  # print the schema
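The same kind of inspection can also be done from Python with PyArrow - a minimal sketch, roughly matching the meta and schema commands, assuming a local file named example.parquet (hypothetical filename):

```python
import pyarrow.parquet as pq

print(pq.read_metadata("example.parquet"))  # row count, row groups, format version, size
print(pq.read_schema("example.parquet"))    # column names and types
```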
Hive creates parquet files as:
/apps/hive/warehouse/bicore.db/auditlogs_parquet/000000_0
rather than:
blah.parquet
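One way to confirm that such an extension-less file really is Parquet is to check for the 4-byte magic number "PAR1" that Parquet files both start and end with - a minimal sketch, using the example Hive path above:

```python
def is_parquet(path):
    # Parquet files begin and end with the 4-byte magic number b"PAR1"
    with open(path, "rb") as f:
        if f.read(4) != b"PAR1":
            return False
        f.seek(-4, 2)  # seek to 4 bytes before the end of the file
        return f.read(4) == b"PAR1"

print(is_parquet("/apps/hive/warehouse/bicore.db/auditlogs_parquet/000000_0"))
```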
import pyarrow.parquet as pq

pq.read_table('file.parquet')   # optionally pass use_threads=True (nthreads in older PyArrow versions)
pq.ParquetDataset('myDir/')     # represents a directory of parquet files as a single dataset (call .read() to load)
It does not tolerate errors in the files though - see validate_parquet.py below.
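Pandas (mentioned above) wraps the same readers and writers - a minimal round-trip sketch, assuming pandas and PyArrow are installed and with made-up filenames and data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob"], "score": [1.5, 2.5]})

# uses the PyArrow (or fastparquet) engine under the hood
df.to_parquet("example.parquet")
df2 = pd.read_parquet("example.parquet")
print(df2)
```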
Get it from DevOps-Python-tools:
validate_parquet.py "$parquet_file" "$parquet_file2" ...
Recursively find and validate all parquet files in all directories under the current directory:
validate_parquet.py .
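A rough idea of what such validation involves - this is not the actual DevOps-Python-tools script, just a minimal sketch using PyArrow, with recursive directory walking omitted:

```python
import sys
import pyarrow.parquet as pq

def validate(path):
    # attempt a full read - corrupt or non-parquet files raise an exception
    try:
        pq.read_table(path)
        print(f"OK:     {path}")
        return True
    except Exception as e:
        print(f"FAILED: {path}: {e}")
        return False

if __name__ == "__main__":
    results = [validate(p) for p in sys.argv[1:]]
    sys.exit(0 if all(results) else 1)
```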
Ported from private Knowledge Base page 2014+