The explosion of graph data in social and biological networks, recommendation systems, provenance databases, etc. makes graph storage and processing of paramount importance.
PIG is a new graph benchmarking framework, which provides both a methodology for evaluating graph database performance and a mechanism to carry out such evaluations. It takes a hierarchical approach to benchmarking. The suite has three layers of benchmarks:
- Primitive operations such as reading and writing vertices and edges
- Composite access patterns such as extracting k-hop neighborhoods
- Graph algorithms such as shortest path finding and computing centrality metrics
This framework allows for comparisons between systems as well as single system introspection. Such introspection allows one to evaluate the degree to which systems exploit their knowledge of graph access patterns. The suite also comes with a web interface that makes it easy to run benchmarks and to visualize and analyze the collected data.
To run PIG, you will need:
- Linux, Java 1.6 or newer (JRockit JVM is highly recommended), Maven
- Blueprints Extensions
- Core Provenance Library
After installing all the prerequisites and checking out the source code of PIG, cd into the graphdb-bench directory and type:
mvn install
You can then start the web interface using:
./runWebInterfaceServer.sh
This will start a server on port 8080. Or you can run the benchmark tools directly from the command-line using:
./runBenchmarkSuite.sh`
Use the --help
option to get the list of available commands or +help to see advanced options and options for configuring the JVM.
To edit the configuration of PIG, please edit the following file:
graphdb-bench/src/main/resources/com/tinkerpop/bench/bench.properties`
You can also override many options using command-line arguments and/or the web interface.
You can generate your own datasets using fgftool distributed as a part of Blueprints Extensions (one of the prerequisites of PIG). You can also download datasets with up to 1 million nodes from here:
https://drive.google.com/folderview?id=0B3jkRHQ7nKvnbDhsWHBySVV6VVk&usp=sharing
Place the datasets in the directory specified in the configuration file. The default is data/datasets
in the project directory.
- Peter Macko, Daniel Margo, and Margo Seltzer. Performance Introspection of Graph Databases. 6th International Systems and Storage Conference (SYSTOR '13), Haifa, Israel, June 2013. (pdf)