# Cassandra TPC-DS
To use this, you also need to clone the tpcds-kit GitHub repo, which is used to generate the data. Both this repo and the tpcds-kit repo are expected to be cloned into the same parent directory, so that tpcds-kit/ and cassandra-tpcds/ sit side by side. If you use a different layout, you will need to update the Makefile accordingly.
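
The expected layout can be checked before building. The helper below is only a sketch: `check_layout` is a hypothetical name, not something shipped in this repo, and it only tests that the two directory names mentioned above exist under a given parent directory.

```shell
# check_layout [PARENT_DIR]
# Hypothetical helper: succeeds only if PARENT_DIR (default: current
# directory) contains both tpcds-kit/ and cassandra-tpcds/.
check_layout() {
    parent="${1:-.}"
    status=0
    for repo in tpcds-kit cassandra-tpcds; do
        if [ -d "$parent/$repo" ]; then
            echo "ok: $parent/$repo"
        else
            echo "missing: $parent/$repo" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, from inside cassandra-tpcds/ you could run `check_layout ..` and continue only if it succeeds.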

STEPS:

- Clone this repo:

      git clone

- Clone the tpcds-kit repo:

      git clone https://github.com/grahn/tpcds-kit.git

- Build the tpcds-kit binaries:

      cd tpcds-kit/tools
      make -f Makefile.suite

- Build the Java loader:

      cd ../../cassandra-tpcds
      make compile

- Generate the data. First set TPCDS_SCALE_FACTOR in the Makefile to the desired scale factor; it ships with the value 1, which is intended only for development purposes. Also set IP_ADDR in the Makefile to one of the endpoints of your Cassandra cluster.

      make data

- Create the keyspace and tables in Cassandra:

      make ddl

- Load the data into Cassandra:

      make loadall
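
The two Makefile settings mentioned in the steps above (TPCDS_SCALE_FACTOR and IP_ADDR) can also be changed from a script instead of by hand. This is a minimal sketch, not part of this repo: `set_make_var` is a hypothetical helper, and it assumes each variable is assigned on its own line in the form `NAME = value` (or `NAME=`, `NAME :=`, `NAME ?=`), which may not match the actual Makefile.

```shell
# set_make_var FILE NAME VALUE
# Hypothetical helper: rewrites the line in FILE that assigns NAME so that
# it assigns VALUE instead. Assumes a one-line assignment such as
# "NAME = value". VALUE must not contain '|' or '&', which are special in
# the sed replacement used here.
set_make_var() {
    file=$1
    name=$2
    value=$3
    # Fail loudly if there is no such assignment, rather than do nothing.
    if ! grep -Eq "^${name}[[:space:]]*[?:]?=" "$file"; then
        echo "no assignment to $name found in $file" >&2
        return 1
    fi
    # Write to a temporary file and move it into place (avoids sed -i,
    # whose syntax differs between GNU and BSD sed).
    tmp="$file.tmp.$$"
    sed -E "s|^(${name}[[:space:]]*[?:]?=).*|\1 ${value}|" "$file" > "$tmp" \
        && mv "$tmp" "$file"
}
```

For example, `set_make_var Makefile TPCDS_SCALE_FACTOR 10` followed by `set_make_var Makefile IP_ADDR 10.0.0.5`, where both values are placeholders. If the Makefile uses ordinary variable assignments, GNU make also lets you override them for a single run without editing the file, for example `make data TPCDS_SCALE_FACTOR=10`.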

## Queries

The queries are Shark queries taken from the impala-tpcds-kit GitHub repo: https://github.com/cloudera/impala-tpcds-kit/tree/master/queries-sql92-modified/queries/shark
They are queries 3, 7, 19, 27, 34, 42, 43, 46, 52, 53, 55, 59, 63, 65, 68, 73, 79, 89, 98, and ss_max.
Versions of these queries, slightly modified to reference tables by database name rather than by bare table name, are in queries_full.
To use these queries, run:

    dse spark
    use tpcds;
    source queries/q19.sql;

You can run whichever query you want this way from within the Shark shell.
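
To run the numbered queries one after another, the `source` lines can be generated rather than typed. This is a sketch under assumptions: `print_source_commands` is a hypothetical helper, and it assumes every numbered query follows the `q<N>.sql` naming seen in `queries/q19.sql`. The ss_max query is left out because its file name is not given here.

```shell
# Query numbers as listed in this README.
QUERY_NUMBERS="3 7 19 27 34 42 43 46 52 53 55 59 63 65 68 73 79 89 98"

# print_source_commands [DIR]
# Hypothetical helper: prints one "source DIR/q<N>.sql;" line per query
# number. DIR defaults to "queries". Assumes the q<N>.sql naming scheme.
print_source_commands() {
    dir="${1:-queries}"
    for n in $QUERY_NUMBERS; do
        echo "source $dir/q$n.sql;"
    done
}
```

The printed lines can be pasted into the shell after `use tpcds;`. Passing a different directory, for example `print_source_commands queries_full`, assumes that directory uses the same file names.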