This is the homework for PingCAP Talent Plan Online of week 4. This homework is a simplified version of ACM SIGMOD Programming Contest 2018.
The task is to evaluate batches of join queries on a set of pre-defined relations. Each join query specifies two relations, multiple (equality) join predicates, and one (sum) aggregation. The challenge is to fully utilize the CPU and memory resources and execute the queries as fast as possible.
NOTE: go 1.12 is required
The simple interface Join(f0, f1 string, offset0, offset1 []int) (sum uint64)
is defined in join.go
. Our test harness will feed two relations and two columns' offsets array to the interface every time, and check the correctness of the output result. Explaination to the four input arguments and one output argument of the interface are list as follows:
- f0: File name of the given relation0.
- f1: File name of the given relation1.
- offset0: Offsets of which columns the given relation0 should be joined.
- offset1: Offsets of which columns the given relation1 should be joined.
- sum (output argument): Sum of the relation0.col0 in the final join result.
The (equality) join predicates are specified by the offset0/1
. The form of the join predicates is like:
relation0.cols[offset[0]] = relation1.cols[offset[0]] and relation0.cols[offset[1]] = relation1.cols[offset[1]]...
Example: Join("/path/T0", "/path/T1", []int{0, 1}, []int{2, 3})
Translated to SQL:
SELECT SUM(T0.COL0)
FROM T0, T1
ON T0.COL0=T1.COL2 AND T0.COL1=T1.COL3
We provide a sample as join_example.go: JoinExample
which performs a simple hash join algorithm. It uses the relation0 to build the hash table, and probe the hash table for every row in relation1.
- (30%) Pass all test cases.
- (20%) Perform better than
join_example.go:JoinExample
. - (20%) Profile your program with
pprof
, analyze the performance bottleneck. - (15%) Keep a good code style.
- (15%) Document your idea and code.
Note:
- For your check sums, you do not have to worry about numeric overflows as long as you are using 64 bit unsigned integers.
- More large datasets are provided here, you can use them to help profile your program.
- We'll use the
BenchmarkJoin
andBenchmarkJoinExample
which can be found inbenchmark_test.go
to evaluate your program. Test data will NOT be outside of what we've provided.
- Please implement your own
Join
in join.go to accomplish this task. - We provide CSV versions (.tbl files) of three relations in directory
t
. You can load them into a DBMS to test your program.- r0.tbl: 2 columns * 10,000 records
- r1.tbl: 4 columns * 5,000 records
- r2.tbl: 4 columns * 500 records
- There is already a built-in unit test
JoinTest
defined injoin_test.go
. You can write your own unit tests, but please make sureJoinTest
can be passed before. - Use
make test
to run all the unit tests. - Use
make bench
to run all the benchmarks.