Please implement the code for the following tasks and submit a zip file via Canvas. Make sure to write your code so that it would work in a real distributed execution on Hadoop. You can use the unit tests to debug and validate your implementation. Note that a successful unit test execution does not necessarily mean that your solution is 100% correct.
Make sure that your submitted code compiles.
Implement a sparse vector backed by a hashmap in the class SparseVector. Next, please implement a distributed matrix vector multiplication via a broadcast-join in the class SparseMatrixVectorMultiplication, analogous to the implementation from our exercises. Please use a dense representation for all vectors with a sparsity of less than 50% and a sparse representation otherwise.
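A hashmap-backed sparse vector could look roughly like the following sketch. The class name `SparseVectorSketch` and its methods are illustrative assumptions, not the assignment's actual `SparseVector` interface; the `density()` method shows one way to decide between the dense and sparse representations, and `dot()` shows the per-row work a broadcast-join matrix-vector multiplication would do against a broadcast dense vector.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a hashmap-backed sparse vector; the real
// SparseVector class in the assignment may require a different interface.
class SparseVectorSketch {
    private final Map<Integer, Double> entries = new HashMap<>();
    private final int dimension;

    SparseVectorSketch(int dimension) {
        this.dimension = dimension;
    }

    void set(int index, double value) {
        if (value == 0.0) {
            entries.remove(index); // keep the map sparse: store non-zeros only
        } else {
            entries.put(index, value);
        }
    }

    double get(int index) {
        return entries.getOrDefault(index, 0.0);
    }

    // Fraction of non-zero entries; comparing this against the 50%
    // threshold suggests whether to keep the sparse or dense representation.
    double density() {
        return (double) entries.size() / dimension;
    }

    // Dot product with a broadcast dense vector, iterating only over
    // the stored non-zero entries.
    double dot(double[] dense) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : entries.entrySet()) {
            sum += e.getValue() * dense[e.getKey()];
        }
        return sum;
    }
}
```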
You can test your implementation with the following unit tests:
./run_docker.sh mvn -Dtest=nl.uva.bigdata.hadoop.assignment2.SparseMatrixVectorMultiplicationLocalTest test
./run_docker.sh mvn -Dtest=nl.uva.bigdata.hadoop.assignment2.SparseMatrixVectorMultiplicationClusterTest test
In this task, we stop using Hadoop and implement our own (local) MapReduce engine in the class MapReduceEngine. This reverses the previous tasks: now we are given the map and reduce implementations for word counting, and we have to implement the underlying engine according to the three phases of execution in MapReduce.
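The three phases the engine has to run could be sketched as follows for word counting. This is a minimal single-threaded illustration under assumed names (`LocalMapReduceSketch`, `wordCount`), not the assignment's MapReduceEngine API, which takes the map and reduce functions as parameters.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Minimal local sketch of the three MapReduce phases, specialized to
// word counting; the real engine would accept arbitrary map/reduce functions.
class LocalMapReduceSketch {

    static Map<String, Integer> wordCount(List<String> lines) {
        // Phase 1: map -- emit a (word, 1) pair for every token.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            for (String token : line.split("\\s+")) {
                if (!token.isEmpty()) {
                    mapped.add(new AbstractMap.SimpleEntry<>(token, 1));
                }
            }
        }

        // Phase 2: shuffle -- group all emitted values by their key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }

        // Phase 3: reduce -- sum the grouped counts per word.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int v : entry.getValue()) {
                sum += v;
            }
            result.put(entry.getKey(), sum);
        }
        return result;
    }
}
```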
You can test your implementation with the following unit test:
./run_docker.sh mvn -Dtest=nl.uva.bigdata.hadoop.assignment2.MapReduceEngineTest test
Your final task is to implement distributed linear regression (as discussed in class) on top of your own MapReduce engine in the class DistributedLinearRegression. Compute outer products in the mapper, sum up the intermediate results in the reducer, and solve the corresponding linear system.
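The numerical core of this approach could be sketched as below: each mapper contributes the outer products x·xᵀ and the vectors x·y for its examples, the reducer sums these partial results into XᵀX and Xᵀy, and the driver solves the normal equations (XᵀX)·β = Xᵀy. The class and method names here are assumptions for illustration; Gaussian elimination is used as one possible solver for the small linear system.

```java
// Sketch of the normal-equations computation behind distributed linear
// regression; names like accumulate/solve are illustrative, not the
// assignment's API.
class LinearRegressionSketch {

    // "Mapper/reducer" work: add one example's outer product x*x^T into
    // xtx and its contribution x*y into xty.
    static void accumulate(double[][] xtx, double[] xty, double[] x, double y) {
        for (int i = 0; i < x.length; i++) {
            xty[i] += x[i] * y;
            for (int j = 0; j < x.length; j++) {
                xtx[i][j] += x[i] * x[j]; // outer product contribution
            }
        }
    }

    // Driver: solve the d x d system a * beta = b via Gaussian elimination
    // with partial pivoting (modifies a and b in place).
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        for (int col = 0; col < n; col++) {
            // pick the row with the largest pivot for numerical stability
            int pivot = col;
            for (int row = col + 1; row < n; row++) {
                if (Math.abs(a[row][col]) > Math.abs(a[pivot][col])) {
                    pivot = row;
                }
            }
            double[] tmpRow = a[col]; a[col] = a[pivot]; a[pivot] = tmpRow;
            double tmp = b[col]; b[col] = b[pivot]; b[pivot] = tmp;
            // eliminate the column below the pivot
            for (int row = col + 1; row < n; row++) {
                double factor = a[row][col] / a[col][col];
                for (int k = col; k < n; k++) {
                    a[row][k] -= factor * a[col][k];
                }
                b[row] -= factor * b[col];
            }
        }
        // back-substitution
        double[] beta = new double[n];
        for (int i = n - 1; i >= 0; i--) {
            double sum = b[i];
            for (int k = i + 1; k < n; k++) {
                sum -= a[i][k] * beta[k];
            }
            beta[i] = sum / a[i][i];
        }
        return beta;
    }
}
```

For instance, fitting the points (1, 3), (2, 5), (3, 7) with features [1, x] recovers the intercept 1 and slope 2.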
You can test your implementation with the following unit test:
./run_docker.sh mvn -Dtest=nl.uva.bigdata.hadoop.assignment2.DistributedLinearRegressionTest test