Polystore implementation inspired by http://wp.sigmod.org/?p=1629
Getting a consolidated view across data sources for better analytics. An example use case is looking at the MovieLens data from the perspective of actual population distribution. This helps us understand whether some area (zip code) has a higher user-to-population ratio, which is a better popularity indicator than the plain number of users.
A seamless interface to disparate systems is tough. To understand the complexity, refer to Michael Stonebraker's blog post. As the types of data and data sources have exploded, the most interesting things can be pursued through a global view of the data. But as each data type and source works as an isolated island, getting this consolidated view is really tough. The original idea and the subsequent work by BigDAWG are great. If we focus just on data joins for analytics, we can relax some of the constraints mentioned in the BigDAWG guiding tenets and still accomplish a lot in an easier way with Spark, and specifically Catalyst. This implementation is an attempt at that.
Anyone interested in data. "Data is the new oil" or any of the umpteen cliched lines and their target audience with a liberal dose of big/fast/I-have-no-clue data.
The first use case with MovieLens and US population data will be up by 8th May 2016. Of course, this is a projection from me, a person with a very high rate of failure and a proven track record of goofing up on all personal project work.
Arrive at the user:population ratio for any zip code.
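To make the goal concrete, here is a minimal sketch of the ratio calculation in plain Python. It is not this project's Spark code; the sample rows, the field names (`user_id`, `zip`), the population figures, and the helper `user_population_ratio` are all made up for illustration.

```python
from collections import Counter

# Hypothetical sample rows. Real input would come from the MovieLens user data
# and a US population-by-zip data set; these values are invented.
users = [
    {"user_id": 1, "zip": "48067"},
    {"user_id": 2, "zip": "70072"},
    {"user_id": 3, "zip": "48067"},
    {"user_id": 4, "zip": "55117"},
]
population = {"48067": 20000, "70072": 50000}


def user_population_ratio(users, population):
    """Return {zip: users / population} for zips present in both sources."""
    users_per_zip = Counter(u["zip"] for u in users)
    return {
        z: count / population[z]
        for z, count in users_per_zip.items()
        if population.get(z)  # skip zips with unknown or zero population
    }


ratios = user_population_ratio(users, population)
for z, r in sorted(ratios.items()):
    print(f"{z}: {r:.6f}")
```

Zips that appear in only one of the two sources are dropped, which is the behaviour of an inner join. The same shape should carry over to Spark as a join on zip code followed by a grouped count.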
All experiments depend on a data folder, which is not checked in. If you want to run them, create a folder called data and add the following data sets:
Tools is another folder which is not checked in; check out
Why not MIMIC II? For non-researchers like me, getting to that data is tough, but the novelty of that data set is in the different types of data that it already has.