Upload to GitHub Co-clustering MatLab code and convert it to Python to run on Spark. #26

Open
EIzquierdo opened this issue Jun 26, 2017 · 9 comments

@EIzquierdo

No description provided.

@romulogoncalves changed the title from "Upload on GitHub Co-clustering MatLab code" to "Upload to GitHub Co-clustering MatLab code and convert it to Python to run on Spark." on Jun 26, 2017
@romulogoncalves (Contributor)

The code should be uploaded first and then converted to Scala or Python to run on Spark.
The idea is to exploit the matrix management offered by Spark.
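To make the idea concrete, here is a minimal PySpark sketch of the distributed matrix support meant here, assuming the data fits the mllib distributed-matrix types (all names and the toy data below are illustrative, not project code):

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

spark = SparkSession.builder.appName("cocluster-matrix-sketch").getOrCreate()
sc = spark.sparkContext

# Build a small distributed matrix from indexed rows (toy data).
rows = sc.parallelize([IndexedRow(0, [1.0, 2.0]),
                       IndexedRow(1, [3.0, 4.0])])
mat = IndexedRowMatrix(rows)

# BlockMatrix provides distributed transpose and multiply, the core
# operations a co-clustering loop needs.
block = mat.toBlockMatrix()
gram = block.multiply(block.transpose())
print(gram.toLocalMatrix())
```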

@romulogoncalves (Contributor)

@EIzquierdo the code should be uploaded to a branch different from the existing ones, i.e., applications and master. You can call it co_clustering. There we will do all the development of the code to run it on Spark afterwards. Once it is stable, we will merge it into master.

@rzuritamilla (Contributor)

@EIzquierdo @romulogoncalves
The co-clustering code is also available in R: https://github.com/fnyanez/bbac
Can we use SparkR to run it on our cluster without having to convert it to Python or Scala?
Raul

PS: there are other co-clustering methods on GitHub... Perhaps we should check https://github.com/jbendahan/spark_cocluster

@romulogoncalves (Contributor)

@raul
Sure, for me that means less work, but aren't the co-clustering algorithms different? I am not sure what Emma did in MATLAB. @EIzquierdo, what do you think?

The R version of it is a nice exercise and can be used as a SparkR example.
However, the R code I see runs everything in memory where the R kernel is running; we will need to make some changes to have it exploit Spark, i.e., to push some computations down to Spark (see the sketch below).

I will study the other one, in Python. I need to see how the data is distributed and accessed.
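To illustrate what "pushing computations down to Spark" means in practice, a minimal PySpark sketch with toy data and illustrative column names: the aggregation runs on the executors, and only the small result comes back to the driver.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pushdown-sketch").getOrCreate()

# Toy table: one value per (row-cluster, value) pair.
df = spark.createDataFrame([(1, 2.0), (1, 4.0), (2, 6.0)],
                           ["row_cluster", "value"])

# Computed by the Spark executors, not in the local R/Python session;
# only the per-cluster means are returned.
means = df.groupBy("row_cluster").agg(F.avg("value").alias("mean_value"))
means.show()
```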

@rzuritamilla (Contributor)

@romulogoncalves:
I believe that the R and the MATLAB code are identical.
(FYI we did not develop the co-clustering algorithm in MATLAB)

I think that it would be nice to test SparkR after adapting the R code so that some of the computations can be pushed down to Spark (I naively believed that you could just call any R function and that SparkR would make the required adjustments...).

I think that studying the Python example might bring some inspiration on how to adapt the MATLAB code (if we do not want to use the R version, or if that gets complicated).
Also, @EIzquierdo and I should check how this algorithm differs from the one that we used in the past.

@romulogoncalves (Contributor)

@EIzquierdo did you already check whether the algorithm differs or not?
We need to know before we can start integrating the R code into Spark.

@EIzquierdo (Author)

@romulogoncalves yes, I have checked it. Scheme 2, which is the scheme that we have used, is the same as in MATLAB.

@romulogoncalves (Contributor)

I am now trying to make the R code (scheme 2) run in distributed mode.

I decided to use Spark DataFrames because they are equivalent to R and Python DataFrames. The transposed matrix we used for K-means is saved as a Parquet file, so it can be loaded in R and Python as a DataFrame (see the sketch below).
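As an illustration, loading such a Parquet file back as a DataFrame is a one-liner in PySpark (the path below is hypothetical; SparkR has an equivalent read.parquet):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-matrix-sketch").getOrCreate()

# Hypothetical path to the transposed K-means matrix saved as Parquet.
matrix_df = spark.read.parquet("hdfs:///data/kmeans_matrix_transposed.parquet")
matrix_df.printSchema()
```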

To work in distributed mode we will have to run the matrix computations over the Spark DataFrames. However, that is hard to achieve in R due to the limited API. For example, the similarity_measure function needs an empty DataFrame, which is not yet supported by SparkR.

We need to implement it either in Scala or Python.
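For comparison, creating an empty DataFrame from a schema alone, the operation missing from the SparkR API, is supported in PySpark. A minimal sketch with illustrative column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

spark = SparkSession.builder.appName("empty-df-sketch").getOrCreate()

# Illustrative schema for a similarity table; PySpark accepts an empty
# row list plus a schema, which SparkR currently does not.
schema = StructType([
    StructField("row_id", IntegerType(), False),
    StructField("similarity", DoubleType(), True),
])
empty_df = spark.createDataFrame([], schema)
empty_df.printSchema()  # zero rows, well-defined columns
```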

@rzuritamilla (Contributor)

Hi @romulogoncalves,

Thanks for the update! I am afraid that I do not understand all the details, but it looks like "everything" needs to be rewritten (and in that sense there is actually little gain in having the R code ready...).
Again, I do not understand all the details, but the empty DataFrame can be simulated by creating a frame full of zeros or NaNs (see the sketch below). By the way, this need is probably a legacy from MATLAB, which works faster if the size of the array (DataFrame) is defined a priori, instead of having to adjust the size of an array dynamically and on the fly...
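A sketch of that workaround in PySpark (sizes and column names are hypothetical): pre-fill a frame of the right shape with zeros or NaN instead of starting from a truly empty one.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("placeholder-df-sketch").getOrCreate()

n = 5  # hypothetical number of rows the algorithm will fill in
placeholder = spark.range(n).select(
    F.col("id").alias("row_id"),
    F.lit(0.0).alias("similarity"),         # simulate "empty" with zeros...
    F.lit(float("nan")).alias("sim_nan"),   # ...or with NaN, as suggested
)
placeholder.show()
```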
