Upload to GitHub Co-clustering MatLab code and convert it to Python to run on Spark. #26

Open
EIzquierdo opened this issue Jun 26, 2017 · 9 comments

@EIzquierdo

No description provided.

@romulogoncalves changed the title from "Upload on GitHub Co-clustering MatLab code" to "Upload to GitHub Co-clustering MatLab code and convert it to Python to run on Spark." on Jun 26, 2017
@romulogoncalves (Contributor)

The code should be uploaded first and then converted to Scala or Python to run on Spark.
The idea is to exploit the matrix management offered by Spark.
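To make the idea concrete, here is a minimal PySpark sketch of the distributed matrix support meant here, assuming the data fits the mllib distributed-matrix types (all names and the toy data below are illustrative, not project code):

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

spark = SparkSession.builder.appName("cocluster-matrix-sketch").getOrCreate()
sc = spark.sparkContext

# Build a small distributed matrix from indexed rows (toy data).
rows = sc.parallelize([IndexedRow(0, [1.0, 2.0]),
                       IndexedRow(1, [3.0, 4.0])])
mat = IndexedRowMatrix(rows)

# BlockMatrix provides distributed transpose and multiply, the core
# operations a co-clustering loop needs.
block = mat.toBlockMatrix()
gram = block.multiply(block.transpose())
print(gram.toLocalMatrix())
```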

@romulogoncalves (Contributor)

@EIzquierdo the code should be uploaded to a branch different from the existing ones, i.e., applications and master. You can call it co_clustering. There we will do all the development of the code to run it on Spark afterwards. Once it is stable, we will merge it into master.

@rzuritamilla (Contributor)

@EIzquierdo @romulogoncalves
The co-clustering code is also available in R: https://github.com/fnyanez/bbac
Can we use SparkR to run it on our cluster without having to convert it to Python or Scala?
Raul

PS: there are other co-clustering methods on GitHub... Perhaps we should check https://github.com/jbendahan/spark_cocluster

@romulogoncalves (Contributor)

@raul
Sure, for me that means less work, but aren't the co-clustering algorithms different? I am not sure what Emma did in MATLAB. @EIzquierdo, what do you think?

The R version of it is a nice exercise and can be used as a SparkR example.
However, the R code I see runs everything in memory where the R kernel is running; we will need to make some changes to have it exploit Spark, i.e., to push some computations down to Spark (see the sketch below).

I will study the other one, in Python. I need to see how the data is distributed and accessed.
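To illustrate what "pushing computations down to Spark" means in practice, a minimal PySpark sketch with toy data and illustrative column names: the aggregation runs on the executors, and only the small result comes back to the driver.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pushdown-sketch").getOrCreate()

# Toy table: one value per (row-cluster, value) pair.
df = spark.createDataFrame([(1, 2.0), (1, 4.0), (2, 6.0)],
                           ["row_cluster", "value"])

# Computed by the Spark executors, not in the local R/Python session;
# only the per-cluster means are returned.
means = df.groupBy("row_cluster").agg(F.avg("value").alias("mean_value"))
means.show()
```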

@rzuritamilla (Contributor)

@romulogoncalves:
I believe that the R and the MATLAB code are identical.
(FYI we did not develop the co-clustering algorithm in MATLAB)

I think that it would be nice to test SparkR after adapting the R code so that some of the computations can be pushed down to Spark (I naively believed that you could just call any R function and that SparkR would make the required adjustments...).

I think that studying the Python example might bring some inspiration on how to adapt the MATLAB code (if we do not want to use the R version, or if that gets complicated).
Also, @EIzquierdo and I should check how this algorithm differs from the one that we used in the past.

@romulogoncalves (Contributor)

@EIzquierdo did you already check whether the algorithm differs or not?
We need to know before we can start integrating the R code into Spark.

@EIzquierdo (Author)

@romulogoncalves yes, I have checked it. Scheme 2, which is the scheme that we have used, is the same as in MATLAB.

@romulogoncalves (Contributor)

I am now trying to make the R code (scheme 2) run in distributed mode.

I decided to use Spark DataFrames because they are equivalent to R and Python DataFrames. The transposed matrix we used for K-means is saved as a Parquet file, so it can be loaded in R and Python as a DataFrame (see the sketch below).
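As an illustration, loading such a Parquet file back as a DataFrame is a one-liner in PySpark (the path below is hypothetical; SparkR has an equivalent read.parquet):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-matrix-sketch").getOrCreate()

# Hypothetical path to the transposed K-means matrix saved as Parquet.
matrix_df = spark.read.parquet("hdfs:///data/kmeans_matrix_transposed.parquet")
matrix_df.printSchema()
```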

To work in distributed mode we will have to run the matrix computations over the Spark DataFrames. However, that is hard to achieve in R due to the limited API. For example, the similarity_measure function needs an empty DataFrame, which is not yet supported by SparkR.

We need to implement it either in Scala or Python.
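For comparison, creating an empty DataFrame from a schema alone, the operation missing from the SparkR API, is supported in PySpark. A minimal sketch with illustrative column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

spark = SparkSession.builder.appName("empty-df-sketch").getOrCreate()

# Illustrative schema for a similarity table; PySpark accepts an empty
# row list plus a schema, which SparkR currently does not.
schema = StructType([
    StructField("row_id", IntegerType(), False),
    StructField("similarity", DoubleType(), True),
])
empty_df = spark.createDataFrame([], schema)
empty_df.printSchema()  # zero rows, well-defined columns
```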

@rzuritamilla (Contributor)

Hi @romulogoncalves,

Thanks for the update! I am afraid that I do not understand all the details, but it looks like "everything" needs to be rewritten (and in that sense there is actually little gain in having the R code ready...).
Again, I do not understand all the details, but the empty DataFrame can be simulated by creating a frame full of zeros or NaNs (see the sketch below). By the way, this need is probably a legacy from MATLAB, which works faster if the size of the array (DataFrame) is defined a priori, instead of having to adjust the size of an array dynamically and on the fly...
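A sketch of that workaround in PySpark (sizes and column names are hypothetical): pre-fill a frame of the right shape with zeros or NaN instead of starting from a truly empty one.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("placeholder-df-sketch").getOrCreate()

n = 5  # hypothetical number of rows the algorithm will fill in
placeholder = spark.range(n).select(
    F.col("id").alias("row_id"),
    F.lit(0.0).alias("similarity"),         # simulate "empty" with zeros...
    F.lit(float("nan")).alias("sim_nan"),   # ...or with NaN, as suggested
)
placeholder.show()
```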
