Upload the co-clustering MATLAB code to GitHub and convert it to Python to run on Spark. #26
The code should be uploaded, but then converted to Scala or Python to run on Spark.
@EIzquierdo the code should be uploaded to a different branch from the existing ones (applications and master). You can call it co_clustering. There we will do all the development needed to run the code on Spark. Once it is stable, we will merge it into master.
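A minimal sketch of that branch workflow (the MATLAB file name is hypothetical):

```sh
git checkout master
git checkout -b co_clustering          # new development branch off master
git add coclustering.m                 # hypothetical name for the MATLAB file
git commit -m "Add co-clustering MATLAB code"
git push -u origin co_clustering

# once the Spark port is stable:
git checkout master
git merge co_clustering
git push origin master
```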
@EIzquierdo @romulogoncalves PS: there are other co-clustering implementations on GitHub... Perhaps we should check https://github.com/jbendahan/spark_cocluster
@raul The R version of it is a nice exercise and can be used as an example of SparkR. I will study the Python one; I need to see how the data is distributed and accessed.
@romulogoncalves: I think it would be nice to test SparkR after adapting the R code so that some of the computations can be pushed down to Spark (I naively believed that you could just call any R function and SparkR would make the required adjustments...). I also think that studying the Python example might bring some inspiration on how to adapt the MATLAB code (whether we use the R version or that route gets complicated).
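To illustrate the pushdown distinction in Spark's Python API (SparkR behaves analogously): only operations expressed through the DataFrame API are executed by Spark's engine, while an arbitrary native function is not automatically distributed. A toy sketch:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pushdown_demo").getOrCreate()
df = spark.createDataFrame([(0, 1.0), (0, 3.0), (1, 2.0)], ["cluster", "value"])

# Pushed down: expressed in the DataFrame API, executed by Spark itself
means = df.groupBy("cluster").agg(F.mean("value").alias("mean_value"))
means.show()

# Not pushed down: collecting to the driver and applying an arbitrary
# local function; Spark cannot distribute this computation
local = [row["value"] for row in df.collect()]
local_mean = sum(local) / len(local)
```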
@EIzquierdo did you already check whether the algorithm differs or not?
@romulogoncalves yes, I have checked it. Scheme 2, which is the scheme we have used, is the same as in MATLAB.
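For reference, a minimal NumPy sketch of block-average co-clustering (alternating row/column reassignment against co-cluster means); this is only one common scheme and may not match the exact "scheme 2" of the MATLAB code:

```python
import numpy as np

def cocluster(Z, k, l, n_iter=20, seed=0):
    """Alternately reassign row and column clusters to minimize
    squared error against the current co-cluster (block) means."""
    rng = np.random.default_rng(seed)
    n, m = Z.shape
    R = rng.integers(k, size=n)  # row cluster labels
    C = rng.integers(l, size=m)  # column cluster labels
    for _ in range(n_iter):
        # block means mu[r, c] over the current co-clusters
        mu = np.zeros((k, l))
        for r in range(k):
            for c in range(l):
                block = Z[np.ix_(R == r, C == c)]
                mu[r, c] = block.mean() if block.size else 0.0
        # reassign each row to the row cluster with least squared error
        R = np.array([np.argmin([((Z[i] - mu[r, C]) ** 2).sum()
                                 for r in range(k)]) for i in range(n)])
        # reassign each column analogously
        C = np.array([np.argmin([((Z[:, j] - mu[R, c]) ** 2).sum()
                                 for c in range(l)]) for j in range(m)])
    return R, C
```

For example, `cocluster(np.random.rand(100, 40), k=5, l=4)` returns the row and column cluster labels.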
I have now been trying to make the R code (scheme 2) run in distributed mode. I decided to use Spark DataFrames because they are equivalent to R and Python DataFrames. The transposed matrix we used for KMeans is saved as a Parquet file, so it can be loaded in R and Python as a DataFrame. To work in distributed mode we will have to run the matrix computations over the Spark DataFrames. However, that is hard to achieve in R due to its limited API; for example, the similarity_measure function needs an empty DataFrame, which is not yet supported by SparkR. We need to implement it either in Scala or Python.
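A sketch of that pipeline in PySpark (the path, column layout, and k are assumptions): load the transposed matrix from Parquet, assemble the feature vector, and run KMeans. The last lines show that an empty DataFrame with an explicit schema, the piece SparkR lacks, is straightforward in Python:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("co_clustering").getOrCreate()

# Load the transposed matrix saved as Parquet (hypothetical path)
df = spark.read.parquet("hdfs:///data/matrix_transposed.parquet")

# KMeans expects a single vector column; assemble all numeric columns
assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
features = assembler.transform(df)

model = KMeans(k=10, seed=42, featuresCol="features").fit(features)
clustered = model.transform(features)  # adds a 'prediction' column

# Creating an empty DataFrame with a schema, e.g. for similarity_measure,
# is supported in PySpark (unlike SparkR at the time of writing)
schema = StructType([StructField("value", DoubleType(), True)])
empty = spark.createDataFrame([], schema)
```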
Hi @romulogoncalves, thanks for the update! I am afraid I do not understand all the details, but it looks like "everything" needs to be rewritten (and in that sense there is actually little gain in having the R code ready...)