[SPARK-5280] RDF Loader added + documentation #4650
Conversation
TODO: test + comment
Test build #27643 has finished for PR 4650 at commit
Style errors fixed.
Test build #27680 has finished for PR 4650 at commit
Please add tests for your RDF loader, and see my code as an example. BTW, I think it would be better to separate the interface from the implementations, e.g. o.a.spark.graphx.impl.loader.RDFLoader. Thoughts?
Will add tests soon. I was also thinking about making one interface for different loaders (with a load() method instead of edgeListFile()) and maybe a facade combining all loaders.
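A minimal sketch of what such a common interface could look like; the trait and method names below are hypothetical and not part of this PR:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.Graph

// Hypothetical common interface for GraphX loaders (illustrative only):
// each concrete loader (edge-list, RDF, ...) would implement load().
trait GraphFileLoader[VD, ED] {
  /** Build a Graph from the file(s) at `path` using the given SparkContext. */
  def load(sc: SparkContext, path: String): Graph[VD, ED]
}
```

A facade could then dispatch to the right implementation based on the input format.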
Added a test + a test file (small excerpt from DBpedia 3.9) + a small fix in RDFLoader.
Test build #27822 has finished for PR 4650 at commit
Test build #27817 has finished for PR 4650 at commit
Test build #27823 has finished for PR 4650 at commit
Test build #27825 has finished for PR 4650 at commit
Test build #27826 has finished for PR 4650 at commit
@maropu tests are added and build tests passed. Is it ready for merging now?
I would really like to have this. Is it going to be merged?
+1, is this going to be merged?
@emir-munoz @blankdots This PR is totally stale, so it would be better to refactor it if you're interested.
Thanks @maropu, I will take a closer look at it.
A Spark plugin seems like a much better approach. I've done some experimentation on a plugin for this, which seems like a much cleaner and more lightweight approach, though I have no idea if it will ever be open sourced (it is work for my employer) or move beyond the experimentation stage. I would strongly suggest looking at leveraging existing libraries like Apache Jena to do a lot of the work for you and avoid having to implement your own RDF parsers. Personally I am using the Apache Jena Elephas modules for this, since the Hadoop input formats can simply be used by Spark. (Disclaimer: I am a committer on Apache Jena and the main contributor to the Elephas modules.)
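To illustrate the suggestion above, here is a rough sketch of reading RDF into an RDD of Jena triples through an Elephas input format; it assumes the jena-elephas-io artifact is on the classpath and that its TriplesInputFormat/TripleWritable classes are used:

```scala
import org.apache.hadoop.io.LongWritable
import org.apache.jena.graph.Triple
import org.apache.jena.hadoop.rdf.io.input.TriplesInputFormat
import org.apache.jena.hadoop.rdf.types.TripleWritable
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch: parse RDF with Jena Elephas instead of a hand-rolled parser.
def loadTriples(sc: SparkContext, path: String): RDD[Triple] =
  sc.newAPIHadoopFile(
      path,
      classOf[TriplesInputFormat],
      classOf[LongWritable],
      classOf[TripleWritable],
      sc.hadoopConfiguration)
    .map { case (_, t) => t.get() } // unwrap the writable to a Jena Triple
```

From such an RDD of triples one could then derive vertex and edge RDDs and build a GraphX Graph.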
@maropu @emir-munoz I am also considering taking a look at it.
Can one of the admins verify this patch? |
```scala
def gethash(in: String): Long = {
  // seed with a large prime, then fold in each character
  // (String.hashCode style, widened to 64 bits)
  var h = 1125899906842597L
  for (x <- in) {
    h = 31 * h + x
  }
  h
}
```

This hash function used to calculate the vertexId seems to be weak. For example, the hash values of these URIs are the same (-3127496886112549146).
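As one possible remedy for the collisions noted above, a 64-bit MurmurHash3 of the URI string could be used instead; this sketch is not part of the PR and assumes Guava is available on the classpath:

```scala
import java.nio.charset.StandardCharsets
import com.google.common.hash.Hashing

// Hypothetical replacement: derive the vertexId from a 128-bit murmur3 hash,
// truncated to 64 bits, which collides far less often than the loop above.
def vertexId(uri: String): Long =
  Hashing.murmur3_128().hashString(uri, StandardCharsets.UTF_8).asLong()
```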
@lukovnikov if there is still interest in this, the best approach would be to first release something on spark-packages.org as a set of utilities to create Graphs. Using existing 3rd-party Hadoop formats makes the most sense, as per @rvesse. Could you close this PR please?
+1 to making this a Spark package. I would recommend that we close this PR since it's gone stale. |
I have been testing it with DBpedia dumps; it works well so far.
Any help with custom partitioning and optimization is welcome.
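On the partitioning point, one lightweight starting place would be GraphX's built-in partition strategies; this is only an illustrative sketch, with `graph` assumed to be the result of the RDF loader:

```scala
import org.apache.spark.graphx.{Graph, PartitionStrategy}

// Illustrative helper: repartition the edges of a loaded graph with one of
// GraphX's built-in strategies (here EdgePartition2D) before running queries.
def repartitionEdges[VD, ED](graph: Graph[VD, ED]): Graph[VD, ED] =
  graph.partitionBy(PartitionStrategy.EdgePartition2D)
```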